Machine Learning, Artificial Intelligence, Data Science... It can be difficult to navigate the jargon of the "data world". We have put together for you the ten must-know concepts that will help you understand these topics.
0. Data set
Let's start with the raw material 👉 the data 👈.
A dataset is a set of observations 👀 . These observations can be sound recordings, photographs, textual documents or even characteristics of observed individuals (height, weight, age for example) etc.
In a well-structured dataset, the observations all have the same format and can thus be processed in the same way 👍.
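To make this concrete, here is a minimal sketch of a well-structured dataset in Python; the field names (height, weight, age) are illustrative, echoing the example above:

```python
# Each observation has the same format (the same keys),
# so every row can be processed the same way.
dataset = [
    {"height_cm": 172, "weight_kg": 68, "age": 34},
    {"height_cm": 181, "weight_kg": 80, "age": 27},
    {"height_cm": 165, "weight_kg": 59, "age": 41},
]

# Because every observation shares one shape, one line of code
# can process them all, e.g. computing the average age.
average_age = sum(obs["age"] for obs in dataset) / len(dataset)
print(average_age)  # → 34.0
```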
0 bis. Algorithm
An algorithm is a finite and unambiguous sequence of operations that solves a problem 🤯 .
We call input the set of starting information, and output the result of the algorithm.
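A toy illustration of this idea (our own example, not from a library): the input is a list of numbers, the output is their maximum, obtained by a finite, unambiguous sequence of steps.

```python
def maximum(numbers):
    result = numbers[0]          # start from the first value
    for n in numbers[1:]:        # examine each remaining value once
        if n > result:
            result = n           # keep the largest value seen so far
    return result

print(maximum([3, 1, 4, 1, 5]))  # → 5
```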
1. Machine Learning / Artificial Intelligence
Artificial Intelligence is the broad field of making machines perform tasks that normally require human intelligence. Machine Learning is the subfield of Artificial Intelligence in which algorithms learn their behaviour automatically from data, rather than being explicitly programmed.
2. Data Scientist / Data Engineer / Data Analyst
- the Data Scientist is the most mathematical of the three: his job is to design algorithms that learn automatically from data;
- the Data Engineer is the one who makes the data accessible: he structures it, integrates the work of the Data Scientist into the software, ensures quality control, manages the infrastructure, deploys new versions, handles maintenance, etc.;
- the Data Analyst, like the Data Scientist, works on data, but without the objective of producing an algorithm: he seeks to extract information that humans can use in decision making. To achieve this, he must be able to present his analyses as clearly as possible using Data Visualisation tools. His background in mathematics may be lighter than the Data Scientist's, and his background in programming and infrastructure lighter than the Data Engineer's; but the Data Analyst absolutely must be a good communicator, with strong synthesis and presentation skills.
3. Training or learning
Training (or learning) is the phase during which the algorithm adjusts its internal parameters based on the observations in the training data, so that it can then make predictions on new data.
4. Overfitting
Overfitting (or overlearning) occurs when a model sticks too closely to the peculiarities of its training data: it scores very well on the data it has seen, but performs poorly on new observations.
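A minimal sketch of the phenomenon, using no libraries and made-up toy data: a "model" that simply memorises its training examples is perfect on the training set but cannot generalise to an unseen input.

```python
train = {1: 2, 2: 4, 3: 6}   # toy data following y = 2x

def memorising_model(x):
    # Returns the memorised answer, or a default for unseen inputs.
    return train.get(x, 0)

# Perfect on training data...
train_errors = [abs(memorising_model(x) - y) for x, y in train.items()]
# ...but wrong on a new observation (the true answer would be 8).
test_error = abs(memorising_model(4) - 8)
print(sum(train_errors), test_error)  # → 0 8
```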
5. Training set / Validation set / Testing set
- the training set 🤨 is used for learning: this is the only dataset that should be given to the algorithms, and the one the Data Scientist must rely on to design his model;
- the validation set 👍 is there to evaluate the models as they are learned and to compare different or differently parameterised algorithms. A dataset separate from the training set is needed to avoid favouring algorithms that rely on the peculiarities of the training set (i.e. overfitting);
- the testing set 💪 is there to evaluate the chosen model once, at the very end, and assign it a reliable score. The validation set cannot be used for this, as it served to choose the best algorithm (in other words, the algorithm was chosen because of it); there would therefore be a risk of overfitting 🛑.
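The three-way split above can be sketched in a few lines of plain Python; the 70/15/15 proportions are illustrative, not a rule from the article.

```python
import random

random.seed(0)                      # reproducible shuffle
data = list(range(100))             # 100 toy observations
random.shuffle(data)                # shuffle before splitting

train_set = data[:70]               # used to fit the model
validation_set = data[70:85]        # used to compare models/settings
testing_set = data[85:]             # used once, for the final score

print(len(train_set), len(validation_set), len(testing_set))  # → 70 15 15
```

In practice a Data Scientist would typically use a library helper for this, but the principle is the same: the three sets must never overlap.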

6. Supervised / unsupervised learning
In supervised learning, the algorithm learns from observations accompanied by labels (the expected answers); in unsupervised learning, it must find structure in the data on its own, without labels.
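A toy contrast between the two settings, with made-up data: the supervised model sees labelled pairs (x, y), while the unsupervised one only sees the values x.

```python
# Labelled examples: a value and its label.
labelled = [(1.0, "small"), (1.2, "small"), (9.8, "large"), (10.1, "large")]
unlabelled = [x for x, _ in labelled]   # the same data, labels removed

# Supervised (1-nearest-neighbour): predict the label of the closest example.
def predict(x):
    return min(labelled, key=lambda pair: abs(pair[0] - x))[1]

# Unsupervised: group the values around a threshold found from the data alone.
threshold = (min(unlabelled) + max(unlabelled)) / 2
clusters = [0 if x < threshold else 1 for x in unlabelled]

print(predict(1.1), clusters)  # → small [0, 0, 1, 1]
```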
7. Feature / Feature engineering
A feature is a measurable property of an observation, used as input by the algorithm. Feature engineering is the work of building new, more informative features from the raw data.
👉 For example: let's take a dataset consisting of individuals. Eye colour 👀 and weight are very simple features. The average number of vegetables 🥦 consumed per week or the median daily commuting time 🛴 are other, more complex features.
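The example above can be sketched in code; the field names and numbers are purely illustrative. A simple feature is read directly from the data, while an engineered feature is computed from raw observations:

```python
individual = {
    "weight_kg": 70,                         # simple feature, read as-is
    "vegetables_per_week": [5, 7, 6, 6],     # raw weekly observations
}

# Feature engineering: aggregate the raw data into a single number.
avg_vegetables = (
    sum(individual["vegetables_per_week"])
    / len(individual["vegetables_per_week"])
)
print(avg_vegetables)  # → 6.0
```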

8. Neural network / Deep Learning
A neural network is a model made of many simple computing units (artificial neurons) organised in layers, each neuron combining its inputs and passing the result on. Deep Learning refers to the use of networks with many such layers.
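A single artificial neuron can be sketched from scratch: a weighted sum of the inputs plus a bias, passed through an activation function. The weights below are arbitrary, chosen only for illustration; Deep Learning stacks many layers of such units.

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, shifted by the bias...
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    # ...passed through a sigmoid activation, giving a value in (0, 1).
    return 1 / (1 + math.exp(-z))

print(round(neuron([1.0, 2.0], [0.5, -0.25], 0.1), 3))  # → 0.525
```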
9. Data leak
A data leak occurs when information from outside the training set (for example from the validation or testing set) inadvertently reaches the model during training. The model then obtains deceptively good scores that will not hold up on genuinely new data.

10. Data Lake / Data Warehouse
A data lake is a storage space intended to accommodate all the data that it may be useful to accumulate.
It stores raw 🔧 as well as structured 💿 data, unlike the Data Warehouse, which is the clean version of the Data Lake: only preprocessed and structured 💿💿💿 data is stored there, ready for future use.
——
For tips, analyses or news about Data Science and Machine Learning, follow me on LinkedIn: Marc Sanselme