10 Concepts to Understand Machine Learning

Machine Learning, Artificial Intelligence, Data Science... It can be difficult to navigate the jargon of the "data world". We have put together the ten must-know concepts that will help you understand these topics.

0. Data set

Let's start with the raw material 👉 the data 👈.

A dataset is a set of observations 👀. These observations can be sound recordings, photographs, text documents, or characteristics of observed individuals (height, weight or age, for example).

In a well-structured dataset, the observations all have the same format and can thus be processed in the same way 👍.
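
👉 As a minimal sketch of what "same format" means: the observations below are invented, and the helper names `is_well_structured` and `mean_of` are hypothetical, chosen only for this illustration.

```python
# Hypothetical observations: each one has the same fields in the same format.
dataset = [
    {"height_cm": 172, "weight_kg": 68, "age": 34},
    {"height_cm": 158, "weight_kg": 55, "age": 27},
    {"height_cm": 185, "weight_kg": 90, "age": 45},
]


def is_well_structured(observations):
    """Return True if every observation has exactly the same fields."""
    if not observations:
        return True
    expected = set(observations[0])
    return all(set(obs) == expected for obs in observations)


def mean_of(observations, field):
    """Average one field over all observations (possible because the format is shared)."""
    return sum(obs[field] for obs in observations) / len(observations)
```

Because every observation has the same fields, a single function such as `mean_of(dataset, "age")` can process all of them in the same way.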

0 bis. Algorithm

An algorithm is a finite and unambiguous sequence of operations that solves a problem 🤯.

The starting information is called the input, and the result of the algorithm is called the output.
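
👉 A small illustration, assuming the problem to solve is "find the largest number in a list" (the function name `largest` is just a label for this sketch). The input is the list, the output is the largest value, and every step is written by hand by the programmer.

```python
def largest(values):
    """Input: a non-empty list of numbers. Output: the largest of them."""
    best = values[0]          # start with the first value as the best so far
    for value in values[1:]:  # look at each remaining value exactly once
        if value > best:
            best = value
    return best
```

The sequence is finite (one pass over the list) and unambiguous (each step says exactly what to do).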

1. Machine Learning / Artificial Intelligence
Machine Learning is a branch of algorithmics in which some of the operations are not written by the programmer but arise from an automated statistical observation phase 🚀 of a dataset: learning (or training).
 
Artificial Intelligence is a vaguer concept on which there is less consensus 🤷‍♀️. The term is very often used as a synonym for Machine Learning.
2. Data Scientist / Data Engineer / Data Analyst
These new data professions have shifting boundaries that depend on cultures, companies and even individuals. No exact definition has everyone's agreement. But it is safe to say that:
  • the Data Scientist is the most mathematical of the three: their job is to design algorithms through automatic data analysis (learning);
  • the Data Engineer is the one who makes the data accessible, structures it, integrates the work of the Data Scientist into the software, ensures quality control, manages the infrastructure, deploys new versions, handles maintenance, and so on;
  • the Data Analyst, like the Data Scientist, works on data, but without the objective of getting an algorithm out of it: they seek to extract information that humans can use in decision making. To best achieve this, they must be able to present their analyses as clearly as possible using Data Visualisation tools. Their background in mathematics may be lighter than that of the Data Scientist, and their background in programming and infrastructure lighter than that of the Data Engineer; but the Data Analyst must absolutely be a good communicator with strong synthesis and presentation skills.
3. Training or learning
This is the core of Machine Learning 💓. It is the phase in which an algorithm will search for a sequence of operations that best achieves a given goal for each point in a dataset.
👉 For example: if I am trying to evaluate the price of an apartment based on a set of criteria such as surface area, location and floor number, the algorithm will seek, during the learning phase, to perform operations based on these criteria with the aim of getting as close as possible to the price 🧿 for each of the apartments listed in the dataset.
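
Here is a minimal sketch of such a learning phase. It makes several simplifying assumptions: only one criterion (surface area) is used, the model is a straight line fitted by least squares, and the surfaces and prices are invented. The names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Learning phase: find the slope and intercept that bring
    slope * x + intercept as close as possible to y (least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented dataset: surface in square metres, price in thousands.
surfaces = [30, 45, 60, 80, 100]
prices = [150, 210, 290, 370, 480]

# The two numbers below are not written by the programmer: they come from the data.
slope, intercept = fit_line(surfaces, prices)


def predict(surface):
    """Use the learned operations to estimate the price of an apartment."""
    return slope * surface + intercept
```

Once training is done, `predict(70)` gives an estimate for an apartment that was never in the dataset.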
4. Overfitting
An algorithm overfits (or "overlearns") a dataset when it adapts to the specific quirks of that dataset 👎 rather than to the general distribution of the data.
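
👉 A small sketch of the effect, on invented numbers. It assumes the general rule is y = 2x and that the training points carry some noise. Model A (`memoriser`) memorises every training point, model B (`line`) only learns a straight line. All names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares straight line: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average distance between the model's answers and the true values."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Invented data: training points are noisy, new points follow y = 2x.
train_x, train_y = [1, 2, 3, 4], [2.5, 3.5, 6.5, 7.5]
new_x, new_y = [1.2, 2.2, 3.2, 4.2], [2.4, 4.4, 6.4, 8.4]

# Model A: memorises the training set and answers with the closest known point.
memory = dict(zip(train_x, train_y))


def memoriser(x):
    closest = min(memory, key=lambda known: abs(known - x))
    return memory[closest]


# Model B: only learns a straight line, so it cannot copy the noise.
slope, intercept = fit_line(train_x, train_y)


def line(x):
    return slope * x + intercept


memoriser_train_error = mean_abs_error(memoriser, train_x, train_y)
memoriser_new_error = mean_abs_error(memoriser, new_x, new_y)
line_train_error = mean_abs_error(line, train_x, train_y)
line_new_error = mean_abs_error(line, new_x, new_y)
```

The memoriser is perfect on the data it has seen (error of zero) but worse than the line on new data (about 0.5 against about 0.2 here): it has learned the noise of this particular dataset, which is what overfitting means.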
5. Training set / Validation set / Testing set
A crucial rule in any development that relies on data is to randomly split the data into three subsets: a training set, a validation set and a testing set.
  • the training set 🤨 is used for learning: it is the only subset that the algorithms may see during training, and the only one the Data Scientist should rely on to design their model;
  • the validation set 👍 is used to evaluate the models as they are trained and to compare different algorithms or different parameter settings. It must be separate from the training set so as not to favour algorithms that rely on the peculiarities of the training set (i.e. overfitting);
  • the testing set 💪 is used to evaluate the chosen model once, at the end, and to assign it a reliable score. The validation set cannot be used for this because it served to choose the best algorithm (in other words, the algorithm was selected to suit it), so the score would risk being inflated by overfitting 🛑.
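
The random split described above can be sketched as follows. The 70 % / 15 % / 15 % proportions, the fixed seed and the function name `split_dataset` are assumptions made for this illustration, not a recommendation.

```python
import random


def split_dataset(observations, train_ratio=0.7, validation_ratio=0.15, seed=0):
    """Shuffle the observations, then cut them into three disjoint subsets."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split can be reproduced

    n_train = round(len(shuffled) * train_ratio)
    n_validation = round(len(shuffled) * validation_ratio)

    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    testing_set = shuffled[n_train + n_validation:]  # whatever is left
    return training_set, validation_set, testing_set


# Usage on 100 dummy observations numbered 0 to 99.
training_set, validation_set, testing_set = split_dataset(range(100))
```

Each observation ends up in exactly one subset, so nothing seen during training can also be used for validation or testing.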
6. Supervised / unsupervised learning
Learning is called supervised when, during training, the algorithm is provided with the "correct answer" to the question it seeks to answer.
👉 For example: If I want to train an algorithm to guess the age of a person from a photo, I'll give it a set of photos, but also the age of each person pictured, so it can learn to guess. This is supervised learning.
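
A very reduced sketch of the idea, under strong assumptions: each photo is replaced by two invented numeric scores (share of grey hair, wrinkle score), and the "learning" is a simple nearest-neighbour lookup rather than anything a real age-estimation system would use. The name `predict_age` is hypothetical.

```python
# Labelled examples: (features describing the photo, correct answer = age).
# Both the features and the ages are invented for this sketch.
labelled_photos = [
    ((0.0, 0.1), 20),
    ((0.5, 0.5), 45),
    ((0.9, 0.8), 70),
]


def predict_age(features, examples):
    """Answer with the age of the most similar labelled example."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, closest_age = min(examples, key=lambda ex: squared_distance(ex[0], features))
    return closest_age
```

The ages stored next to each example are the "correct answers": without them, `predict_age` would have nothing to return.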
 
Learning is called unsupervised when no "correct answer" is provided and the algorithm is simply asked to group the data based on their proximity.
👉 For example: If I'm looking to group users by taste profiles based on the videos they've watched on a platform, I'm not going to provide any description of each person's tastes, I'm going to let the viewing similarities emerge and the categories form from the data. This is unsupervised learning.
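
One common way to let such groups emerge is k-means clustering; the sketch below is a naive version of it, not necessarily what a video platform would use. The viewing hours are invented, the starting centres are simply the first k points (a simplification), and the names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=10):
    """Group points into k clusters by proximity. Returns one cluster index per point."""
    centres = [points[i] for i in range(k)]  # naive initialisation: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 1: attach each point to its closest centre.
        labels = [
            min(range(k), key=lambda c: squared_distance(point, centres[c]))
            for point in points
        ]
        # Step 2: move each centre to the average of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels


# Invented users: (hours of sport videos, hours of cooking videos) watched.
users = [(9, 1), (8, 2), (10, 0), (1, 9), (2, 8), (0, 10)]
groups = k_means(users, k=2)
```

No taste profile was given to the algorithm: the first three users and the last three end up in two different groups only because their viewing habits are close to each other.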
7. Feature / Feature engineering
A feature is a characteristic of the elements of a dataset. It results from "feature engineering" 😮, i.e. the design of features by the Data Scientist before training.
👉 For example: let's take a dataset consisting of individuals. Eye colour 👀 and weight are very simple features. The average number of vegetables 🥦 consumed per week or the median daily commuting time 🛴 are other, more complex features.
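
As a sketch, assuming the raw data for one individual looks like the invented record below, feature engineering could turn it into the four features mentioned above. The field names and the function `build_features` are hypothetical.

```python
from statistics import mean, median

# Invented raw record for one individual.
person = {
    "eye_colour": "brown",
    "weight_kg": 70,
    "vegetables_per_week": [5, 7, 6, 6],              # one count per observed week
    "commute_minutes_per_day": [30, 35, 28, 90, 32],  # one duration per observed day
}


def build_features(raw):
    """Turn a raw record into the features given to the learning algorithm."""
    return {
        # Simple features: copied as they are.
        "eye_colour": raw["eye_colour"],
        "weight_kg": raw["weight_kg"],
        # More complex features: computed from several raw values.
        "avg_vegetables_per_week": mean(raw["vegetables_per_week"]),
        "median_commute_minutes": median(raw["commute_minutes_per_day"]),
    }


features = build_features(person)
```

Choosing the median rather than the mean for the commute is itself a design decision: the unusual 90-minute day barely affects it.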
8. Neural network / Deep Learning
A neural network is a type of Machine Learning algorithm inspired by the functioning of the brain 🧠. It relies on a succession of layers of neurons (the more layers, the deeper the network is said to be; hence the notion of "deep" learning 👌).
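
👉 To make the "succession of layers" concrete, here is a minimal sketch of how a value travels through a tiny network with one hidden layer and one output layer. The weights are set by hand for the illustration; in a real network they would be found during training. The ReLU activation and all the names are assumptions of this sketch.

```python
def relu(x):
    """A common activation: negative values become 0."""
    return max(0.0, x)


def identity(x):
    return x


def layer(inputs, weights, biases, activation):
    """Each neuron computes a weighted sum of its inputs, adds a bias, then activates."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the values through the layers one after the other."""
    values = inputs
    for weights, biases, activation in layers:
        values = layer(values, weights, biases, activation)
    return values


# Hand-written weights (normally learned): 2 inputs -> 2 hidden neurons -> 1 output.
network = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),  # hidden layer
    ([[1.0, 2.0]], [0.5], identity),                # output layer
]

output = forward([2.0, 1.0], network)  # → [4.5]
```

Adding more entries to `network` adds more layers, which is all that "deeper" means here.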
9. Data leak
A data leak is the presence in the training set of information that would not be available ⛔️ when solving the real problem. In other words, the algorithm is allowed to cheat 😨 during training, so its results cannot be trusted.
👉 For example: I want to train a model that differentiates between dogs 🐶 and wolves 🐺. All the wolf pictures at my disposal are wolves in the snow 🐺❄️ and the dog pictures are dogs indoors 🐶🏠. I have a data leak: in this dataset, the presence of snow in a photo is enough to indicate a wolf (❄️ = 🐺 ❓).
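
A toy sketch of how this goes wrong, assuming each photo has already been reduced to two invented yes/no features. The "model" is deliberately simplistic: it keeps whichever single feature best separates wolves from dogs in the training set. The names `train` and `predict` are hypothetical.

```python
# Invented training photos: (features, label). Every wolf is in the snow.
training_photos = [
    ({"snow": 1, "pointed_ears": 1}, "wolf"),
    ({"snow": 1, "pointed_ears": 1}, "wolf"),
    ({"snow": 1, "pointed_ears": 0}, "wolf"),
    ({"snow": 0, "pointed_ears": 0}, "dog"),
    ({"snow": 0, "pointed_ears": 1}, "dog"),
    ({"snow": 0, "pointed_ears": 0}, "dog"),
]


def predict(feature_name, features):
    """Rule used by the model: feature present means wolf, otherwise dog."""
    return "wolf" if features[feature_name] else "dog"


def train(photos):
    """Keep the feature whose rule gets the most training photos right."""
    def accuracy(name):
        hits = sum(predict(name, features) == label for features, label in photos)
        return hits / len(photos)

    return max(photos[0][0].keys(), key=accuracy)


chosen_feature = train(training_photos)

# A new photo: a dog playing in the snow.
dog_in_snow = {"snow": 1, "pointed_ears": 0}
answer = predict(chosen_feature, dog_in_snow)
```

On the training set the snow rule is never wrong, so it is the one the model keeps; the dog in the snow is then classified as a wolf.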
10. Data Lake / Data Warehouse

A data lake is a storage space intended to accommodate all the data that it may be useful to accumulate.

It stores raw 🔧 as well as structured 💿 data, unlike the Data Warehouse, which is the clean version of the Data Lake: only preprocessed and structured 💿💿💿 data is stored there for future use.
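
👉 As a rough illustration only (real data lakes and warehouses are storage systems, not Python lists), the difference can be sketched like this. The records are invented and `to_warehouse_rows` is a hypothetical name.

```python
import json

# The "lake" accepts everything as it arrives: valid records, free text, incomplete records.
data_lake = [
    '{"user": "ana", "age": "34"}',
    "free-text note from a support call",
    '{"user": "bob", "age": "27"}',
    '{"user": "eve"}',
]


def to_warehouse_rows(raw_items):
    """Keep only what can be cleaned into the same structured format."""
    rows = []
    for item in raw_items:
        try:
            record = json.loads(item)
        except json.JSONDecodeError:
            continue  # not structured data: it stays in the lake only
        if isinstance(record, dict) and "user" in record and "age" in record:
            rows.append({"user": record["user"], "age": int(record["age"])})
    return rows


data_warehouse = to_warehouse_rows(data_lake)
```

The lake still holds the four original items, while the warehouse holds only the two rows that could be cleaned and structured.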

——

For tips, analytics or news about Data Science and Machine Learning, follow me on LinkedIn: Marc Sanselme
