10 Concepts to understand Machine Learning

Machine Learning, Artificial Intelligence, Data Science… It is sometimes difficult to navigate the jargon of the “data world”. Here are ten essential concepts that will help you follow conversations on these topics.

0. Data set

Let's start with the raw material: 👉 data 👈.

A dataset is a set of observations 👀. These observations can be sound recordings, photographs, textual documents, or even characteristics of observed individuals (height, weight and age, for example), etc.

In a well-structured dataset, the observations all have the same format and can therefore be processed in the same way 👍.
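To make this concrete, here is a minimal sketch in Python of a well-structured dataset (the observations and field names are invented for illustration):

```python
# A tiny made-up dataset: each observation is one person, and every
# observation has the same format (the same three characteristics).
dataset = [
    {"height_cm": 172, "weight_kg": 68, "age": 34},
    {"height_cm": 159, "weight_kg": 55, "age": 27},
    {"height_cm": 181, "weight_kg": 80, "age": 45},
]

# Because all observations share the same format, a single operation
# can be applied to all of them in the same way:
average_age = sum(obs["age"] for obs in dataset) / len(dataset)
```

Every row can be processed identically, which is exactly what "same format" buys you.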

0 bis. Algorithm / Input / Output

An algorithm is a finite and unambiguous sequence of operations to solve a problem 🤯. 

We call the initial information the input, and the result of the algorithm the output.
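A tiny example to fix the vocabulary (the function is just an illustration, not tied to anything in ML yet):

```python
def mean(values):
    """A tiny algorithm: a finite, unambiguous sequence of operations."""
    total = 0.0
    for v in values:            # the input: the initial information
        total += v
    return total / len(values)  # the output: the result of the algorithm

result = mean([2, 4, 6])  # input -> algorithm -> output
```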

1. Machine Learning / Artificial Intelligence
Machine Learning (or automatic learning) is a branch of algorithmics in which some of the operations are not written by the programmer but result from an automated statistical observation phase 🚀 of a dataset: learning (or training).
Artificial intelligence is a vaguer and less consensual concept 🤷‍♀️. This term is very often used as a synonym for Machine Learning.
2. Data Scientist / Data Engineer / Data Analyst
These new data professions have dynamic boundaries that depend on cultures, companies and even individuals. Exact definitions cannot satisfy everyone. But we can safely say that:
  • the Data Scientist is the most mathematical of the three: their job is to design algorithms through automatic data analysis (learning)
  • the Data Engineer is the one who makes the data accessible, structures it, integrates the work of the Data Scientist into the software, ensures quality control, and manages the infrastructure, the deployment of new versions, maintenance, etc.
  • the Data Analyst, like the Data Scientist, works on the data but without the objective of producing an algorithm: they seek to extract information for use by humans in decision making. To achieve this, they must be able to present their analyses as clearly as possible using Data Visualization. Their background in mathematics may be lighter than that of the Data Scientist, and their background in programming and infrastructure lighter than that of the Data Engineer; but the Data Analyst must absolutely be a good communicator with strong synthesis and presentation skills.
3. Training or learning
It's the heart of Machine Learning 💓. This is the phase during which an algorithm searches for a sequence of operations that best achieves a given objective for each point in a dataset.
👉 For example: if I seek to estimate the price of an apartment from a set of criteria such as surface area, municipality and floor, the algorithm will seek, during the learning phase, a combination of operations on these criteria that gets as close as possible to the price 🧿 for each of the apartments listed in the dataset.
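The apartment example can be sketched with a simple least-squares fit in NumPy (one possible learner among many; all the prices and surfaces below are invented):

```python
import numpy as np

# Invented observations: columns = [surface_m2, floor, 1 (intercept term)]
X = np.array([[30, 2, 1], [50, 1, 1], [70, 4, 1], [90, 3, 1]], dtype=float)
y = np.array([150_000, 240_000, 350_000, 430_000], dtype=float)

# The "learning" phase: find the coefficients that minimise the gap
# between the computed price and the listed price for every apartment.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

# Once trained, the same operations estimate the price of a new flat:
estimate = np.array([60, 2, 1], dtype=float) @ coeffs
```

The "sequence of operations" found here is just a weighted sum of the criteria; more complex algorithms search over much richer families of operations.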
4. Overfitting
An algorithm overlearns, or overfits, a dataset when it adapts to the specificities of this particular dataset 👎 rather than to this type of data.
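A classic way to see this, sketched with NumPy polynomial fits on made-up data: a model with too much freedom reproduces the training points perfectly, noise included.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 8)
y_train = 2 * x_train + rng.normal(0, 0.1, 8)  # a roughly linear trend + noise

# A degree-7 polynomial has enough freedom to pass through all 8 training
# points: it adapts to this particular dataset, noise included (overfitting).
# A degree-1 fit only captures the general trend of this *type* of data.
overfit_coeffs = np.polyfit(x_train, y_train, 7)
simple_coeffs = np.polyfit(x_train, y_train, 1)

train_err_overfit = np.max(np.abs(np.polyval(overfit_coeffs, x_train) - y_train))
train_err_simple = np.max(np.abs(np.polyval(simple_coeffs, x_train) - y_train))
# The degree-7 fit is near-perfect on the training points, yet it has
# memorised this dataset's noise rather than the underlying trend.
```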
5. Training set / Validation set / Testing set
A crucial point of any development that relies on data: it is imperative to randomly separate the data into 3 subsets: a training set, a validation set and a testing set.
  • the training set 🤨 is used for learning: this is the only dataset that must be given to the algorithms, and the one on which the Data Scientist must rely to design their model.
  • the validation set 👍 is there to assess the models as learning progresses and to compare different or differently parameterized algorithms. We need a dataset separate from the training set to be sure not to favour algorithms that rely on the particularities of the training set (i.e. that overfit).
  • the testing set 💪 is there to evaluate the chosen model only once, at the end, and to assign it a reliable score. The validation set cannot be used for this because it was used to choose the best algorithm (in other words, the chosen algorithm is adapted to it); there would therefore be a risk of overfitting 🛑.
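The three-way split above can be sketched in a few lines of NumPy (the 60/20/20 proportions are a common convention, not a rule):

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs = 100
shuffled = rng.permutation(n_obs)  # random order, so each subset is unbiased

# An arbitrary (but common) 60 / 20 / 20 split:
train_idx = shuffled[:60]   # for learning only
val_idx = shuffled[60:80]   # for comparing models during development
test_idx = shuffled[80:]    # for the final score, used exactly once
```

The key property is that the three index sets are disjoint: no observation used for learning or model selection ever contributes to the final score.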
6. Supervised / unsupervised learning
We talk about supervised learning when, during training, we provide the algorithm with the “right answer” to the question we seek to answer.
👉 For example: if I want to train an algorithm to guess the age of a person from a photo, I will provide it with a set of photos, but also the age of each person represented, so that it can learn to guess. This is supervised learning.
We talk about unsupervised learning when we simply ask the algorithm to group data based on their proximity.
👉 For example: if I am looking to group users by taste profiles based on the videos they have watched on a platform, I am not going to provide any description of each person's tastes; I will let similarities in viewing emerge and the categories form from the data. This is unsupervised learning.
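The video-platform example can be sketched with a bare-bones k-means clustering in NumPy (the "viewing profiles" below are invented, and real systems use far richer representations):

```python
import numpy as np

# Invented "viewing profiles": fraction of time spent on two video genres.
users = np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],   # mostly genre A
                  [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])  # mostly genre B

# A minimal k-means (k = 2): no labels are given; groups emerge purely
# from the proximity between observations.
centers = users[[0, 3]].copy()          # initial guesses for the 2 centers
for _ in range(10):
    dists = np.linalg.norm(users[:, None] - centers[None], axis=2)
    labels = dists.argmin(axis=1)       # assign each user to nearest center
    centers = np.array([users[labels == k].mean(axis=0) for k in range(2)])
```

Nobody told the algorithm what "genre A fans" are; the two groups fall out of the distances alone.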
7. Feature / Feature engineering
A feature is a characteristic of the elements of a dataset. It is the result of “feature engineering” 😮, that is to say, its design by the Data Scientist prior to learning.
👉 For example: let's take a dataset made up of individuals. Eye color 👀 and weight are very simple features. The average number of vegetables 🥦 consumed per week or the median daily travel time 🛴 are other, more complex features.
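A small sketch of feature engineering in Python (all observations and field names are made up for illustration):

```python
# Made-up raw observations about individuals.
people = [
    {"height_cm": 172, "weight_kg": 68, "veggie_meals": [2, 3, 1, 4, 2, 3, 2]},
    {"height_cm": 159, "weight_kg": 55, "veggie_meals": [0, 1, 0, 2, 1, 0, 1]},
]

# Feature engineering: deriving new, potentially more informative
# features from the raw ones, before any learning takes place.
for person in people:
    person["bmi"] = person["weight_kg"] / (person["height_cm"] / 100) ** 2
    person["veggies_per_week"] = sum(person["veggie_meals"])
```

The learning algorithm then sees `bmi` and `veggies_per_week` as ordinary features; it never knows they were computed rather than measured.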
8. Neural network / Deep Learning
A neural network is a type of Machine Learning algorithm inspired by how the brain works 🧠. It is based on a succession of layers of neurons (the more layers there are, the deeper the network is said to be; hence the notion of “deep” learning 👌).
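The "succession of layers" can be sketched as a forward pass in NumPy (the weights below are arbitrary; in practice they are found during training):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # a common non-linearity between layers

# A tiny two-layer network with made-up weights: each layer multiplies
# its input by a weight matrix, adds a bias, then applies a non-linearity.
W1, b1 = np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, 0.1])
W2, b2 = np.array([[1.0], [-1.0]]), np.array([0.5])

def forward(x):
    hidden = relu(x @ W1 + b1)  # first layer of "neurons"
    return hidden @ W2 + b2     # output layer; deeper nets stack more layers

out = forward(np.array([2.0, 1.0]))
```

Stacking more `W, b` pairs (more layers) is exactly what makes the network "deeper".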
9. Data leaks
A data leak denotes the presence, in the training set, of information that we are not supposed to have ⛔️ in the problem to be solved. In other words, we allow the algorithm to cheat 😨 during training, as a result of which it cannot be trusted.
👉 For example: I want to train a model that differentiates dogs 🐶 and wolves 🐺. All the wolf photos I have available show wolves in the snow 🐺❄️ and all the dog photos show dogs indoors 🐶🏠. I have a data leak: in this dataset, the presence of snow in a photo reveals that it is a wolf ❄️=🐺❓.
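The dog/wolf leak can be caricatured in a few lines of Python (the dataset and the "classifier" are deliberately silly, to show why a perfect score can be meaningless):

```python
# Hypothetical illustration: the "has_snow" attribute leaks the label.
photos = [
    {"has_snow": 1, "label": "wolf"},
    {"has_snow": 1, "label": "wolf"},
    {"has_snow": 0, "label": "dog"},
    {"has_snow": 0, "label": "dog"},
]

def leaky_classifier(photo):
    # A model that only reads the leaked information...
    return "wolf" if photo["has_snow"] else "dog"

# ...scores perfectly on this dataset without learning anything
# about dogs or wolves:
accuracy = sum(leaky_classifier(p) == p["label"] for p in photos) / len(photos)
```

The model looks flawless in training, yet it will call any dog photographed in the snow a wolf.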
10. Data Lake / Data Warehouse

A Data Lake is a storage space intended to accommodate all the data that it may be useful to accumulate.

Data is stored there both raw 🔧 and structured 💿, unlike the Data Warehouse, which is the clean version of the Data Lake: only pre-processed and structured data 💿💿💿, ready for future use, is stored there.


Follow me on Linkedin for more content like this: Marc Sanselme
