# Concept definition
[Wikipedia's definition: ](https://en.wikipedia.org/wiki/Leakage_%28machine_learning%29)
> In statistics and machine learning, leakage (also known as data leakage or target leakage) is the **use of information in the model training process that would not be expected to be available at prediction time**, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.

> Leakage is often subtle and indirect, making it hard to detect and eliminate. Leakage can cause a statistician or modeler to select a suboptimal model, which could outperform a leakage-free model.


# When could data leakage occur in our project? 
The ***Product Recommendations*** application has been designed to avoid data leakage in its feature engineering steps. However, you will be free to extend this solution with your flow processing steps and enrichments in the product/services (SKUs) and locations datasets. Beware of data leakage when you will do this. 

*Example of data leakage that could occur in this project: Using customer segment/cluster information that has been computed with the duplicate rows used in the train set*