## Overview

In the [Defect prediction zone](flow_zone:dhNNDiZ), we train and score a model to predict defects.
![flow_zone_defect_prediction.png](CCjNphTjsrBE)
First, we train the model on the previously rebalanced [historical data](dataset:process-data-joined-historical-resampled).

Quality prediction datasets are usually very imbalanced, because the goal of a manufacturing process is to produce good parts. Class imbalance is a well-known problem, and there are several ways to handle it.

Here, we choose to downsample (or under-sample) the dataset to get more accurate results. Depending on your quality results and process, you might choose a different strategy. The [Resampling step](recipe:compute_Process_data_joined_quality_filtered) strongly shapes your modeling. In our case, we rebalance to an approximate 50/50 ratio, which is quite aggressive but works well. The biggest risk is that the model misses some behavior present in the discarded rows and over-detects defects.
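
To make the idea concrete, here is a minimal sketch of random downsampling to a 50/50 ratio in plain Python. It is not the logic of the project's resampling recipe: the `defect` label, the `part_id` field, and the toy data are hypothetical and only illustrate the principle.

```python
import random

def downsample_majority(rows, label_key="defect", seed=0):
    """Randomly drop majority-class rows until both classes have the same count."""
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    # The shorter list is the minority class, the longer one the majority class.
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept_majority = rng.sample(majority, len(minority))
    balanced = minority + kept_majority
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: 5 defective parts out of 100.
rows = [{"part_id": i, "defect": 1 if i < 5 else 0} for i in range(100)]
balanced = downsample_majority(rows)
```

In this toy case, 90 of the 95 good parts are discarded, which shows how much information an aggressive ratio can throw away.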

Another theoretical consideration: we kept the threshold that optimizes the F1 score, but keep in mind that, depending on your industrial context, you might accept more false positives (good parts predicted as defective), for example if the cost of quality control is low compared to the price of a part. A useful Dataiku tool for this trade-off is the [cost matrix](https://knowledge.dataiku.com/latest/courses/machine-learning/evaluate-model/evaluate-model.html#cost-matrix).
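
The trade-off can be illustrated with a small sketch that picks a threshold either by F1 score or by total cost. This is not how Dataiku computes its cost matrix: the labels, probabilities, and the costs `cost_fp` and `cost_fn` are hypothetical values chosen for illustration.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, FN, TN when predicting a defect for prob >= threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def negative_cost(tp, fp, fn, tn, cost_fp=1.0, cost_fn=20.0):
    # Assumption: an extra quality check costs 1, a missed defect costs 20.
    return -(fp * cost_fp + fn * cost_fn)

def best_threshold(y_true, y_prob, score_fn, thresholds):
    """Return the threshold that maximizes score_fn over the candidates."""
    return max(thresholds,
               key=lambda t: score_fn(*confusion_counts(y_true, y_prob, t)))

# Hypothetical labels (1 = defect) and predicted defect probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.5, 0.4, 0.6, 0.7, 0.9]
thresholds = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

f1_threshold = best_threshold(y_true, y_prob, f1, thresholds)
cost_threshold = best_threshold(y_true, y_prob, negative_cost, thresholds)
```

On this toy data, the F1-optimal threshold is 0.65, while the cost-based threshold drops to 0.35: it accepts two false positives in order to catch every defect.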

We trained several models and deployed the Random forest. Nevertheless, XGBoost performs very similarly, could be improved with more features, and is much lighter: in this example, XGBoost trained in 10s, whereas the Random forest took almost 50s.


