In this zone, each variable is processed through a prepare recipe and a couple of custom regression spline recipes. The data is also split into a train set and a test set: the train set is used to engineer the features and build the model, and the test set is kept for validation.

![Feature Processing.png](GajebPNF3F91)

# Initial Cleaning

The feature processing is inspired by the article referenced in [Resources](article:2). In the first [prepare recipe](recipe:compute_claim_frequency_cleaned), a few transformations are applied to the data:

Exposure is capped at 1 because the dataset covers policy claims over one year, so exposure should not exceed 1. A few rare outliers in ClaimNb have values above 4; these are capped at 4. Missing claim amounts are filled with 0.
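The project performs these steps in a visual prepare recipe. As a rough equivalent, here is a minimal pandas sketch of the same cleaning rules. The function name `initial_cleaning` and the column name `ClaimAmount` are assumptions made for illustration; `Exposure` and `ClaimNb` are the names used above.

```python
import numpy as np
import pandas as pd


def initial_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning rules described above (hypothetical helper)."""
    out = df.copy()
    # Policies cover one year, so exposure cannot exceed 1.
    out["Exposure"] = out["Exposure"].clip(upper=1)
    # Rare outliers above 4 claims are capped at 4.
    out["ClaimNb"] = out["ClaimNb"].clip(upper=4)
    # "ClaimAmount" is an assumed column name; missing amounts become 0.
    out["ClaimAmount"] = out["ClaimAmount"].fillna(0)
    return out


# Tiny made-up sample, only to show the effect of each rule.
sample = pd.DataFrame({
    "Exposure": [0.5, 1.2, 1.0],
    "ClaimNb": [0, 6, 1],
    "ClaimAmount": [np.nan, 1500.0, 300.0],
})
cleaned = initial_cleaning(sample)
print(cleaned)
```

In the sample, only the second row is affected by the two caps and only the first row by the missing-value fill.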

# Train Test Split

The dataset is split randomly into a train set and a test set, with 90% of the rows in the train set and the rest in the test set. The train set is used to build the graphs that drive the feature engineering and to fit the models. The test set is only used for performance assessment.
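For readers who want to reproduce the split outside the flow, the following is a minimal sketch of a random 90/10 split in pandas. The helper `random_split` and the seed value are assumptions, not the settings of the project's split recipe, and the approach assumes the DataFrame index is unique.

```python
import pandas as pd


def random_split(df: pd.DataFrame, train_fraction: float = 0.9, seed: int = 42):
    """Randomly split df into (train, test); hypothetical helper.

    Assumes df has a unique index, since the test set is built by
    dropping the sampled train index labels.
    """
    train = df.sample(frac=train_fraction, random_state=seed)
    test = df.drop(train.index)
    return train, test


# Made-up data with 100 rows, only to illustrate the proportions.
policies = pd.DataFrame({"PolicyId": range(100)})
train_set, test_set = random_split(policies)
print(len(train_set), len(test_set))
```

Fixing the seed keeps the split reproducible, so the graphs built on the train set and the scores measured on the test set always refer to the same rows.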

# Feature Analysis

In the second [prepare recipe](recipe:compute_claim_frequency_prepared), additional transformations are applied to the features to analyze their relationship with the targets. They are done on the training set only, so that the test results remain reliable.

 - The vehicle age is capped at 20: very few rows have values above 20, too few to be significant.
 - Similarly, driver age is bounded to the [20, 90] range, as there are very few observations above 90 or under 20 years old.
 - Bonus malus is the French term for No Claims Discount: the higher the value, the higher the risk. We cap its value at 150.
 - Density refers to the density of the area where the policyholder is located. It seems to be log-normally distributed, so taking the log of its value creates a more balanced distribution.
 - We also bin the values of bonus malus and log density to build aggregate metrics on significant chunks of data.
 - Operations made on the same variable are grouped together using the grouping capability.
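The transformations listed above can be sketched in pandas as follows. This is an illustration under stated assumptions rather than the recipe itself: the column names `VehAge`, `DrivAge`, `BonusMalus` and `Density`, the helper `prepare_features`, the use of the natural log, and both bin widths are hypothetical, since the text does not specify them. The claim frequency per bin (claims divided by exposure) is shown as one possible aggregate metric.

```python
import numpy as np
import pandas as pd


def prepare_features(train: pd.DataFrame,
                     bm_bin_width: int = 10,
                     log_density_bin_width: float = 1.0) -> pd.DataFrame:
    """Cap, transform and bin features (hypothetical helper, assumed bin widths)."""
    out = train.copy()
    out["VehAge"] = out["VehAge"].clip(upper=20)             # cap vehicle age at 20
    out["DrivAge"] = out["DrivAge"].clip(lower=20, upper=90)  # bound driver age to [20, 90]
    out["BonusMalus"] = out["BonusMalus"].clip(upper=150)     # cap bonus malus at 150
    # Natural log is an assumption; Density must be strictly positive.
    out["LogDensity"] = np.log(out["Density"])
    # Each bin is labelled by its lower edge.
    out["BonusMalusBin"] = (out["BonusMalus"] // bm_bin_width) * bm_bin_width
    out["LogDensityBin"] = (
        np.floor(out["LogDensity"] / log_density_bin_width) * log_density_bin_width
    )
    return out


# Made-up rows, only to show the effect of each step.
sample = pd.DataFrame({
    "VehAge": [3, 25, 20],
    "DrivAge": [18, 45, 95],
    "BonusMalus": [50, 173, 68],
    "Density": [1.0, 100.0, 27000.0],
    "ClaimNb": [0, 1, 0],
    "Exposure": [1.0, 0.5, 0.5],
})
prepared = prepare_features(sample)

# Example aggregate metric: claim frequency per bonus malus bin.
frequency_by_bin = (
    prepared.groupby("BonusMalusBin")[["ClaimNb", "Exposure"]]
    .sum()
    .assign(Frequency=lambda g: g["ClaimNb"] / g["Exposure"])
)
print(frequency_by_bin)
```

Aggregating claims and exposure before dividing gives an exposure-weighted frequency per bin, which is less noisy than averaging row-level ratios when many policies have short exposure.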

