The Feature Binning Zone handles the feature binning and encoding described in a [previous article](article:8). It contains three threads:

1. Dataset processing and encoding of the variables.
2. Numeric variables handling.
3. Categorical variables handling.

![Feature Binning.png](7Avt4QmCNgSC)

# Binning

The first [Python recipe](recipe:compute_applications_binned) is where the binning and encoding occur. Its parameters are set in the [Dataiku Application](article:15) through project variables. Categorical and numeric variables are treated separately, and a numeric variable is treated as categorical if its number of distinct values is lower than ```categorization_threshold```.
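The split between numeric and categorical variables can be sketched as follows. This is a minimal illustration, not the recipe's actual code: the function name `split_variable_types` and its arguments are hypothetical, and only the comparison against ```categorization_threshold``` comes from the description above.

```python
import pandas as pd


def split_variable_types(df, target, categorization_threshold):
    """Return (numeric, categorical) lists of column names.

    Hypothetical helper: a numeric column with fewer distinct values than
    categorization_threshold is treated as categorical.
    """
    numeric, categorical = [], []
    for col in df.columns:
        if col == target:
            continue  # the target is never binned
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        if is_numeric and df[col].nunique() >= categorization_threshold:
            numeric.append(col)
        else:
            categorical.append(col)
    return numeric, categorical
```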

Categorical variables are binned using the [scorecardpy library](https://github.com/ShichenXie/scorecardpy), with the ```share_bin``` constraint so that each final bin contains at least the predefined share of observations. The weight of evidence encoding is written to the binned dataset, the information value of each variable is output for later filtering, and the bin information is saved as a separate dataset.
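For reference, the weight of evidence (WoE) of a bin and the information value (IV) of a variable can be computed from the counts of good and bad observations per bin. The sketch below is a generic pandas illustration, not the code used by scorecardpy or by the recipe. It assumes a binary target where 1 is the bad class, and it uses the ln(bad share / good share) convention; some libraries use the inverse ratio, which flips the sign of the WoE but leaves the IV unchanged.

```python
import numpy as np
import pandas as pd


def woe_iv(bins, target):
    """Compute the WoE per bin and the IV of one binned variable.

    bins:   bin label of each observation
    target: 1 for bad, 0 for good (assumed encoding)
    """
    counts = pd.crosstab(pd.Series(bins), pd.Series(target))
    good_share = counts[0] / counts[0].sum()  # share of all goods in each bin
    bad_share = counts[1] / counts[1].sum()   # share of all bads in each bin
    # A bin with no goods or no bads gives an infinite WoE; minimum-share
    # constraints on the bins are one way to avoid that situation.
    woe = np.log(bad_share / good_share)
    iv = ((bad_share - good_share) * woe).sum()
    return woe, iv
```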

Numeric variables are binned monotonically using the [monotonic_woe_binning library](https://github.com/jstephenj14/Monotonic-WOE-Binning-Algorithm). Before the binning is computed, the sign of the relationship between the feature and the weight of evidence must be set. To do so, the feature is first grouped into quantiles, which gives a more robust estimate of the average target in each group, and a simple regression is then run on the feature to predict the target. The sign of the regression coefficient determines the sign of the monotonic relationship between the feature and the target. The monotonic binning algorithm is first run with the inferred sign; if it fails or returns a single bin, the sign is inverted to obtain better bins. The algorithm is parameterized with ```share_bin```, ```share_minority_bin``` and ```p_value```, defined as follows:

- ```share_bin```: minimum share of observations within each bin.
- ```share_minority_bin```: minimum share of the minority class (bad credit) within each bin.
- ```p_value```: maximum p-value between neighboring bins.
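The sign inference and the fallback described above can be sketched as follows. This is an illustration under assumptions, not the recipe's code: `infer_sign`, `bin_with_fallback`, `n_quantiles` and the `fit_bins` callable are hypothetical names, and `fit_bins` stands in for whatever call runs the monotonic binning with a given sign.

```python
import numpy as np
import pandas as pd


def infer_sign(feature, target, n_quantiles=10):
    """Return +1 or -1: the sign of the slope of the mean target per quantile."""
    data = pd.DataFrame({"x": feature, "y": target}).dropna()
    # Group the feature into quantiles to smooth the average target.
    data["q"] = pd.qcut(data["x"], q=n_quantiles, duplicates="drop")
    grouped = data.groupby("q", observed=True).agg(
        x_mean=("x", "mean"), y_mean=("y", "mean")
    )
    if len(grouped) < 2:
        return 1  # not enough groups to fit a slope; arbitrary default
    slope = np.polyfit(grouped["x_mean"], grouped["y_mean"], deg=1)[0]
    return 1 if slope >= 0 else -1


def bin_with_fallback(fit_bins, feature, target, sign):
    """Try the inferred sign first, then the opposite sign.

    fit_bins is a hypothetical callable (feature, target, sign) -> list of bins.
    """
    try:
        bins = fit_bins(feature, target, sign)
    except Exception:
        bins = None
    if bins is None or len(bins) <= 1:
        sign = -sign
        bins = fit_bins(feature, target, sign)  # errors here propagate
    return bins, sign
```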

# Data Handling

In the top thread, only variables with an information value above the set threshold are kept. This information value differs from the one computed in the [previous zone](article:18) because the binning has reduced the number of categories. The [applications_binned](dataset:applications_binned) dataset contains all the features encoded with the weight of evidence. A [Python recipe](recipe:compute_applications_filtered) then drops the variables whose information value does not exceed the threshold.
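The filtering step amounts to keeping the encoded columns of the variables whose information value exceeds the threshold. The sketch below is a hypothetical illustration: the function name, the `_woe` column suffix and the dictionary of information values are assumptions, not the actual schema of the project's datasets.

```python
import pandas as pd


def filter_by_iv(encoded, iv_by_variable, iv_threshold, target, suffix="_woe"):
    """Keep the WoE-encoded columns whose variable has IV above the threshold.

    encoded:        DataFrame of WoE-encoded features plus the target
    iv_by_variable: dict mapping variable name -> information value
    suffix:         assumed naming convention for encoded columns
    """
    keep = [name for name, iv in iv_by_variable.items() if iv > iv_threshold]
    columns = [f"{name}{suffix}" for name in keep if f"{name}{suffix}" in encoded.columns]
    return encoded[columns + [target]]
```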

# Numeric Variables

The numeric variable bins are processed through a [Prepare recipe](recipe:compute_woe_bins_numeric_clean) to obtain the [woe_bins_numeric_clean](dataset:woe_bins_numeric_clean) dataset. This dataset is used to build the charts shown in the [dashboard](article:27). It contains one row per bin for each variable.

# Categorical Variables

Categorical variables are handled separately. Before the binning takes place, the initial dataset contains one row per category for each variable. The objective of this thread is to make the bin names easier to read: initially, each bin name is the concatenation of all the categories it contains. A recipe produces the editable [woe_bins_categorical_edit](dataset:woe_bins_categorical_edit) dataset, where the user can give a meaningful name to each bin. These renamed bins are then joined to the initial dataset to create the [woe_bins_categorical_labels](dataset:woe_bins_categorical_labels) dataset, on which the dashboard charts are built.
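The join between the bins and their user-edited names can be pictured with pandas. In the project this is done with visual recipes, so the code below is only an illustration; the function name and the column names `variable`, `bin` and `bin_label` are assumptions, not the actual schema.

```python
import pandas as pd


def label_bins(bins, edited):
    """Attach user-defined labels to bins, falling back to the raw bin name.

    bins:   one row per (variable, bin), with the concatenated bin name
    edited: user-edited table with a bin_label for some (variable, bin) pairs
    """
    labeled = bins.merge(
        edited[["variable", "bin", "bin_label"]],
        on=["variable", "bin"],
        how="left",  # keep bins the user did not rename
    )
    labeled["bin_label"] = labeled["bin_label"].fillna(labeled["bin"])
    return labeled
```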


