The Feature Filtering Zone handles the initial filtering of variables. The concepts used here are explained in this [article](article:7). The details of each recipe are described below.

![Feature Filtering.png](UvuCdEsxqnG5)

The first [python recipe](recipe:compute_applications_entropy) computes the two statistical measures used for the filtering: the information value and the p-value of the chi-square statistic. Both measures require categorical variables, so numeric variables that have more than 6 distinct values are binned into 6 quantiles. The measures are written to a dataset along with the thresholds, and some metrics about the filtering are saved in project variables. The thresholds for filtering variables are set in project variables through the [Dataiku Application](article:15).
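
As an illustration of these measures, here is a minimal sketch and not the code of the recipe itself. It assumes a pandas DataFrame with a binary target coded 0/1. The function names, the output column names, and the `eps` smoothing term are hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

MAX_DISTINCT = 6  # numeric variables with more distinct values are binned
N_QUANTILES = 6   # number of quantile bins


def to_categorical(series: pd.Series) -> pd.Series:
    """Return a categorical (string) version of a variable."""
    if pd.api.types.is_numeric_dtype(series) and series.nunique() > MAX_DISTINCT:
        # duplicates="drop" merges bins whose quantile edges coincide
        return pd.qcut(series, q=N_QUANTILES, duplicates="drop").astype(str)
    return series.astype(str)


def information_value(feature: pd.Series, target: pd.Series, eps: float = 0.5) -> float:
    """IV = sum over categories of (p0 - p1) * ln(p0 / p1)."""
    counts = pd.crosstab(feature, target).reindex(columns=[0, 1], fill_value=0)
    # eps is an assumed smoothing term that avoids log(0) for empty cells
    class0 = counts[0] + eps
    class1 = counts[1] + eps
    p0 = class0 / class0.sum()
    p1 = class1 / class1.sum()
    return float(((p0 - p1) * np.log(p0 / p1)).sum())


def chi2_p_value(feature: pd.Series, target: pd.Series) -> float:
    """p-value of the chi-square independence test between feature and target."""
    _, p_value, _, _ = chi2_contingency(pd.crosstab(feature, target))
    return float(p_value)


def compute_measures(df: pd.DataFrame, target_col: str) -> pd.DataFrame:
    """One row per variable with its information value and chi-square p-value."""
    rows = []
    for col in df.columns:
        if col == target_col:
            continue
        feature = to_categorical(df[col])
        rows.append({
            "variable": col,
            "information_value": information_value(feature, df[target_col]),
            "chi2_p_value": chi2_p_value(feature, df[target_col]),
        })
    return pd.DataFrame(rows)
```

A higher information value and a lower p-value both indicate a stronger relationship between the variable and the target.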

The next dataset [applications_entropy](dataset:applications_entropy) contains the metrics for all variables and is used to build the graphs that are displayed in the [dashboard](article:26). The [filter](recipe:compute_applications_statistical_filter) is applied to keep only the most significant variables. The following [python recipe](recipe:compute_applications_initial_filter) takes the initial dataset and removes the variables that are not included in the filtered dataset just built.
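
The two steps above can be sketched as follows. This is a hedged illustration only: the exact filter conditions are not stated here, so the sketch assumes a variable is kept when its information value is at least a minimum threshold and its p-value is at most a maximum threshold. The column names, function names, and the `always_keep` parameter are hypothetical.

```python
import pandas as pd


def significant_variables(measures: pd.DataFrame,
                          iv_threshold: float,
                          p_value_threshold: float) -> list:
    """Names of the variables that pass both (assumed) threshold conditions."""
    keep = (
        (measures["information_value"] >= iv_threshold)
        & (measures["chi2_p_value"] <= p_value_threshold)
    )
    return measures.loc[keep, "variable"].tolist()


def keep_columns(initial: pd.DataFrame, variables, always_keep=()) -> pd.DataFrame:
    """Drop every column of the initial dataset that did not pass the filter."""
    wanted = set(variables) | set(always_keep)
    # preserve the original column order of the initial dataset
    return initial[[col for col in initial.columns if col in wanted]]
```

In this sketch, `always_keep` would typically carry the target column so that it is not removed along with the filtered-out variables.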

After the univariate filtering, a multivariate analysis is run to remove variables that are too strongly correlated with one another. In the [python recipe](recipe:compute_correlation_matrix), pairs of variables whose absolute correlation exceeds the defined threshold are selected. For each pair, the information values are pulled from the [applications_entropy](dataset:applications_entropy) dataset and the variable with the lower information value is dropped. Statistics about the number of variables filtered out are saved in project variables to be displayed in the Dataiku Application, and the output dataset contains only the variables that have passed all the tests.
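
The pairwise rule described above could look like the sketch below. It is an illustration under assumptions, not the recipe's code: the function name, the dictionary of information values, and the default threshold of 0.8 are hypothetical, and pandas' default Pearson correlation is used because the correlation measure is not specified here.

```python
import pandas as pd


def drop_correlated(df: pd.DataFrame,
                    information_values: dict,
                    threshold: float = 0.8) -> list:
    """Return the variables kept after removing the weaker one of each correlated pair."""
    # absolute correlation matrix over the numeric columns only
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    dropped = set()
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            if corr.loc[left, right] > threshold:
                # drop the variable of the pair with the lower information value
                if information_values[left] < information_values[right]:
                    dropped.add(left)
                else:
                    dropped.add(right)
    return [col for col in cols if col not in dropped]
```

The length of the returned list, compared with the number of input columns, gives the count of variables filtered out at this stage.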


