The Dataiku Application lets the analyst configure the project through a visual interface, while the flow remains available for further customization.

# Connection Settings

![Connection Settings.png](xtOIrGDm7ju6)

The project ships with all datasets on the filesystem connection. The user can either keep this setup by leaving the Connection Settings section untouched or switch to a preferred connection, which must be either a Snowflake or a PostgreSQL connection. To switch, first select the connection from the list of available connections, then press the reconfigure connection button to call the [Reconfigure Flow Connections](scenario:RECONFIGUREFLOWCONNECTIONS) scenario, which switches the dataset connections in the flow. Next, enter the table name of the input dataset in the following field. Finally, the [Load Data Source](scenario:LOADDATASOURCE) scenario loads the dataset.

# Feature Identification

![Feature Identification.png](Qo2vBqb479Nw)

The input dataset must contain two mandatory columns, plus additional columns used to build the models. The mandatory columns that need to be mapped are the credit event variable, which should be constructed to fit the specifications of the study, and the id column, a unique identifier for each applicant. The user can then select the sensitive columns of the dataset, which are treated specifically as explained [here](article:3). All other columns are used for modeling in the rest of the project. Once the fields are specified, the run button triggers the [Feature Identification](scenario:FEATUREIDENTIFICATION) scenario to map and split the data.

# Feature Filtering

![Feature Filtering.png](1Pe8wiLJUPFi)

Feature filtering takes place in two successive steps. First, the univariate filters are driven by Information Value and Chi-Square p-value. For each of these metrics, the user specifies the threshold that separates kept variables from discarded ones. Then, the correlation filter looks at each pair of correlated variables and removes the one with the lower information value. The user sets the correlation threshold (the absolute correlation is compared to this value) and chooses the correlation method, either Spearman or Pearson. The filtering is triggered when the [Filter Features](scenario:FILTERFEATURES) scenario is called through the run button. After the page is refreshed, a few metrics about the filtering are displayed, and the user can dig deeper by clicking the button that leads to the [dashboard](article:26).
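The correlation filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual recipe: the function names `pearson` and `correlation_filter`, the column names, and the information values are hypothetical, and the choice to drop a variable when the absolute correlation is strictly greater than the threshold is an assumption. A Spearman variant would apply the same logic to the ranks of the values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_filter(columns, iv, threshold):
    """Drop the lower-IV variable of every pair correlated above the threshold.

    columns: dict mapping a variable name to its list of values
    iv: dict mapping a variable name to its information value
    threshold: limit on the absolute correlation (assumed strict here)
    """
    kept = set(columns)
    # Sort by decreasing IV so that, in every pair (a, b), a is the stronger one.
    ranked = sorted(columns, key=lambda name: -iv[name])
    for a, b in combinations(ranked, 2):
        if a in kept and b in kept:
            if abs(pearson(columns[a], columns[b])) > threshold:
                kept.discard(b)
    return [name for name in columns if name in kept]


# Hypothetical data: income_copy is almost a rescaled copy of income.
columns = {
    "income": [1, 2, 3, 4, 5],
    "income_copy": [2, 4, 6, 8, 10.5],
    "age": [5, 1, 4, 2, 3],
}
iv = {"income": 0.3, "income_copy": 0.1, "age": 0.2}
selected = correlation_filter(columns, iv, threshold=0.8)
```

In this toy example, `income` and `income_copy` are almost perfectly correlated, so the one with the lower information value (`income_copy`) is removed, while `age` is kept.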

# Feature Binning

![Feature Binning.png](chPUk00nd8Xg)

The next step is to bin the features using the weight of evidence and encode the variables using that same metric. The parameters needed are:

- Categorization Threshold: A numeric variable is treated as categorical if its number of unique values is under this threshold. Categorical variables are regrouped using the weight of evidence with no monotonicity constraint, whereas for numeric variables a continuous and monotonic relationship between the variable and the weight of evidence is enforced.
- Minimum Share: The minimum share of observations contained in each bin. Bins are merged until all of them reach this threshold.
- Minimum Minority Share: The minimum share of the minority class contained in each bin. This condition avoids bins that contain too few bad credit observations to be significant.
- Maximum p-value: For numeric binning only, the means of neighboring bins are tested for equality; whenever the p-value of this test is over the threshold, the bins are merged.
- IV filter threshold: As in Feature Filtering, a filtering step is applied using the information value and the threshold set by the user.
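To make the two metrics behind these parameters concrete, here is a minimal sketch of how the weight of evidence (WoE) of each bin and the information value (IV) of a variable are commonly computed. The function name `woe_iv` and the counts are hypothetical, and the sign convention (goods over bads inside the logarithm) is an assumption; the project may use the opposite one.

```python
from math import log


def woe_iv(bins):
    """Compute the WoE of each bin and the IV of the variable.

    bins: list of (good_count, bad_count) tuples, one per bin.
    Assumes every bin holds at least one good and one bad observation,
    otherwise the logarithm is undefined.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes = []
    iv = 0.0
    for good, bad in bins:
        share_good = good / total_good  # share of all goods falling in this bin
        share_bad = bad / total_bad     # share of all bads falling in this bin
        woe = log(share_good / share_bad)
        woes.append(woe)
        iv += (share_good - share_bad) * woe
    return woes, iv


# Hypothetical counts for a variable split into three bins.
woes, iv = woe_iv([(80, 20), (60, 40), (20, 40)])
```

A bin without any observation of one class would make the logarithm undefined, which is one practical reason to require a minimum minority share per bin. A variable whose IV falls under the IV filter threshold carries little separation between good and bad applicants.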

The run button will trigger the [Bin Variables](scenario:BINVARIABLES) scenario, and then the user can edit the editable dataset to give more meaningful labels to the bins. Finally, a link to the [dashboard](article:27) is provided.

# Feature Selection

![Feature Selection.png](67HkZqQxLDvQ)

In the feature selection section, the user can specify one of the three available methods for feature selection and the number of selected features desired in the end. The process is launched by calling the [Feature Selection](scenario:FEATURESELECTION) scenario, and information about the selection is available in the [dashboard](article:28).

# Score Card Building

![Score Card Building.png](NDlB1ygCWfIm)

The scorecard is built by first training the logistic regression through the [Train Credit Model](scenario:TRAINCREDITMODEL) scenario. Then the three following parameters are set:

- Base score: The score attained when the expected odds are equal to the base odds.
- Base odds: The odds associated with the base score. Odds of 10 mean that an applicant is 10 times more likely to be good than bad; therefore, P(good) = 10/11 and P(bad) = 1/11.
- Points to double odds: The two parameters above set the starting point of the scorecard, while points to double odds sets its scale. With a value of 50, the odds are expected to double each time the score increases by 50 points.
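The way these three parameters interact can be illustrated with the usual scorecard scaling, where the score is a linear function of the log-odds. This is a sketch under that assumption rather than the project's exact code; the function names and the base score of 600 are hypothetical.

```python
from math import log


def scorecard_scaling(base_score, base_odds, pdo):
    """Return the (factor, offset) pair of the linear log-odds scaling."""
    factor = pdo / log(2)                          # points added when odds double
    offset = base_score - factor * log(base_odds)  # anchors base_odds at base_score
    return factor, offset


def score_from_odds(odds, factor, offset):
    """Score of an applicant whose good-to-bad odds are `odds`."""
    return offset + factor * log(odds)


# Hypothetical settings: base score 600 at odds of 10, 50 points to double odds.
factor, offset = scorecard_scaling(base_score=600, base_odds=10, pdo=50)
score_at_base = score_from_odds(10, factor, offset)    # equals the base score
score_at_double = score_from_odds(20, factor, offset)  # base score plus pdo
p_good = 10 / (1 + 10)                                 # odds of 10 -> P(good) = 10/11
```

With these settings, an applicant at the base odds of 10 scores 600, and an applicant at odds of 20 scores 650, which is exactly one "points to double odds" step higher.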

The run button will rebuild the scorecard using the [Build Score Card](scenario:BUILDSCORECARD) scenario, and information about this scorecard will be available in the [dashboard](article:29).

Finally, a third scenario, [Update Webapp and API](scenario:UPDATEWEBAPPANDAPI), updates the API and refreshes the webapp.

# Fairness Analysis

![Fairness Analysis.png](TzXxIRQvJ4gD)

This last section focuses on the fairness of the model. The analysis is carried out on one sensitive variable at a time. The user selects the sensitive variable, runs the process with the [Analyze Sensitive Variable](scenario:ANALYZESENSITIVEVARIABLE) scenario, and then views the results in the [dashboard](article:31).