Data is the raw material for building credit models. It comes from multiple sources, and an essential part of this project is to connect to these sources, then combine and process the data into a single input dataset.

# Data Sources

Data may come from:
- The customer directly: for instance, when they fill in a form or provide supporting documents for a loan application. In that case, the data has to be verified to check that it was filled in accurately, to look for inconsistencies, and possibly to detect fraud.
- Internal systems: the person asking for a credit product might already be a customer, in which case historical data about the applicant's behavior is available to gain insight.
- External providers: the main source of external information for credit risk is credit bureaus, which collect credit-related data from multiple financial institutions and aggregate it to provide scores and historical time series.
- Others: depending on the type of credit, other sources could be added to the mix, including geographical, sociological, and economic information from research or public sources, for example.
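As a rough illustration of how these sources might be brought together, the sketch below joins three small tables on a shared applicant identifier using pandas. The table names, column names, and values are all made up for the example; real sources will rarely share a clean key, and matching applicants across systems is usually the hard part.

```python
import pandas as pd

# Hypothetical extracts, one row per applicant in each source.
application = pd.DataFrame(
    {"applicant_id": ["A", "B", "C"], "declared_income": [3000, 4500, 2800]}
)
internal = pd.DataFrame(
    {"applicant_id": ["A", "B"], "months_as_customer": [24, 6]}
)
bureau = pd.DataFrame(
    {"applicant_id": ["A", "C"], "bureau_score": [710, 640]}
)

# Left joins keep every application, even when a source has no record for
# the applicant. validate= raises an error if a key is unexpectedly duplicated.
dataset = application.merge(
    internal, on="applicant_id", how="left", validate="one_to_one"
).merge(bureau, on="applicant_id", how="left", validate="one_to_one")
```

Applicants missing from a source end up with empty values in that source's columns, which then have to be handled explicitly during data preparation.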

# Data Preparation

From these multiple fields, derived features are built using standard operators such as min, max, sum, first, last, average, and standard deviation. They can be computed over the whole history or over rolling windows of fixed lengths, such as 1 month, 2 months, and 6 months. An important challenge is combining all the information related to a single applicant, as one applicant may already hold multiple credit products within the bank or with other financial institutions. Some filtering can also take place to remove anomalies, outliers, or detected fraud, which are not the core interest of the creditworthiness analysis. Creating a pipeline that channels this data from its sources through cleaning and preparation into a single dataset is a challenging task. It is also essential, since the model will not perform well if the data is corrupted or inadequate.
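The feature construction described above could be sketched as follows with pandas. This is a minimal example under assumptions: the `transactions` table, its `applicant_id`, `date`, and `amount` columns, the `build_features` helper, and the reference date are all hypothetical, and the windows are measured backward from that reference date.

```python
import pandas as pd

AGGREGATIONS = ["min", "max", "sum", "first", "last", "mean", "std"]


def build_features(transactions, as_of, windows_months=(1, 2, 6)):
    """Aggregate each applicant's history into a single row of features."""
    # Keep only what is known at the reference date, in chronological order
    # so that "first" and "last" are meaningful.
    history = transactions[transactions["date"] <= as_of].sort_values("date")

    # Features over the whole history.
    grouped = history.groupby("applicant_id")["amount"]
    features = grouped.agg(AGGREGATIONS).add_suffix("_all")

    # Features over fixed-length windows ending at the reference date.
    for months in windows_months:
        start = as_of - pd.DateOffset(months=months)
        recent = history[history["date"] > start]
        window = recent.groupby("applicant_id")["amount"].agg(AGGREGATIONS)
        features = features.join(window.add_suffix(f"_{months}m"), how="left")
    return features


# Made-up data for illustration only.
transactions = pd.DataFrame(
    {
        "applicant_id": ["A", "A", "A", "B"],
        "date": pd.to_datetime(
            ["2023-01-15", "2023-05-20", "2023-06-20", "2023-02-01"]
        ),
        "amount": [100, 50, 30, 200],
    }
)
features = build_features(transactions, as_of=pd.Timestamp("2023-06-30"))
```

An applicant with no activity inside a window gets missing values for that window's features, and the standard deviation is undefined for a single observation. Both cases need an explicit treatment before modeling.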

# Target Building

Depending on the type of credit model one wishes to build, the target will differ so that it represents the actual risk of interest. The timeframe over which credit events are observed should also depend on the credit risk being analyzed, as one might use longer periods for credit products with longer terms. A credit event can be defined in multiple ways, such as an excessive overdraft, a delay in payment, or a complete default. In the end, the target is simply a binary variable, 1 or 0, indicating whether the credit was evaluated as good or bad.
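One possible way to turn credit events into a binary target is sketched below. Everything specific here is an assumption made for illustration: the `events` table and its `days_past_due` column, the `build_target` helper, the 90-day threshold, and the convention that 1 marks a bad credit. The right event definition and encoding depend on the product and the risk being modeled.

```python
import pandas as pd


def build_target(events, applicant_ids, max_days_past_due=90):
    """Return 1 for applicants with a serious delay, 0 otherwise."""
    # Worst delay observed for each applicant that has at least one event.
    worst = events.groupby("applicant_id")["days_past_due"].max()
    # Applicants without any recorded event are treated as having no delay.
    worst = worst.reindex(applicant_ids, fill_value=0)
    return (worst >= max_days_past_due).astype(int).rename("target")


# Made-up data for illustration only.
events = pd.DataFrame(
    {"applicant_id": ["A", "A", "B"], "days_past_due": [30, 120, 10]}
)
target = build_target(events, applicant_ids=["A", "B", "C"])
```

In this toy example, applicant A is flagged because of the 120-day delay, while B and C are not.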

# Observation and Performance Period

The definitions of the observation and performance periods are key elements of dataset building: they ensure that both target and features are built on comparable and meaningful timeframes. The observation period is the period during which the features are captured and built, while the performance period comes afterward and is when a credit event, if any, is observed. When credit events play out over a long timeframe, for instance, it is important to check whether they are significantly affected by external variables that are not captured in the model, so that all applicants remain comparable.
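The separation between the two periods can be illustrated with a short pandas sketch. The `records` table, its `date` column, the `split_periods` helper, the cutoff date, and the 12-month lengths are hypothetical choices for the example, not recommended values.

```python
import pandas as pd


def split_periods(records, cutoff, observation_months=12, performance_months=12):
    """Split dated records into observation and performance periods."""
    obs_start = cutoff - pd.DateOffset(months=observation_months)
    perf_end = cutoff + pd.DateOffset(months=performance_months)

    # Observation period: up to and including the cutoff. Features come from here.
    in_observation = (records["date"] > obs_start) & (records["date"] <= cutoff)
    # Performance period: strictly after the cutoff. The target comes from here.
    in_performance = (records["date"] > cutoff) & (records["date"] <= perf_end)
    return records[in_observation], records[in_performance]


# Made-up data for illustration only.
records = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-06-01", "2022-03-01", "2022-12-31", "2023-01-01", "2024-02-01"]
        ),
        "amount": [10, 20, 30, 40, 50],
    }
)
observation, performance = split_periods(records, cutoff=pd.Timestamp("2022-12-31"))
```

Keeping the two masks disjoint guarantees that no information from the performance period leaks into the features.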


