# Hands-On: Tune the Model

In the **Machine Learning Basics** series, you built a basic model to classify high revenue customers and looked at a few ways to evaluate its performance.

Tip

You’ll also find this tutorial as part of the Academy course, Machine Learning Basics, which is part of the ML Practitioner learning path.

Because modeling is an iterative process, let’s now turn our attention to improving the model’s results and speeding up the evaluation process.

* In the **Machine Learning Basics (Tutorial)** project, return to the **Models** tab of the *High revenue analysis*.

By default, you land on the tab showing the results of your training session. This tab also gives you a quick preview of the visual ML diagnostics.

In this lesson, you’ll start in the **Design** tab.

## Configure the Train / Test Split

By default, Dataiku randomly splits the first N rows of the input dataset into a training set and a test set. The default ratio is:

* 80% for training, and

* 20% for testing.

This means Dataiku takes the first N rows of the dataset and randomly assigns 80% of those rows to the training set. If the dataset is ordered in some way, the first N rows may not be representative, which could give the model a biased view of the data.
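
The default behavior described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Dataiku's implementation: the function name `head_then_random_split`, the seed, and the toy row ids are all hypothetical.

```python
import random

def head_then_random_split(rows, n_rows, train_ratio=0.8, seed=42):
    """Take the first n_rows, then randomly assign them to train and test."""
    sample = list(rows[:n_rows])   # only the head of the dataset is considered
    rng = random.Random(seed)      # fixed seed so the shuffle is reproducible
    rng.shuffle(sample)
    cut = int(len(sample) * train_ratio)
    return sample[:cut], sample[cut:]

# Hypothetical dataset of 1000 rows, identified by their position.
rows = list(range(1000))
train, test = head_then_random_split(rows, n_rows=100)

print(len(train), len(test))  # 80 20
print(max(train + test))      # 99: rows beyond the first 100 are never used
```

If the rows happen to be sorted (by date, for example), everything after the first N rows is invisible to both the training and the test set, which is the source of the potential bias.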

Analyzing the target column, *high\_revenue*, shows that the classes are imbalanced.

This could be problematic when taking only the first N rows of the dataset and randomly splitting it into train and test sets. However, since our dataset is small, we’ll keep the default sampling & splitting strategy.

Note

One way to address a class imbalance is to apply a class rebalance sampling method. Visit Settings: Train / Test set to discover how Dataiku allows you to configure sampling and splitting.
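
To make the idea of rebalancing concrete, here is a minimal sketch that undersamples the majority class until both classes have the same number of rows. It is one possible rebalancing technique, shown on made-up data; the helper `undersample_majority` is hypothetical and is not a description of how Dataiku's own sampling method works.

```python
import random
from collections import Counter

def undersample_majority(rows, target_key, seed=0):
    """Randomly drop rows so that every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))  # sample without replacement
    rng.shuffle(balanced)
    return balanced

# Hypothetical imbalanced data: 90 negatives and 10 positives.
rows = [{"high_revenue": False}] * 90 + [{"high_revenue": True}] * 10
balanced = undersample_majority(rows, "high_revenue")
counts = Counter(row["high_revenue"] for row in balanced)
print(counts[True], counts[False])  # 10 10
```

Undersampling discards data, so on a small dataset like the one in this tutorial it can cost more than it gains.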

## Configure ML Assertions

One of the ways to streamline and accelerate the model evaluation process is by automatically checking that predictions for specific subpopulations meet certain conditions.

A business analyst has analyzed the relationship between the top two variables from the Variable importance chart, *age\_first\_order* and *pages\_visited\_avg*, and the target, *high\_revenue*, to assert the following:

* When *age\_first\_order* is greater than or equal to 40, the model should predict “high revenue = true” for at least 10% of customers.

* When *pages\_visited\_avg* is between 6 and 12, the model should predict “high revenue = true” for at least 10% of customers.

Rather than having to spot-check the predicted results, we can add a conditional statement, known as an ML Assertion, to check that the model behaves as expected.
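
Conceptually, an assertion of this kind filters the rows that match a condition and checks what share of them received the expected prediction. The sketch below illustrates that logic on made-up rows and predictions. The function `check_assertion`, its return format, and the choice to fail when no row matches are assumptions made for illustration, not Dataiku's API or behavior.

```python
def check_assertion(rows, predictions, condition, expected_class=True, min_valid_ratio=0.10):
    """Check that enough of the rows matching `condition` are predicted as `expected_class`."""
    matching = [pred for row, pred in zip(rows, predictions) if condition(row)]
    if not matching:
        # Assumption: an assertion with no matching rows is reported as not passed.
        return {"matching_rows": 0, "valid_ratio": None, "passed": False}
    valid = sum(1 for pred in matching if pred == expected_class)
    ratio = valid / len(matching)
    return {"matching_rows": len(matching), "valid_ratio": ratio, "passed": ratio >= min_valid_ratio}

# Hypothetical customers and model predictions.
rows = [
    {"age_first_order": 45, "pages_visited_avg": 7},
    {"age_first_order": 30, "pages_visited_avg": 3},
    {"age_first_order": 52, "pages_visited_avg": 10},
    {"age_first_order": 41, "pages_visited_avg": 14},
]
predictions = [True, False, False, False]

age_check = check_assertion(rows, predictions, lambda r: r["age_first_order"] >= 40)
pages_check = check_assertion(rows, predictions, lambda r: 6 <= r["pages_visited_avg"] <= 12)

print(age_check["matching_rows"], age_check["passed"])      # 3 True
print(pages_check["matching_rows"], pages_check["passed"])  # 2 True
```

Here one of the three customers aged 40 or more is predicted `True`, which is above the 10% threshold, so the first check passes.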

To add assertions:

* In the **Design** tab, locate the **Basic** section.

* Choose **Debugging**, then scroll down or zoom out to view **Assertions**.

* Select **Add An Assertion** to add the first assertion.

* Configure the following conditional statement:

* On rows that satisfy **all the following conditions**

+ *age\_first\_order* **>=** `40`

+ **Expected class** is `True`

+ With a valid ratio greater than or equal to `10%`.

* Select **Add Another Assertion**.

* Configure the following conditional statement:

* On rows that satisfy **a formula**

* Type the formula below:

`pages_visited_avg >= 6 && pages_visited_avg <= 12`

* Ensure that **Expected class** is set to `True`.

* Set the valid ratio greater than or equal to `10%`.

* Save your changes.

Now, whenever we train the model, Dataiku will run ML diagnostics, including the assertion checks we just configured. We’ll then find the results of these checks under **Metrics and assertions** in the **Performance** section of the model report.

## Feature Handling

To control how variables are preprocessed before training the model, we’ll use the **Features handling** panel. Here, Dataiku lets you tune different settings.

* Select **Features handling** in the **Features** section.

### Reject Geopoint Feature

The **Role** of a variable (or feature) determines whether it is used (Input) or not used (Reject) in the model.

Let’s remove **ip\_address\_geopoint** from the model.

* Turn off **ip\_address\_geopoint**.

This action changes the handling of the feature to **Reject**.

### Disable Rescaling Behavior

Each variable type can be handled differently.

The **Type** of the variable is very important to define how it should be preprocessed before it is fed to the machine learning algorithm:

* **Numerical** variables are real-valued. They can be integers or decimals.

* **Categorical** variables store nominal values: red/blue/green, a zip code, a gender, etc. A variable that looks Numerical should sometimes be treated as Categorical instead, for example when an “id” is used in place of the actual value.

* **Text** is meant for raw blocks of textual data, such as a Tweet, or customer review. Dataiku is able to handle raw text features with specific preprocessing.

The numerical variables *age\_first\_order* and *pages\_visited\_avg* have been automatically normalized using a standard rescaling (this means that the values are normalized to have a mean of 0 and a variance of 1).
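
As a reminder of what standard rescaling does, the sketch below subtracts the mean and divides by the standard deviation. The ages are made-up values, and using the population standard deviation is an assumption; the source does not say which estimator Dataiku uses.

```python
from statistics import fmean, pstdev

def standard_rescale(values):
    """Rescale values to mean 0 and variance 1 (population standard deviation)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical values for age_first_order.
ages = [22, 35, 41, 58, 64]
scaled = standard_rescale(ages)

# The rescaled column has (approximately) zero mean and unit variance.
print(abs(fmean(scaled)) < 1e-9, abs(pstdev(scaled) - 1.0) < 1e-9)  # True True
```

Tree-based models such as random forests split on thresholds, so they are largely insensitive to this kind of monotonic rescaling.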

We’ll want to disable this behavior and use **No rescaling** instead.

* Select the checkboxes for the variables *age\_first\_order* and *pages\_visited\_avg*.

Dataiku displays a menu where you can select the handling of the selected features.

* Under **Rescaling**, select **No Rescaling**.

## Feature Generation

Generating new features can reveal unexpected relationships between the inputs (variables/features) and the target.

We can automatically generate new numeric features using **Pairwise linear combinations** and **Polynomial combinations** of existing numeric features.
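
To give a feel for what such generated features look like, the sketch below builds sums, differences, and products for every pair of numeric columns. Treating sums and differences as the linear combinations and products as the polynomial ones is an assumption for illustration, as are the function name and the naming scheme of the new columns.

```python
from itertools import combinations

def pairwise_features(row, numeric_columns):
    """Generate sum, difference, and product features for each pair of numeric columns."""
    generated = {}
    for a, b in combinations(numeric_columns, 2):
        generated[f"{a}+{b}"] = row[a] + row[b]  # linear combination
        generated[f"{a}-{b}"] = row[a] - row[b]  # linear combination
        generated[f"{a}*{b}"] = row[a] * row[b]  # polynomial (interaction) term
    return generated

# Hypothetical customer row.
row = {"age_first_order": 40, "pages_visited_avg": 8}
features = pairwise_features(row, ["age_first_order", "pages_visited_avg"])

print(features["age_first_order+pages_visited_avg"])  # 48
print(features["age_first_order-pages_visited_avg"])  # 32
print(features["age_first_order*pages_visited_avg"])  # 320
```

The number of generated features grows quadratically with the number of numeric columns, which is worth keeping in mind on wide datasets.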

Note

The **Script** tab of a visual analysis includes all of the processors found in the Prepare recipe. Any features created here can be immediately fed to models. Please review lessons on the Prepare recipe and the Lab if this is unfamiliar to you.

* In the **Features** section, select the **Feature generation** panel.

* Select **Pairwise linear combinations**, then set **Enable** to **Yes**.

* Select **Pairwise polynomial combinations**, then set **Enable** to **Yes**.

## Retrain the Model

After altering the model’s settings, you can now train and build some new models.

* Select **Save** and then click **Train**.

* Select **Train** again to start the session.

Once the session has completed, you can see that the performance of the random forest model has improved slightly.

## Evaluate the Model from Session 2

Session 2 results in a random forest model with an AUC value higher than that of the first model.
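
For readers who want to see what the AUC measures, it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing every positive/negative pair. The labels and scores are made up and do not come from the tutorial's models.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels and predicted probabilities from two sessions.
labels = [True, False, True, False, False]
scores_session_1 = [0.6, 0.7, 0.8, 0.3, 0.2]
scores_session_2 = [0.9, 0.4, 0.8, 0.3, 0.2]

auc_1 = roc_auc(labels, scores_session_1)  # 5 of the 6 pairs are ranked correctly
auc_2 = roc_auc(labels, scores_session_2)  # all 6 pairs are ranked correctly
print(auc_2 > auc_1)  # True
```

The pairwise loop is quadratic and only meant for illustration; libraries compute the same quantity from sorted ranks.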

### Diagnostics

When training is complete, we can go directly to ML diagnostics.

* Select **Diagnostics** in the **Result** tab of the random forest model to view the results of the ML diagnostics checks.

Dataiku displays **Model Information** > **Training information**. Here, we can view warnings and get advice to avoid common pitfalls, including when a feature has a suspiciously high importance, which could be due to a data leak or overfitting.

This is like having a second set of eyes that provides warnings and advice, so that you can identify and correct these issues while developing the model.
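
The kind of check behind a “suspiciously high importance” warning can be sketched as follows. The rule Dataiku actually applies is not described here, so the 0.8 threshold, the function name, and the importance values are all arbitrary assumptions used only to illustrate the idea.

```python
def flag_dominant_features(importances, threshold=0.8):
    """Return features whose share of the total importance reaches the threshold."""
    total = sum(importances.values())
    if total == 0:
        return []
    return [name for name, value in importances.items() if value / total >= threshold]

# Hypothetical importances where one feature dominates, as a leaked column might.
importances = {
    "leaked_target_copy": 0.91,
    "age_first_order": 0.06,
    "pages_visited_avg": 0.03,
}
flagged = flag_dominant_features(importances)
print(flagged)  # ['leaked_target_copy']
```

A flagged feature is not proof of leakage; it is a prompt to check how that column was built and whether it would be available at prediction time.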

### Metrics and Assertions

Now we can find out if our ML assertion check passed or failed.

* Select **Metrics and assertions** in the **Performance** section.

Dataiku displays the results of the assertion checks. For each assertion, we can see whether it passed, the number of rows matching its criteria, and the percentage of valid rows.

### Variable Importance

Finally, let’s look at the Variable importance chart for the latest model.

* Select **Variable importance** in the **Explainability** section.

We can see that the importance is spread across the campaign variable along with the features automatically generated from *age\_first\_order* and *pages\_visited\_avg*. The generated features may have uncovered some previously hidden relationships.

Note

You might find that your actual results are different from those shown. This is due to differences in how rows are randomly assigned to training and testing samples.

### Table View

Now that you have trained several models, the results may not all fit on your screen. To see all your models at a glance:

* Go back to the **Result** tab.

* Switch to the **Table** view.

You can sort the Table view by any column, such as ROC AUC, by clicking the column title.

## What’s Next?

Congratulations, you just built, evaluated, and tuned your first predictive model using Dataiku!

How do we know, however, if this model to predict high revenue customers is biased? Is it performing similarly for male and female customers, for example?

In Hands-On: Explain Your Model, we’ll spend more time trying to understand and interpret the model’s predictions.
