# Data Processing & Machine Learning Quick Start

Contents

* Getting Started

* Create and Explore the Project

* Import and Sync Data

* Explore and Analyze Data

* Transform Data — Join

* Transform Data — Prepare

* Design and Train a Machine Learning Model

* Deploy the ML Model and Score a Test Set

* Next Steps

## Getting Started

**Dataiku DSS** is a collaborative, end-to-end machine learning (ML) platform that unites data analysts, data scientists, data engineers, architects, and business users in a common space to bring faster business insights.

In this tutorial, you’ll get hands-on practice with Dataiku DSS through importing, cleaning, interpreting, and processing data for the purpose of predicting credit card fraud. You’ll also use Dataiku’s visual machine learning interface to perform AutoML with minimal effort and even customize models in the visual ML interface by leveraging code libraries.

### Prerequisites

To complete this tutorial, you’ll need the following:

* The cardholder\_info CSV Zip file. You’ll upload this file during the tutorial.

* Dataiku DSS - version 9.0 or above (the Free edition is compatible).

* The Reverse Geocoding plugin. If your instance of Dataiku DSS does not already have this plugin, you’ll need to install it. To learn about installing a plugin, visit Installing plugins.

* A SQL connection, such as Snowflake or PostgreSQL.

If you do not already have your own instance of Dataiku DSS (or the Free edition), you can access a DSS instance that satisfies the prerequisites listed above by starting a free Dataiku Cloud trial from Snowflake Partner Connect. This trial gives you access to an instance of Dataiku Online with a built-in SQL connection (a Snowflake connection).

### Objectives

Our primary goal is to build and explore a project workflow (**Flow**) that processes input datasets and builds an optimized machine learning model.

You’ll see how Dataiku DSS can be used to meet your data processing and machine learning needs — and more.

#### What We’re Building

We’ll be working with an existing project that contains input datasets. We’ll build a data science pipeline by applying data transformations, building a machine learning model, and deploying it to the Flow. At the end of the tutorial, the project Flow will look like this:

The final Flow will contain datasets, recipes, and machine learning processes.

* A **dataset** is represented by a blue square with a symbol that depicts the dataset type or connection. The initial datasets (also known as input datasets) are found on the left of the Flow. In this project, the input datasets are uploaded CSV files.

* A **recipe** in Dataiku DSS (represented by a circle icon with a symbol that depicts its function) can be either visual or code-based, and it contains the processing logic for transforming datasets.

* Finally, **machine learning processes** are represented by green icons.

#### How We’ll Build The Project

Our goal is to build an optimized machine learning model that can be used to predict whether or not a credit card transaction is fraudulent.

To do this, we’ll transform the input datasets so that they are clean and ready to use for building a binary classification model.

#### How to Navigate in Dataiku DSS

Throughout this tutorial, we’ll be using the top navigation bar and the right panel to navigate and perform actions.

In upcoming sections, we’ll explore the datasets in the Flow, transform them, and build a machine learning model.

## Create and Explore the Project

In this section, we’ll create and plan our project; identify business needs; and pinpoint the necessary transformations to our input data.

### Open the Dataiku DSS Homepage

The DSS homepage is the first page you’ll see when you launch Dataiku DSS. From the DSS homepage, you can browse projects, recent items, dashboards, and applications shared with you on the DSS instance.

#### Open the DSS Homepage From the Snowflake Partner Connect Instance

If you are getting started from the Dataiku Cloud trial from Snowflake Partner Connect, the first step is to go to your launchpad, where you’ll find your **Snowflake Partner Connect** instance.

* Click **Open Dataiku DSS** to display the DSS homepage.

#### Open the DSS Homepage From Your Instance of Dataiku DSS

Alternatively, if you are getting started from your own instance of Dataiku DSS:

* Sign in to your Dataiku instance to display the DSS homepage.

### Create the Project

Once you’ve opened the DSS homepage, you can create the project.

* From the Dataiku DSS homepage, click **+New Project**.

* Choose **DSS Tutorials** > **Quick Start** > **Data Processing and ML (Tutorial)**.

Note

You can also download the starter project from this website and import it as a zip file.

DSS opens the Summary tab of the project, also known as the **project homepage**. This page contains a high-level overview of the project’s status and recent activities.

* Click **Go To Flow** to open up the project workflow, called the **Flow**.

Next, we’ll explore the project, starting with the collaboration features available in Dataiku DSS.

### View a Discussion and the Project Wiki

Dataiku DSS provides many collaboration features that make it easy for team members on the same Dataiku DSS instance to share and communicate.

To help us start analyzing our input datasets, we can explore project comments, descriptions, and features like the project Wiki. Doing this will help us get oriented whenever we open a project. On this project, we don’t have to look far; there is already a discussion on one of our input datasets, *transactions*. This is indicated by a discussion icon.

Let’s view the discussion:

* Select the *transactions* dataset then open the right-side panel.

* Click the **Discussions** icon, and then click anywhere on the text of the discussion to open it.

The discussion displays a message from a business analyst requesting information about the dataset. Specifically, they want to know the meaning of the *authorized\_flag* column. We can check the project Wiki to see if there is any preliminary information about this dataset.

* From the top navigation bar, select the **Wiki** menu, and then click **Wiki**.

Since there is only one article, the “Project Read Me”, Dataiku DSS opens it. This article includes some preliminary information about the input datasets. Using this information, we know that the fraudulent (or unauthorized) transactions are labeled as “0” in the *authorized\_flag* column.

You can read the rest of the wiki’s contents to learn more about the datasets (*transactions* and *merchant\_info*) in the Flow.

In the next section, we’ll import a dataset called *cardholder\_info* into the project and explore it to obtain some insights and check the data quality. Later, we can revisit the discussion and leave a reply.

## Import and Sync Data

In this section, we’ll add a new dataset *cardholder\_info* to the Flow. This dataset contains information useful for analytics including cardholder location and how long each card has been active. We’ll also sync our input datasets to a SQL database.

### Import the Dataset

Dataiku DSS allows you to import data from multiple sources into the same project. For example, you can upload files of various formats, connect to SQL and NoSQL databases, connect to Cloud storage, access data from servers, etc. The product documentation provides details on the different ways to connect to data.

We will import the *cardholder\_info* dataset into the project to complete our input datasets.

* First, download the cardholder\_info.csv.zip file.

* Go to the **Flow**. To do this, you can always use the top navigation bar or the keyboard shortcut `G+F`.

* Click **+Dataset** in the top right corner of the Flow.

* Click **Upload your files**.

* Drop or select the *cardholder\_info.csv.zip* file in the space provided.

* Click the **Format/Preview** tab to preview the file’s content.

In the **Format/Preview** tab, Dataiku DSS has provided values for the file parameters. For example, the **Type** is **Separated values (CSV, TSV, …)**.

Tip

By clicking the dropdown arrow next to the selected type, you can see a list of other supported file types, including SAS database, Parquet, JSON, etc.

Plugins are another way to import files into your DSS project. For example, the SAS Format Reader plugin provides a means to import SAS7BDAT files into DSS.

* Click the **Schema** tab to see the dataset’s schema. So far, the seven columns in the dataset are stored as “strings”.

* Click the **Infer Types from Data** button so that Dataiku DSS can use a sample of the data to infer the storage type of each column.

* Click **Confirm** and wait for the process to complete.

Dataiku DSS has inferred the storage type of the *latitude* and *longitude* columns as “double” and the *fico\_score* and *age* columns as “bigint (64 bit)”.

* The default dataset name *cardholder\_info* is fine, so click **Save** to create the dataset.
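
The type inference step above can be pictured with a small sketch. This is not how Dataiku DSS implements **Infer Types from Data**; it is a minimal illustration, assuming a column is "bigint" when every non-empty sample value parses as an integer, "double" when every value parses as a float, and "string" otherwise. The function name `infer_storage_type` and the sample values are hypothetical.

```python
def infer_storage_type(values):
    """Guess a storage type for a column from a sample of its string values."""
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        return "string"  # nothing to infer from, keep the default

    def all_parse(cast):
        # True only if every non-empty value can be converted with `cast`.
        for v in non_empty:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "bigint"
    if all_parse(float):
        return "double"
    return "string"


# Made-up sample values; the real file's contents are not reproduced here.
sample = {
    "latitude": ["40.71", "34.05", ""],
    "fico_score": ["720", "655", "801"],
    "card_id": ["C_ID_01", "C_ID_02", "C_ID_03"],
}
inferred = {name: infer_storage_type(col) for name, col in sample.items()}
```

Because the guess is based on a sample, a value outside the sample can still contradict the inferred type, which is one reason to review the schema before saving.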

### Sync the Datasets to a SQL Database

Moving datasets from Dataiku DSS into a database allows us to leverage in-database computation when performing certain tasks (e.g. rendering charts and executing recipes) in the project.

There are a few ways in Dataiku DSS to move datasets into a database. In this section, we’ll use one of them: the **Sync** recipe. Because we already inferred the storage type of each column when we imported the dataset, we don’t have to repeat this step. The Sync recipe will map each storage type in the input dataset to a similar type in the database (in this case, a Snowflake database).

Optionally, you can configure a connection between DSS and any supported SQL database, such as PostgreSQL. The article on Connections to SQL Databases provides more details about how to define a SQL connection.

* Return to the Flow, and click the *cardholder\_info* dataset to select it.

* Open the right panel, and click the **Actions** button.

* Select the **Sync** recipe.

* Keep the default output dataset name *cardholder\_info\_copy*.

* Store the output dataset into the **PC\_DATAIKU\_DB** Snowflake connection (or your configured SQL connection).

* Click **Create Recipe** to open the recipe’s **Configuration** page.

* Click **Run** to run the recipe.

* Return to the Flow and repeat the previous steps to sync the other input datasets using the output dataset names *transactions\_copy* and *merchant\_info\_copy*.
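
Conceptually, the Sync steps above copy rows into a database table whose column types are derived from the dataset's storage types. The sketch below imitates that idea with Python's built-in `sqlite3` module instead of Snowflake. The `TYPE_MAP` mapping, the `sync_to_sql` helper, and the rows are assumptions made for illustration, not the mapping or mechanism that DSS uses.

```python
import sqlite3

# Assumed mapping from storage types to SQLite column types (illustrative only).
TYPE_MAP = {"string": "TEXT", "bigint": "INTEGER", "double": "REAL"}


def sync_to_sql(conn, table, schema, rows):
    """Create `table` from (name, storage_type) pairs and copy `rows` into it."""
    columns = ", ".join(f'"{name}" {TYPE_MAP[stype]}' for name, stype in schema)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in schema)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database standing in for Snowflake
schema = [("card_id", "string"), ("fico_score", "bigint"), ("latitude", "double")]
rows = [("C_ID_01", 720, 40.71), ("C_ID_02", 655, 34.05)]  # made-up rows
sync_to_sql(conn, "cardholder_info_copy", schema, rows)
row_count = conn.execute('SELECT COUNT(*) FROM "cardholder_info_copy"').fetchone()[0]
```

Once the data lives in the database, later steps can be expressed as SQL and executed there rather than in the application.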

## Explore and Analyze Data

Dataiku DSS has many features that help you quickly explore a dataset. In this section, we’ll try out a few, starting with the actions that can be done in the Explore tab.

### The Explore Tab

Whenever you open a dataset from the Flow, Dataiku DSS displays the **Explore** tab where we can examine different views of a sample of the dataset, and even obtain some quick statistics on the data sample.

* In the Flow, double-click the *transactions\_copy* dataset to open it.

The Explore tab offers many options for examining your data sample, including selecting which columns to display, switching between table and column view, and getting quick stats for each column.

In this dataset, each row is a transaction. Beneath each column name are the storage type, the inferred semantic meaning, and a data quality bar.

For each column, the storage type tells you how the data is stored, and how many bytes are allocated for storage. The meaning gives a rich semantic label to the sample data in the column. Finally, the data quality bar tells you the proportion of records in your sample that are valid for the assigned meaning.

Using the *purchase\_date* column as an example, the values are stored as “string”, the meaning is determined to be “Date”, and the data quality bar is fully green because all the values in the sample are recognized as dates. You can click the meaning to change it or even define a new one.

If a proportion of the values in the sample do not match the detected meaning, a corresponding proportion of the quality bar will be red, while the color will be gray for missing values. Notice that the quality bar of the *authorized\_flag* column is partly gray, as the default data sample shown has some empty values for this column.
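
The proportions behind a quality bar can be sketched in a few lines. This is an illustration only: the `quality_bar` helper, the validity rule (a flag is valid when it is "0" or "1"), and the sample values are assumptions, not the checks DSS applies for its built-in meanings.

```python
def quality_bar(values, is_valid):
    """Return the share of valid, invalid, and empty values in a sample."""
    total = len(values)
    if total == 0:
        return {"valid": 0.0, "invalid": 0.0, "empty": 0.0}
    empty = sum(1 for v in values if v is None or v == "")
    valid = sum(1 for v in values if v not in (None, "") and is_valid(v))
    invalid = total - empty - valid
    return {
        "valid": valid / total,      # green part of the bar
        "invalid": invalid / total,  # red part of the bar
        "empty": empty / total,      # gray part of the bar
    }


# Made-up sample for a flag column: five valid, one invalid, two empty values.
sample = ["1", "0", "", "1", "maybe", "1", "", "1"]
bar = quality_bar(sample, lambda v: v in ("0", "1"))
```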

#### Configure the Dataset’s Sample

In this section, we’ll open the Sample settings of the *transactions\_copy* dataset to explore the available sampling methods and select a different sample that could provide a better understanding of the column.

When exploring and preparing data in Dataiku DSS, you always get immediate visual feedback, no matter the size of the dataset that you are manipulating. To achieve this, Dataiku works on a sample of your dataset.

* Click **Configure sample** to display the **Sample settings**.

The sampling method is set to the first 10,000 records by default. This is the fastest sampling method. However, let’s select another sampling method, which may be slower, but more representative of the data.

* Change the “Sampling method” to **Random (nb. records)**.

* Click **Save And Refresh Sample**.

* Click **Configure sample** again to close the **Sample settings**.

Now we notice that the *authorized\_flag* values are a mix of “1”, “0”, and empty values. The color quality bar for the column also reflects the proportions for the new sample.
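
To see why a random sample can be more representative than the first records, consider the following sketch. The data is made up so that the start of the column looks different from the rest; the helper names `first_records` and `random_records` are hypothetical and do not correspond to DSS functions.

```python
import random


def first_records(rows, n):
    """Fastest option: keep the first n records in storage order."""
    return rows[:n]


def random_records(rows, n, seed=42):
    """Slower but more representative: draw n records uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(rows, min(n, len(rows)))


# Made-up column: the first 1,000 flags are empty, the rest alternate 0 and 1.
flags = [""] * 1000 + ["1" if i % 2 else "0" for i in range(9000)]

head = first_records(flags, 100)
rand = random_records(flags, 100)

head_values = set(head)  # reflects only the start of the column
rand_values = set(rand)  # drawn from across the whole column
```

Reading the first records requires no scan of the full dataset, which is why it is fast, but it inherits whatever ordering the storage happens to have.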

#### Perform Quick Analysis of Columns

In this section, we’ll explore the *transactions\_copy* dataset by analyzing some of its columns.

Often you want to perform quick statistical analysis while exploring your data. Using the **Analyze tool** on a column shows the distribution and key metrics that can guide tasks, such as data cleaning or class rebalancing. These statistics can be calculated on the sample or the whole dataset.

* Click the *authorized\_flag* column name, and choose **Analyze** from the menu.

Dataiku DSS detects that this column contains categorical values and displays the percentage of valid, unique, invalid, and empty values for the data sample.

Tip

Clicking the **Numerical** tab will display the distributions, summary statistics, and value counts for the column if the values are interpreted as numerical values. Also, the **Values Clustering** tab provides options to implement fuzzy clustering of the values in the column. This can be useful when working on a text column that contains values varying slightly in their representation.

Furthermore, the percentages of the top values in the sample are displayed. This shows that the majority of the transactions in the sample are flagged as authorized (1) compared to the potentially fraudulent (0) transactions. The dataset is imbalanced — this is very common in machine learning. As we build our data pipeline and prepare the data for training a machine learning model, we’ll need to take this imbalance into account.
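
The top-value percentages described above amount to a frequency count. A minimal sketch, using made-up flag values (the 90/8/2 split is illustrative, not the tutorial dataset's actual distribution) and a hypothetical `top_values` helper:

```python
from collections import Counter


def top_values(values, k=3):
    """Return (value, count, percentage) for the k most frequent values."""
    counts = Counter(values)
    total = len(values)
    return [
        (value, count, round(100 * count / total, 1))
        for value, count in counts.most_common(k)
    ]


# Made-up sample: mostly authorized (1), a few unauthorized (0), some empty.
flags = ["1"] * 90 + ["0"] * 8 + [""] * 2
summary = top_values(flags)
```

A large gap between the most frequent class and the others, as in this made-up sample, is the kind of imbalance to keep in mind when training a classifier.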

When exploring the dataset, you can also click the column name, and select the **Filter** and **Sort** tools to modify the sample that you’re viewing. For instance, let’s say you want to view only the authorized transactions sorted by the *item\_category*.

* Click the *authorized\_flag* column name and choose **Filter** from the menu.

* Click the **Textual Facet Filter** to select it.

* Check the box indicating the value “1” as the value of the authorized flag.

The *authorized\_flag* column now has a Filter icon next to its name, and a filter box is displayed at the top of the table. You can click the dropdown arrow next to the filter box to expand it, and see its details. You can also click the buttons on the box to turn the filter on/off or to delete it.

* Click the *item\_category* column name and choose **Sort** from the menu. This sorts the column in ascending order.

* Click the Sort icon next to the column name to reverse the sort order.

You’ll notice that the *item\_category* column now has the Sort icon next to its name.

Let’s reset the view by deleting the Filter and Sort.

* Click the “delete” button next to the Filter box to reset the view.

* Click the *item\_category* column name and choose **Remove Sort** from the menu.
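
For readers who think in code, the Filter and Sort steps above correspond to a list filter followed by a sort. The rows below are made up for illustration.

```python
# Made-up rows with the two columns used in the steps above.
rows = [
    {"authorized_flag": "1", "item_category": "B"},
    {"authorized_flag": "0", "item_category": "A"},
    {"authorized_flag": "1", "item_category": "A"},
    {"authorized_flag": "", "item_category": "C"},
]

# Filter: keep only rows where authorized_flag equals "1".
authorized = [r for r in rows if r["authorized_flag"] == "1"]

# Sort: ascending by item_category, then the reversed order.
ascending = sorted(authorized, key=lambda r: r["item_category"])
descending = sorted(authorized, key=lambda r: r["item_category"], reverse=True)
```

Like the view in the Explore tab, this changes only what is displayed; the underlying `rows` list is left untouched.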

### The Charts Tab

The Charts tab allows you to create visualizations that are saved along with your dataset. You can use charts to explore a dataset. For example, let’s create a simple chart that shows the average *purchase\_amount* for each *merchant\_category\_id* broken down by the *item\_category*.

* Click the **Charts** tab.

* From the panel on the left, drag and drop *purchase\_amount* as the Y variable, *merchant\_category\_id* as the X variable, and *item\_category* for grouping.

* Click the Bar chart icon on the page to select **Stacked** bar from the bar chart options.

Dataiku DSS shows a stacked bar chart of the average purchase amount by *merchant\_category\_id* broken down by the *item\_category* for the current sample.
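
The numbers behind such a chart are a grouped average. The sketch below computes the average of a Y column for each (X, group) pair; the `average_by` helper and the rows are hypothetical.

```python
from collections import defaultdict


def average_by(rows, x, group, y):
    """Average column `y` for every (x, group) combination."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, row count]
    for row in rows:
        key = (row[x], row[group])
        sums[key][0] += row[y]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up transactions.
rows = [
    {"merchant_category_id": 1, "item_category": "A", "purchase_amount": 10.0},
    {"merchant_category_id": 1, "item_category": "A", "purchase_amount": 20.0},
    {"merchant_category_id": 1, "item_category": "B", "purchase_amount": 30.0},
    {"merchant_category_id": 2, "item_category": "A", "purchase_amount": 5.0},
]
averages = average_by(rows, "merchant_category_id", "item_category", "purchase_amount")
```

Each key in `averages` corresponds to one colored segment of a bar: the first element selects the bar, the second the segment.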

#### Configure the Chart’s Engine and Sample

You can specify the execution engine to be used for creating charts on your dataset. Dataiku DSS will automatically suggest an engine based upon the dataset and sampling settings. The DSS engine is available for all dataset types, while the **In-database** engine is available for some data sources that support SQL queries.

The chart we just created on the *transactions\_copy* dataset used the DSS engine. However, because the dataset is in a Snowflake database, we can change the computation engine to “in-database” so that the computation is done using SQL.

* Click the **Sampling & Engine** tab from the left side panel.

Notice that by default, the chart uses the DSS execution engine and the same sample configured in the Explore tab. By unchecking the selection for the sample, you can see options for specifying a different sample for the chart.

* From the “Execution engine” dropdown menu, select **In-database**. This computation engine will use the full dataset to create the chart.

* Click **Save** and wait to see the updated chart.

You can visit the product documentation to learn more about the Charts tab including the different kinds of charts that you can create.
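
With an in-database engine, the aggregation behind the chart is expressed as SQL and computed by the database over the full table. The sketch below shows the general shape of such a query using Python's built-in `sqlite3` module as a stand-in for Snowflake. The query text and the rows are assumptions for illustration, not the SQL that DSS generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the SQL connection
conn.execute(
    "CREATE TABLE transactions_copy ("
    "merchant_category_id INTEGER, item_category TEXT, purchase_amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions_copy VALUES (?, ?, ?)",
    [(1, "A", 10.0), (1, "A", 20.0), (1, "B", 30.0), (2, "A", 5.0)],  # made-up rows
)

# The database computes the averages; only the small result set is returned.
query = """
SELECT merchant_category_id, item_category, AVG(purchase_amount) AS avg_purchase_amount
FROM transactions_copy
GROUP BY merchant_category_id, item_category
ORDER BY merchant_category_id, item_category
"""
result = conn.execute(query).fetchall()
```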

#### Publish the Chart to a Dashboard

Using Dataiku DSS, you can share visual insights with other stakeholders. For example, a project owner could configure a project dashboard so that it displays on the homepage for those who already have access to the project.

Note

Visit the AI Consumer Quick Start–Consume Insights in a Dashboard to learn how you can share elements of your data project with other users, including ones who may not have full access to your project.

To publish the chart:

* In the upper right corner of the chart, click **Publish**.

* In the dialog box that opens, keep the default selections to add the chart to the existing dashboard.

* Click **Create**.

DSS creates the insight and adds it to the existing dashboard.

* Resize the chart on the slide by dragging the handles.

* **Save** your changes.

You can interact with the chart in the **View** tab (near the top right of your window). You can also add a description to the dashboard by going to the **Summary** tab.

Note

You can also build more advanced visualizations using code, or by sharing a dataset through a plugin with a dedicated visualization tool like PowerBI or Tableau. To find out more, visit Visualization Plugins.

### The Statistics Tab

In the Statistics tab, you can create worksheets that provide a dedicated interface for performing exploratory data analysis (EDA) on datasets. Using this feature, you can:

* Summarize or describe data samples, e.g. using univariate analysis, bivariate analysis, distribution & curve fitting, and correlation matrices.

* Draw conclusions from a sample dataset about an underlying population, e.g. using hypothesis testing.

* Visualize the structure of the dataset in a reduced number of dimensions, using principal component analysis.

Let’s say we have some information that the mean of the *purchase\_amount* is around 230. We can test the null hypothesis that the mean of *purchase\_amount* is 230 by going to the Statistics tab of the dataset.

* Return to the Flow and open the *transactions\_copy* dataset.

* Click the **Statistics** tab.

* Click **Create Your First Worksheet**, and select **Statistical tests**.

* Click **Student t-test**.

* Specify *purchase\_amount* as the “Variable” and `230` as the “Hypothesized mean”.

* Click **Create Card**, and wait for the Card to be displayed.

The card displays a summary of the *purchase\_amount* variable, including its mean, the tested hypothesis, results of the test, and a plot of the distribution for the test statistic. The card also displays a conclusion about the test — in this case, the test is inconclusive.
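
For reference, the test statistic of a one-sample Student t-test is t = (sample mean − hypothesized mean) / (s / √n), with n − 1 degrees of freedom, where s is the sample standard deviation. The sketch below computes it with the standard library on made-up amounts; the `one_sample_t` helper is hypothetical. Turning the statistic into a p-value requires the t distribution, which a library such as SciPy (`scipy.stats.ttest_1samp`) can supply.

```python
import math
import statistics


def one_sample_t(values, hypothesized_mean):
    """Return the t statistic and degrees of freedom for a one-sample t-test."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    t_stat = (mean - hypothesized_mean) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Made-up purchase amounts whose mean is exactly 230.
amounts = [228.0, 231.0, 229.0, 232.0, 230.0]
t_stat, dof = one_sample_t(amounts, 230)
```

A statistic close to zero, as in this made-up sample, means the data gives no evidence against the hypothesized mean.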

Tip

You can select a different sampling setting for the data used in the statistics card by clicking the **Sampling and filtering** dropdown at the top of the worksheet. You can also change the **Confidence level** from the top of the worksheet, thereby changing the significance level for the tests in the worksheet.

You can visit the product documentation to learn more about interactive statistics in Dataiku DSS.

Finally, the other tabs of the dataset are also useful for understanding the data: the **Status** tab allows you to compute metrics, checks, and statistics on the dataset; the **History** tab shows a Git history of tasks that you’ve performed on the dataset; and the **Settings** tab shows details about the dataset’s connection, schema, and more.

## Transform Data — Join

One of our goals is to build a machine learning model that can be used to predict if a credit card transaction is fraudulent or not. To do this, we’ll need to feed our model variables (or features) about these transactions. Before we can do this, we need to create a dataset that combines all of the information from the different input datasets.

In this section, we’ll use a visual recipe, the **Join** recipe, to combine the input datasets. We’ll also show how to use a code recipe in the Flow.

### Join Datasets

The **Join** recipe in Dataiku DSS allows you to perform inner joins, left joins, full outer joins, etc. You can define the Join conditions using the visual Join interface or SQL.

In this section, we’ll join the *transactions\_copy*, *merchant\_info\_copy*, and *cardholder\_info\_copy* datasets.

* Return to the Flow and click the *transactions\_copy* dataset to select it.

* Open the right panel and choose **Join with…** from the “Visual recipes” section.

Dataiku DSS displays the **New join recipe** window.

* Select *cardholder\_info\_copy* as the second dataset.

* Name the output dataset `transactions_joined`.

* Store the output dataset into the **PC\_DATAIKU\_DB** Snowflake connection (or your configured SQL connection).

* Click **Create Recipe**.

Dataiku DSS displays the Settings for the Join step. We see that Dataiku DSS has selected a Left join by default and detected the columns on which to join: *card\_id* from both the *transactions\_copy* dataset and the *cardholder\_info\_copy* dataset.

Tip

Sometimes, Dataiku DSS may guess the wrong columns to use for the Join. In such cases, you can click the “equality sign” to open the Join window, and then specify the correct columns.

* Click **+Add Input** to add a “New input dataset”, *merchant\_info\_copy*.

* Click **Add Dataset**. Dataiku DSS has detected the correct columns on which to perform the join.

Next, we’ll review the selected columns and select the columns to include in the output dataset.

* Click the **Selected columns** step in the left panel.

* In the column for the *cardholder\_info\_copy* dataset, add the prefix `card`.

* In the column for the *merchant\_info\_copy* dataset, add the prefix `merchant`, and then select the *merchant\_latitude* and *merchant\_longitude* columns.

* **Save** the recipe. Dataiku warns you about a change in the schema of the output dataset.

Note

* **Schema change popup warning message**: When building out a project, the schema of datasets in the Flow often changes, and when it does, you’ll receive notifications to update the schema. This is expected. When a Flow is deployed to production, though, a schema change could have downstream consequences, and so these warnings serve as helpful alerts.

* Click **Update Schema** to accept the schema change.

Notice that Dataiku DSS has selected the “In-database” execution engine to run this recipe because its input and output datasets are all SQL-based, and also because the Join is a SQL-compatible operation.

* **Run** the recipe to build the output dataset, *transactions\_joined*.

* When the Job has successfully completed, click **Explore dataset transactions\_joined**.

* Return to the Flow. The Flow now contains the icon for the visual Join recipe and its output dataset.
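
The behavior of a left join with a column prefix can be sketched in plain Python. Every left-hand row is kept; columns from the right-hand dataset are added with the prefix, and they are empty when no match exists. The `left_join` helper and the rows are hypothetical, and the `first_active_month` value format is an assumption.

```python
def left_join(left_rows, right_rows, key, prefix):
    """Left-join two lists of dicts on `key`, prefixing the right-hand columns."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    right_columns = [c for c in (right_rows[0] if right_rows else {}) if c != key]

    joined = []
    for row in left_rows:
        # A left row without a match is still emitted once, with None values.
        for match in index.get(row[key], [None]):
            out = dict(row)
            for col in right_columns:
                out[f"{prefix}_{col}"] = None if match is None else match[col]
            joined.append(out)
    return joined


# Made-up rows; C9 has no cardholder record.
transactions = [
    {"card_id": "C1", "purchase_amount": 10.0},
    {"card_id": "C9", "purchase_amount": 5.0},
]
cardholders = [{"card_id": "C1", "first_active_month": "2017-06"}]
joined = left_join(transactions, cardholders, "card_id", "card")
```

The prefix keeps columns from different inputs distinguishable in the output, for example `card_first_active_month` here.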

#### Computation Engines

Let’s talk briefly about computation engines in Dataiku DSS. In DSS, computations can be done in various places, such as in-memory, in-database, in Kubernetes/Docker, etc. The choice of the **Computation engine** used when performing operations in Dataiku DSS depends on the datasets used and the operation that is being applied to them.

For instance, the input datasets of the Join recipe in our Flow are stored in a SQL database. When we run the Join recipe in-database, Dataiku DSS sends a query to the SQL database, which reads the input datasets, performs the join, and writes the output dataset (provided the output is also a SQL dataset). This way, all the computation is performed in-database. This architecture helps to reduce the risk of computing for too long or running out of memory, issues that can arise when large datasets are loaded and processed in memory. The article Where does it all happen? provides additional details about computation engines.

Finally, DSS can scale most of its processing by pushing down computation to Elastic computation clusters powered by Kubernetes. See the product documentation on Elastic AI computation for more information.

Next, let’s take a look at code-based recipes in Dataiku DSS and see how to create one for use in our Flow.

#### Convert Visual Join Recipe to Code Recipe

Apart from visual recipes, such as the Join recipe that we just saw, Dataiku DSS has code-based recipes. These recipes allow you to execute pieces of code that are defined using languages such as Python, R, SQL, etc. We’ll make use of a SQL recipe in this section of the tutorial.

We could create a SQL recipe from scratch by selecting a dataset from the Flow and selecting the **SQL** recipe from the “Code recipes” section of the right panel. However, let’s create the SQL recipe by converting the visual Join recipe that we just used into a code-based one.

* From the Flow, double click the visual Join recipe to open its Settings page.

* From the left side panel, go to the **Output** step.

At the Output step, there are buttons to **View Query** and **Convert to SQL Recipe**. These options are available because the input datasets are in a SQL database. Let’s first view the query.

* Click **View Query** to see the SQL query that Dataiku DSS generated from the specifications we made in the user interface.

* Click **Convert to SQL Recipe**.

Dataiku DSS warns you that this action is irreversible and will prevent you from further using the visual editor for this recipe.

* Click **Confirm** to proceed.

Dataiku DSS opens up the Code recipe. Here, you can edit the SQL query if needed and run it to propagate your changes. For this tutorial, we won’t be modifying the code.

* Return to the Flow. The Visual join recipe icon has now changed to a SQL recipe icon.
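
To give a feel for what a SQL version of the join might contain, the sketch below runs a hand-written query with two left joins and prefixed output columns, using Python's built-in `sqlite3` module as a stand-in for Snowflake. This is not the query that DSS generates: the query text, the `merchant_id` join key, the column subset, and the rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the SQL connection
conn.executescript(
    """
    CREATE TABLE transactions_copy (card_id TEXT, merchant_id TEXT, purchase_amount REAL);
    CREATE TABLE cardholder_info_copy (card_id TEXT, fico_score INTEGER);
    CREATE TABLE merchant_info_copy (merchant_id TEXT, latitude REAL, longitude REAL);
    INSERT INTO transactions_copy VALUES ('C1', 'M1', 10.0), ('C2', 'M9', 20.0);
    INSERT INTO cardholder_info_copy VALUES ('C1', 720);
    INSERT INTO merchant_info_copy VALUES ('M1', 40.5, -73.5);
    """
)

# Left joins keep every transaction; unmatched columns come back as NULL.
query = """
SELECT t.card_id,
       t.purchase_amount,
       c.fico_score AS card_fico_score,
       m.latitude   AS merchant_latitude,
       m.longitude  AS merchant_longitude
FROM transactions_copy t
LEFT JOIN cardholder_info_copy c ON t.card_id = c.card_id
LEFT JOIN merchant_info_copy m ON t.merchant_id = m.merchant_id
ORDER BY t.card_id
"""
result = conn.execute(query).fetchall()
```

Use **View Query** in the recipe to see the actual generated SQL, which will differ from this sketch.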

## Transform Data — Prepare[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#transform-data-prepare "Permalink to this headline")

Now that we’ve joined our input datasets, we’ll continue transforming our data to create features that will be used by our machine learning model. Specifically, we’ll apply date formatting, ID handling, and geographic processing. We’ll accomplish all of these in a single recipe, the **Prepare** recipe.

### Format Dates[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#format-dates "Permalink to this headline")

The **Prepare** recipe in Dataiku DSS comes with many processors that you can use to create data cleansing, normalization, and enrichment scripts in a visual and interactive way. You can apply the Prepare recipe to any dataset in the Flow.

In this section, we’ll apply date formatting and calculations to several columns in our joined dataset. We’ll also use a formula to create a new column.

#### Create the Prepare Recipe[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#create-the-prepare-recipe "Permalink to this headline")

* In the Flow, click the *transactions\_joined* dataset once to select it.

* Open the right-side panel, and choose the **Prepare** recipe from the Visual recipes.

* Name the output dataset `transactions_joined_prepared`.

* Store the output dataset into the **PC\_DATAIKU\_DB** Snowflake connection (or your configured SQL connection).

* Click **Create Recipe**.

Dataiku DSS displays the **Script** tab of the *compute\_transactions\_joined\_prepared* recipe.

#### Parse Dates[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#parse-dates "Permalink to this headline")

Our first task is to parse (or format) the dates of the *card\_first\_active\_month* column to a standard date format so that we can make calculations with it.

* Go to the column *card\_first\_active\_month*.

To easily find columns, especially when working in a dataset with many columns, you can press **C** on your keyboard to display the column search, then start typing *card* to search for the *card\_first\_active\_month* column.

* Click the column header to view the dropdown menu.

Dataiku DSS displays suggested actions that are contextual — that is, based on the meaning detected from the sample of each column. In this case, DSS has detected unparsed dates in this column. Therefore, one of the suggested actions is “Parse date”.

* Click **Parse date**.

DSS has detected two possible formats for this date column. We could even add our own custom format here.

* Click **Use Date Format** to parse the date using the first detected format “yyyy-MM”.

DSS adds a new column, and names it *card\_first\_active\_month\_parsed* by default. To parse the *card\_first\_active\_month* column in place (i.e., replace our original column of values), we’ll leave the “Output column” field blank in the left side panel.

* Delete the column name `card_first_active_month_parsed` from the “Output column” field in the script.

Tip

In the visual data preparation, all transformation steps that you define are executed on a *sample* of your data immediately so that you can preview the results. When you run the visual recipe to insert your preparation script into your Flow, DSS chooses the best computation strategy (e.g. in-database (SQL) or Spark) that is available to process the whole input dataset.

* In the same way, parse the date in the *purchase\_date* column in-place, selecting the **yyyy-MM-dd HH:mm:ss** date format.
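For readers who think in code, the two parsing steps can be approximated in plain Python. This is only a sketch: the `strptime` patterns are assumed equivalents of the “yyyy-MM” and “yyyy-MM-dd HH:mm:ss” formats, the helper names are hypothetical, and the sample values are made up.

```python
from datetime import datetime

# Assumed Python equivalents of the two formats selected in the Prepare recipe.
CARD_MONTH_FORMAT = "%Y-%m"             # "yyyy-MM"
PURCHASE_FORMAT = "%Y-%m-%d %H:%M:%S"   # "yyyy-MM-dd HH:mm:ss"


def parse_card_month(value: str) -> datetime:
    """Parse a year-month string; the day defaults to the first of the month."""
    return datetime.strptime(value, CARD_MONTH_FORMAT)


def parse_purchase_date(value: str) -> datetime:
    """Parse a full timestamp string."""
    return datetime.strptime(value, PURCHASE_FORMAT)


# Sample values are invented for illustration only.
first_active = parse_card_month("2017-03")
purchase = parse_purchase_date("2017-06-25 15:33:07")
```

A value that does not match the expected pattern raises `ValueError` here, whereas the visual recipe lets you inspect unparsed values in the sample preview.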

#### Extract Date Components[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#extract-date-components "Permalink to this headline")

Now that we have parsed the date in *purchase\_date*, we can extract date components from it.

* Go to the parsed column *purchase\_date*. Remember we parsed in-place, so the contents of *purchase\_date* are now parsed.

* Click the column header to view the dropdown menu and suggested actions.

Based on the values in the column, Dataiku DSS displays suggested actions that include “Extract date components”.

* Click **Extract date components**.

* In the Script, define the output columns by setting the:

+ “Year” column to `purchase_year`.

+ “Month” column to `purchase_month`.

+ “Day” column to `purchase_day`.

+ “Day of week” column to `purchase_dow`.

+ “Hour” column to `purchase_hour`.

* Click the *Extract date* step in the left side panel to collapse it.
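As a rough code analogue of the “Extract date components” step, the sketch below derives the same five columns from a parsed timestamp. It assumes the day-of-week numbering runs from Monday = 1 to Sunday = 7, which is consistent with this tutorial treating Saturday and Sunday as 6 and 7; the function name is hypothetical.

```python
from datetime import datetime


def extract_date_components(purchase_date: datetime) -> dict:
    """Return the date components under the column names used in this tutorial."""
    return {
        "purchase_year": purchase_date.year,
        "purchase_month": purchase_date.month,
        "purchase_day": purchase_date.day,
        # isoweekday() numbers Monday as 1 and Sunday as 7.
        "purchase_dow": purchase_date.isoweekday(),
        "purchase_hour": purchase_date.hour,
    }


# 25 June 2017 was a Sunday; the timestamp itself is an invented sample.
components = extract_date_components(datetime(2017, 6, 25, 15, 33, 7))
```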

#### Compute a Column using a Formula[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#compute-a-column-using-a-formula "Permalink to this headline")

Next, we want to create a column, *purchase\_weekend*, using Dataiku’s formula language. This column will identify which credit card transactions occurred on a weekend.

* Click **+ Add a New Step** in the Script.

Dataiku DSS displays the Processors library. The product documentation contains a reference of processors available in the Prepare recipe.

* In the Processors library, search for and select the “Formula” processor.

* Name the Output column `purchase_weekend`.

* Click **Open Editor Panel**, and type the expression: `if(purchase_dow>5,1,0)`. As you type, Dataiku DSS provides suggestions for autocomplete, validates the formula, and shows a preview of the sample output.

The “Day of week” column *purchase\_dow* identifies Saturday and Sunday as 6 and 7. This expression labels those days as the weekend.

* Click **Apply**.

* Collapse the step by clicking on it.
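The formula is small enough to mirror directly in Python. The sketch below is only an illustration of the logic, with a hypothetical function name; the recipe itself uses Dataiku’s formula language.

```python
def purchase_weekend(purchase_dow: int) -> int:
    """Mirror of the formula if(purchase_dow>5,1,0): 1 for Saturday (6) and Sunday (7)."""
    return 1 if purchase_dow > 5 else 0


# Evaluate the flag for every day of the week, Monday (1) through Sunday (7).
flags = [purchase_weekend(dow) for dow in range(1, 8)]
```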

#### Compute the Time Difference Between Two Columns[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#compute-the-time-difference-between-two-columns "Permalink to this headline")

Now that we have parsed *card\_first\_active\_month* and *purchase\_date*, we can use the standardized dates to compute their time difference. This information could be useful in analyzing credit card fraud patterns and could be a useful feature for training our machine learning model.

* Go to the column *card\_first\_active\_month*, and click the header to view the dropdown menu.

* Click **Compute time since**.

* In the script, specify values for the parameters as follows, keeping or changing the default values as needed:

+ “Time since column”: `card_first_active_month`

+ “Until”: **Another date column**

+ “Other column”: **purchase\_date**

+ “Output time unit”: **Days**

+ “Output column”: `days_active`

* Collapse the step by clicking on it.
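The “Compute time since” configuration above amounts to subtracting one date from the other. The sketch below assumes the result is expressed in whole days; how the processor rounds partial days is not stated here, so treat that detail as an assumption.

```python
from datetime import datetime


def days_active(card_first_active_month: datetime, purchase_date: datetime) -> int:
    """Whole days elapsed from card activation until the purchase (assumed rounding)."""
    return (purchase_date - card_first_active_month).days


# Invented sample values: a card first active in March 2017, used in late June 2017.
elapsed = days_active(datetime(2017, 3, 1), datetime(2017, 6, 25, 15, 33, 7))
```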

Before running the recipe, notice that although the input and output tables are both in the same SQL connection, Dataiku DSS has selected the DSS Local stream to execute this recipe, rather than the In-database computation. To understand the reason for this,

* Click the engine icon next to “Local stream” to open the Recipe engine window.

Dataiku informs you that one of the processors used in the Prepare recipe’s script cannot be translated to a SQL query. Therefore the recipe cannot be fully run in-database.

* **Close** the Recipe engine window.

* Click **Run**, accepting the schema update.

* Wait for the job to complete.

You’ve seen how to use the Prepare recipe on time-based columns. However, working with time-based columns, or more generally, time series datasets in Dataiku DSS is not limited to the Prepare recipe. Dataiku DSS also has a time series preparation plugin and a time series forecast plugin that are useful for time series analysis. To learn about time series in Dataiku DSS, visit the product documentation on Time Series or check out the various knowledge base articles on time series.

#### Group Similar Steps Together[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#group-similar-steps-together "Permalink to this headline")

Since we want to continue working in this Prepare recipe to add further transformations, let’s group our steps so that they are easier to manage.

* In the Script, select all steps by clicking the “Actions” checkbox located above the first step.

* Click the **Actions** dropdown menu, and then choose **Group**.

* Name the group something like `Date Processing`.

We’ll continue working on this Prepare recipe in the next section.

### Compute Geographic Features[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#compute-geographic-features "Permalink to this headline")

Using the merchant and cardholder geographical data, we can compute the distance between card location and merchant location.

To accomplish this, we’ll use geographical processors available in the Prepare recipe: **Create GeoPoint**, **Reverse-geocode**, and **Compute distance**.

#### Create GeoPoints[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#create-geopoints "Permalink to this headline")

In this section, we’ll create merchant and cardholder GeoPoints. To make this process more efficient, we’ll first create our merchant GeoPoint, then copy the step, and use it to create our cardholder GeoPoint.

##### Create Merchant GeoPoint[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#create-merchant-geopoint "Permalink to this headline")

* In the Prepare recipe Script, click **+ Add a New Step**.

* In the processors library, search for `Create Geo`, and then choose **Create GeoPoint from lat/lon**.

* In the Script, define the configuration as follows:

+ Set the “Input ‘latitude’ column” to *merchant\_latitude*.

+ Set the “Input ‘longitude’ column” to *merchant\_longitude*.

+ Name the “Output ‘GeoPoint’ column” `merchant_location`.

##### Create Cardholder GeoPoint[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#create-cardholder-geopoint "Permalink to this headline")

* Click the **More options** menu (the ellipses) of the last step, and choose **Duplicate step**.

* In the new, duplicated step, define the output columns as follows:

+ Set the Input latitude column to *card\_latitude*.

+ Set the Input longitude column to *card\_longitude*.

+ Name the Output GeoPoint column `card_location`.

* Collapse the last step by clicking on it.
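Conceptually, a GeoPoint combines a latitude and a longitude into a single value. The sketch below assumes a WKT-style text representation of the form `POINT(longitude latitude)`; both that format and the function name are assumptions made for illustration, and the coordinates are invented.

```python
def create_geopoint(latitude: float, longitude: float) -> str:
    """Build a WKT-style point. WKT lists longitude (x) before latitude (y)."""
    return f"POINT({longitude} {latitude})"


# Invented coordinates for illustration.
merchant_location = create_geopoint(latitude=40.7128, longitude=-74.006)
```

Note the argument order: the inputs are given as latitude then longitude, while the text form puts longitude first.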

#### Apply Reverse GeoCoding[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#apply-reverse-geocoding "Permalink to this headline")

Let’s reverse-geocode our geopoint columns. This will allow us to add additional columns to our dataset including merchant and cardholder state.

Note

To use the Reverse-geocoding processor, you must first install the DSS plugin called “Reverse Geocoding”. If you’re using your own DSS instance or an instance on which you have permission to install plugins, follow the instructions in the Installing plugins page of the product documentation.

The Snowflake Partner Connect instance already comes with the necessary plugins installed.

* In the Script, click **+ Add a New Step**.

* In the processors library, search for `reverse` and select **Reverse geocoding**.

Dataiku DSS displays several output columns to configure, but we only need one.

* Set the Input column to *merchant\_location*.

* Go to **Output column for level 4 (region)**, and name it `merchant_state`.

* Similarly, we’ll reverse geocode our *card\_location*, setting the **Output column for level 4 (region)** to `card_state`.

* Collapse the last step by clicking on it.

#### Compute the Distance between two GeoPoints[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#compute-the-distance-between-two-geopoints "Permalink to this headline")

With our merchant and card location geopoints computed, we can compute the distance between them.

* In the Script, click **+ Add a New Step**.

* In the processors library, search for `compute distance`, and then select **Compute distance between geopoints**.

* In the Script, define the configuration as follows:

+ Set the “Distance between” column to **card\_location**, and choose **Another geopoint column** as the point to compare it with.

+ Set the “Other column” to **merchant\_location**.

+ Set the “Output distance unit” to **Miles**.

+ Name the “Output column” `merchant_cardholder_distance`.

* Collapse the last step by clicking on it.
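To get a feel for what a distance between two geopoints means numerically, here is a sketch using the haversine (great-circle) formula. This is a swapped-in, well-known approximation; the method the processor actually uses is not described in this tutorial, and the Earth radius constant is a rounded mean value.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # rounded mean Earth radius; an approximation


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# One degree of longitude along the equator, roughly 69 miles.
one_degree = haversine_miles(0.0, 0.0, 0.0, 1.0)
```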

#### Group the Geo Processing Steps Together[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#group-the-geo-processing-steps-together "Permalink to this headline")

Finally, let’s organize our preparation steps by grouping the geo-processing steps together.

* In the Script, select all the geo-processing steps by clicking their checkboxes.

* Click the **Actions** menu, and then choose **Group**.

* Name the group something like `Geo Processing`.

* **Save** the recipe, accepting the schema update.

* **Run** the recipe.

* Wait for the job to finish, then return to the Flow.

In the next section, we’ll split the *transactions\_joined\_prepared* dataset into one dataset for training and another for scoring. Then, we’ll be ready to build a machine learning model.

## Design and Train a Machine Learning Model[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#design-and-train-a-machine-learning-model "Permalink to this headline")

Congratulations! You have prepared the data, and now you are ready to build your machine learning model!

You might recall from the Getting Started section that our main goal is to build a machine learning (classification) model to classify transactions with empty values for the *authorized\_flag* column as potentially fraudulent or not. To build a classification model, we’ll be using Dataiku’s visual machine learning interface. But, first, we need to split the *transactions\_joined\_prepared* dataset.

### Split the Transactions for Modeling and Scoring[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#split-the-transactions-for-modeling-and-scoring "Permalink to this headline")

We’ll split the *transactions\_joined\_prepared* dataset into:

* A *transactions\_known* dataset that contains year 2017 transactions, for which the *authorized\_flag* is known. We will use this dataset to train the ML model.

* A *transactions\_unknown* dataset that contains year 2018 transactions, for which the *authorized\_flag* is unknown. We’ll use the trained ML model to predict the values for the *authorized\_flag*.

To split the dataset, we’ll be using the **Split** recipe.

#### Use a Split Recipe[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#use-a-split-recipe "Permalink to this headline")

To split the transactions into known and unknown datasets:

* From the Flow, click the *transactions\_joined\_prepared* dataset to select it.

* Open the right panel and select **Split** from the Visual recipes.

* In the “New Split Recipe” window, click **+Add** to create an output dataset.

* Name the dataset `transactions_known`, and store it into the **PC\_DATAIKU\_DB** Snowflake database (or your configured SQL connection).

* Click **Create Dataset**.

* Click **+Add** again, and add another output dataset named `transactions_unknown` into the Snowflake database.

* Click **Create Dataset**.

* Click **Create Recipe**.

Dataiku DSS displays the **Splitting** step of the Split recipe. To split the transactions, we’ll define a filter based on the *authorized\_flag* column, so that the transactions with empty values are placed in one dataset, while all other transactions are placed in another dataset.

* Click **Define filters**.

* Click **+ Add a Condition**.

* Define the filter to match rows that satisfy **all the following conditions** where *authorized\_flag* **is defined**.

* Be sure the output dataset for this conditional filter is set to *transactions\_known*. Dataiku DSS will put the remaining rows, where *authorized\_flag* is not defined, into the *transactions\_unknown* dataset.

* **Run** the recipe. Notice that Dataiku has selected the In-database computation engine for this recipe.

After performing the split, the *transactions\_known* dataset contains all transactions where the authorized flag is known (i.e., the transaction is either “1 - authorized” or “0 - fraudulent”). The *transactions\_unknown* dataset contains all transactions where the authorized flag is not known (i.e., empty).

* Return to the **Flow**.
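The filter used by the Split recipe can be expressed in a few lines of Python. The sketch below assumes that “is defined” means the value is neither missing nor an empty string; the `transaction_id` field and the sample rows are hypothetical.

```python
def is_defined(value) -> bool:
    """Treat None and empty or whitespace-only strings as 'not defined'."""
    return value is not None and str(value).strip() != ""


def split_transactions(rows):
    """Route rows with a defined authorized_flag to 'known', all others to 'unknown'."""
    known, unknown = [], []
    for row in rows:
        target = known if is_defined(row.get("authorized_flag")) else unknown
        target.append(row)
    return known, unknown


# Invented sample rows; note that a flag of 0 still counts as defined.
rows = [
    {"transaction_id": 1, "authorized_flag": 1},
    {"transaction_id": 2, "authorized_flag": 0},
    {"transaction_id": 3, "authorized_flag": None},
    {"transaction_id": 4, "authorized_flag": ""},
]
transactions_known, transactions_unknown = split_transactions(rows)
```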

### Design a Machine Learning Model[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#design-a-machine-learning-model "Permalink to this headline")

Dataiku DSS provides you with the flexibility of creating ML models from scratch using Jupyter notebooks that support programming languages like Python and R. In addition, you can use the visual ML interface with its built-in algorithms (from libraries such as scikit-learn, XGBoost, MLlib, Keras, and TensorFlow). Finally, you can even write or import custom algorithms for use in the visual ML interface of Dataiku DSS!

In this section, we’ll begin by showing how you can use a Jupyter notebook to implement machine learning in Dataiku DSS. Afterward, we will use the visual ML interface to show how to use built-in and custom algorithms to create ML models.

#### Using a Jupyter Notebook[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#using-a-jupyter-notebook "Permalink to this headline")

This section will briefly cover code notebooks in Dataiku DSS and how they provide the flexibility of performing experimental work using programming languages such as Python and R in Dataiku DSS.

Tip

Apart from Jupyter notebooks, Dataiku DSS also offers integrations with other popular **Integrated Development Environments (IDEs)** such as Visual Studio Code, PyCharm, Sublime Text 3, and RStudio.

Let’s explore the project’s *custom random forest classification* notebook.

* Click the Code icon (**</>**) in the top navigation bar. This takes you to the Notebooks page where you can see existing notebooks in your project, and create additional notebooks as needed.

* Click **custom random forest classification** to open the existing notebook.

This notebook imports functions from various scikit-learn modules and uses the `dataiku` API.

Note

In parts of DSS where you can write Python code (e.g., recipes and notebooks) the Python code interacts with DSS (e.g., to read, process, and write datasets) using the Python APIs of Dataiku DSS.

Dataiku DSS also has APIs that work with R and Javascript. See Dataiku APIs to learn more.

We’ve used a few of these Python APIs in this notebook. Some of these include:

* The `dataiku` package (in the first cell) exposes an API containing modules, functions, and classes that we can use to interact with objects in our project.

* The `Dataset` class (in the third cell) is used to create a `dataiku.Dataset` object.

* The `get_dataframe` method (in the third cell) is used to create a pandas DataFrame from the `dataiku.Dataset` object.

The Dataiku API is very convenient for reading in datasets regardless of their storage types.
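A minimal sketch of the reading pattern described above is shown here. The wrapper function is hypothetical and exists only to keep the example importable outside DSS; the `dataiku` package itself is available only inside a DSS notebook or recipe.

```python
def load_dataset_as_dataframe(dataset_name: str):
    """Read a DSS dataset into a pandas DataFrame (works only inside DSS)."""
    # Imported here because the `dataiku` package exists only in a DSS environment.
    import dataiku

    dataset = dataiku.Dataset(dataset_name)
    return dataset.get_dataframe()


# Inside a DSS notebook or recipe, usage would look like:
# df = load_dataset_as_dataframe("transactions_known")
```

The same two calls apply whether the dataset is stored in a SQL database or elsewhere, which is the convenience the notebook relies on.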

To implement the classifier, a subset of features from the *transactions\_known* dataset has been selected and preprocessed. These features are then used in a random forest classifier that implements grid search to find the optimal model parameter values. The F1 metric for the classifier is also computed.

Thereafter, the *transactions\_unknown* dataset is used to create the test set. This test set is then used to score the ML model and output the prediction for each transaction.

* Run the cells in the notebook to see the computed F1 metric for the model and the predictions for the test dataset (the last two columns in the *all\_preds* dataframe at the end of the notebook).

Note

You can also build your custom machine learning model (preprocessing and training) entirely within a code recipe in your Flow and use another code recipe for scoring the model. For an example that showcases this usage, see the sample project: Build a model using 5 different ML libraries.

When you’re done exploring the notebook used in this section, it is good practice to unload it so that you can free up RAM. To do this,

* Return to the Notebooks page and click the “X” next to the notebook’s name.

#### Using The Visual ML Interface[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#using-the-visual-ml-interface "Permalink to this headline")

The visual ML interface of Dataiku DSS is a one-stop-shop for preprocessing features, creating models, evaluating model performance, interpreting model behavior, comparing and retraining models, and more!

We want to train our machine learning model using the *transactions\_known* dataset so that we can later score the *transactions\_unknown* dataset.

Machine learning in Dataiku DSS is a two-step process. First, we explore models, design, train, and evaluate them in the **Lab** in an iterative way. Then, once we are satisfied with our best-performing model, we **Deploy** it from the lab to the Flow, where it appears as a **Saved model**.

Let’s get started!

* Return to the Flow, and then click the *transactions\_known* dataset once to select it.

* From the right-side panel, click the **Lab**.

* Under “Visual analysis”, choose **AutoML Prediction**.

* Specify to “Create prediction model on **authorized\_flag**”.

Note

When creating a predictive model, Dataiku allows you to create your model using **AutoML** or **Expert** mode.

In the **AutoML mode**, DSS optimizes the model design for you and allows you to choose from a selection of model types. You can later modify the design choices and even write custom Python models to use during training.

In the **Expert mode**, you’ll have full control over the details of your model by creating the architecture of your deep learning models, choosing the specific algorithms to use, writing your estimator in Python or Scala, and more.

* With “Quick Prototypes” selected, keep the default analysis name, and click **Create**.

We could simply go ahead and train the models that DSS has selected. However, let’s pause to first explore the Design tab and see the selections that were made by the AutoML tool so that we can modify the selections as needed.

* Click **Design** to see the Design tab. Here, you can go through the panels on the left side of the page to view their details. We’ll take a look at some of them.

To begin, the **Target** panel displays the proportions of classes in a sample of the *transactions\_known* dataset, and we see more evidence of the class imbalance — 90% of the target is in class 1 and 10% in class 0.

##### Configure the Train / Test Set[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#configure-the-train-test-set "Permalink to this headline")

* Click **Train/Test Set** to display the settings for the Train / Test Set.

Notice that the ML model will use a sample of the first 100,000 records split randomly so that 80% of the sample goes into the train set and the rest goes to the test set. Since we have a class-imbalance problem, the default sampling & splitting strategy isn’t optimal. Let’s try to improve it.

* Set the **Sampling method** to **Class rebalance (approx. nb. records)**

* Set the **Column** to use in rebalancing to `authorized_flag`.

* **Save** your changes.

Tip

If Dataiku DSS displays the error, “Invalid argument”, letting you know the column chosen for class rebalancing does not exist, then check to make sure that you spelled the column name correctly.
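To illustrate the idea behind class rebalancing, the sketch below downsamples each class to roughly the same number of records. This is one simple way to rebalance and is an assumption for illustration; the exact sampling procedure that DSS applies is not described in this tutorial.

```python
import random
from collections import defaultdict


def class_rebalance_sample(rows, column, approx_records, seed=0):
    """Draw about approx_records rows with an equal share taken from each class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[column]].append(row)
    if not by_class:
        return []
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    per_class = max(1, approx_records // len(by_class))
    sample = []
    for members in by_class.values():
        # A class with fewer rows than its share contributes everything it has.
        sample.extend(rng.sample(members, min(per_class, len(members))))
    rng.shuffle(sample)
    return sample


# Invented data mimicking a 90% / 10% imbalance.
rows = [{"authorized_flag": 1}] * 90 + [{"authorized_flag": 0}] * 10
balanced = class_rebalance_sample(rows, "authorized_flag", approx_records=20)
```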

##### Select an Evaluation Metric[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#select-an-evaluation-metric "Permalink to this headline")

* In the **Basic** section of the left side panel, click **Metrics**.

Notice that Dataiku DSS has chosen to optimize model hyperparameters for the AUC metric and optimize the threshold for scoring the target class according to the F1 Score metric. You can change these selections to use other metrics such as Accuracy, Cost matrix, etc. For example, we’ll change the selection so that we also optimize model hyperparameters for the F1 Score metric.

* Specify the value of “Optimize model hyperparameters for” as **F1 Score**.
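Optimizing a threshold for the F1 Score means trying candidate cut-offs on the predicted probabilities and keeping the one with the best score. The sketch below shows that idea with invented labels and probabilities; the function names and the candidate list are hypothetical, and DSS may search thresholds differently.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN), or 0.0 when there are no positives at all."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(y_true, probabilities, candidates):
    """Return (threshold, f1) for the candidate cut-off with the highest F1."""
    best = None
    for threshold in candidates:
        predictions = [1 if p >= threshold else 0 for p in probabilities]
        score = f1_score(y_true, predictions)
        if best is None or score > best[1]:
            best = (threshold, score)
    return best


# Invented labels and predicted probabilities for the positive class.
y_true = [1, 1, 1, 0, 0, 1]
probabilities = [0.9, 0.8, 0.55, 0.6, 0.2, 0.4]
threshold, score = best_threshold(y_true, probabilities, [0.3, 0.5, 0.7])
```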

##### Preprocess Features[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#preprocess-features "Permalink to this headline")

* In the **Basic** section of the left side panel, click the **Features handling** panel to view the preprocessing.

Notice that Dataiku has rejected a subset of these features that won’t be useful for modeling. For the enabled features, Dataiku already implemented some preprocessing:

* For the numerical features: Imputing the missing values with the mean and performing standard rescaling

* For the categorical features: Dummy-encoding
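The default handling listed above can be sketched in plain Python. The example assumes mean imputation followed by rescaling to zero mean and unit (population) standard deviation, and a simple one-column-per-category dummy encoding; the function names, the `is_` prefix, and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def impute_and_rescale(values):
    """Fill missing values (None) with the mean, then apply standard rescaling."""
    observed = [v for v in values if v is not None]  # assumes at least one value
    mu = mean(observed)
    filled = [mu if v is None else v for v in values]
    sigma = pstdev(filled)
    if sigma == 0:
        return [0.0 for _ in filled]
    return [(v - mu) / sigma for v in filled]


def dummy_encode(values):
    """Create one 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Invented sample values for illustration.
scaled = impute_and_rescale([1.0, None, 3.0])
encoded = dummy_encode(["CA", "NY", "CA"])
```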

Note

You could change the feature handling methods to one of the other available predefined options or fully customize the feature handling method for each feature by selecting **Custom preprocessing**. This will open up a code editor for you to write Python code for preprocessing the feature.

The Academy course on Custom ML Models covers custom preprocessing and modeling in the visual ML interface. You can also see the product documentation for how to write custom models.

For our design, we’ll keep the default feature handling methods.

Also, the **Feature generation** tab provides options for generating new features based on combinations between features and the interactions between them.

Finally, the **Feature reduction** tab provides options such as Principal Component Analysis and LASSO regression for reducing the dimensionality of the feature space.

##### Choose Algorithms[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#choose-algorithms "Permalink to this headline")

* Click the **Algorithms** panel to see the algorithms available for the design.

Dataiku DSS already selected the **Random Forest** and **Logistic Regression** algorithms for use. You can also enable any of the other available algorithms as desired or use custom algorithms.

Let’s implement a custom Naive Bayes algorithm, using Python code.

* Click **+ Add Custom Python Model** from the bottom of the models’ list.

Dataiku DSS adds a row for “Custom Python model” to the list of algorithms and displays a code sample that imports a classifier from scikit-learn to help you get started.

Note

The **Code Samples** button in the editor provides a list of some of the algorithms you can import from scikit-learn. You can also import a custom ML algorithm that was defined in the project’s library.

Instead of writing our custom algorithm here, let’s import it from the project library.

* Click **Save** to save your model design.

* Go to the code icon **</>** in the top navigation bar, and click **Libraries**.

The **python** folder in the library contains an existing “custom\_models.py” Python file that was pre-made for this project. This Python file implements a Naive Bayes model.

Note

Note that when defining custom algorithms for use in the visual ML interface, the code must follow some constraints depending on the backend you have chosen (in-memory or MLlib).

Here, we are using the Python in-memory backend. Therefore, the custom code must implement a classifier that has the same methods as a classifier in scikit-learn; that is, it must provide the methods `fit()`, `predict()`, and `predict_proba()` where applicable.

The Academy course on Custom Models in Visual ML covers how to use custom Python models in greater detail.
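To make the required interface concrete, here is a dependency-free toy classifier that exposes `fit()`, `predict()`, and `predict_proba()`. It is a hypothetical illustration of the method shapes only: it is not the project’s `NaiveBayesModel`, and a model meant for real use in DSS would more likely build on scikit-learn’s own classes and return NumPy arrays.

```python
from collections import Counter


class MajorityClassModel:
    """Hypothetical toy classifier: always predicts the most frequent training class."""

    def fit(self, X, y):
        counts = Counter(y)
        self.classes_ = sorted(counts)
        total = sum(counts.values())
        # Class frequencies observed during training, in the order of classes_.
        self.priors_ = [counts[c] / total for c in self.classes_]
        return self

    def predict_proba(self, X):
        # Every row receives the same probabilities: the training class frequencies.
        return [list(self.priors_) for _ in X]

    def predict(self, X):
        best = self.classes_[self.priors_.index(max(self.priors_))]
        return [best for _ in X]


# Invented training data: two rows of class 1 and one row of class 0.
clf = MajorityClassModel().fit([[0.1], [0.4], [0.9]], [1, 1, 0])
```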

Now, let’s return to the visual analysis to finish our model design specification.

* Click the Visual Analyses icon in the top navigation bar.

* Click the ML task **Quick modeling of authorized\_flag on transactions\_known**. This opens the Script of the ML task.

* Click **Models**.

* Click **Design** to see the design settings.

* Click **Algorithms** from the left side panel.

* Click the **Custom Python model** from the list of algorithms.

* Click the pencil icon to rename “Custom Python model” to `Custom Naive Bayes`.

* Delete the code in the editor and type:

```python
from custom_models import NaiveBayesModel

clf = NaiveBayesModel()
```

* Click **Save**.

##### Explore the Other Panels in the Design Tab[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#explore-the-other-panels-in-the-design-tab "Permalink to this headline")

Before leaving the Design page, explore other panels, such as **Hyperparameters**, where you can see the “Grid search” strategy being used.
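A grid search simply evaluates every combination of candidate hyperparameter values and keeps the best one. The sketch below shows the mechanics with a hypothetical grid and a stand-in scoring function; in practice the score would come from training and evaluating a model for each combination.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the parameter names are illustrative only.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8, 12]}


def toy_score(params):
    """Stand-in for a real validation metric; peaks at max_depth=8, n_estimators=100."""
    return -abs(params["max_depth"] - 8) - abs(params["n_estimators"] - 100) / 100


best_params, best_score = grid_search(param_grid, toy_score)
```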

Also, click the **Runtime environment** panel to see the selected code environment for the visual analysis.

Note

Dataiku DSS allows you to create an arbitrary number of **Code environments** to address managing dependencies and versions when writing code in R and Python. Code environments in Dataiku DSS are similar to the Python virtual environments. In each location where you can run Python or R code (e.g., code recipes, notebooks, and the visual ML interface) in your project, you can specify a code environment that includes the necessary packages for your code to run.

In our example, we’ve imported libraries from scikit-learn, which is already included in the DSS built-in code environment that is selected.

See Setting a Code Environment for details on how to set up Python and R environments and use them in Dataiku DSS objects.

### Train the ML Model[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#train-the-ml-model "Permalink to this headline")

Now that we’ve customized our design, let’s train the models.

* Click **Train**.

* Name the session `Customized models` and click **Train** again.

The **Result** page for the session opens up. Here, you can monitor optimization results for the models that provide them. Wait while the training completes; note that this may take a few minutes.

#### View Session Output[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#view-session-output "Permalink to this headline")

During training, the Result tab displays a graph of the evolution of the F1 Score metric during grid search. The grid search option isn’t available to the custom naive Bayes model. However, you can still see it listed along with the other models built during the session.

After training the models, the Result page shows the F1 Score metric for each trained model in this training session, allowing you to compare performance side by side. In this case, the Random Forest model is the highest-performing model based on the F1 Score metric.

Note

In the Result tab, Dataiku DSS keeps the history of all our trained models so that we can easily compare models side-by-side and reproduce results. This removes the burden of having to remember the feature selection methods used and model parameters specified alongside performance metrics.

### Assess the Model’s Performance[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#assess-the-model-s-performance "Permalink to this headline")

In this section, we’ll assess the performance of the Random Forest model.

#### View the Model Report[¶](https://knowledge.dataiku.com/latest/courses/quick-start/data-processing-and-ml/index.html#view-the-model-report "Permalink to this headline")

* Click **Random Forest (Customized models)** to open it.

Dataiku DSS displays the model’s Report page.

* Explore the model’s report content by clicking any of the items listed in the left side panel of the report page, such as any of the model’s interpretations, performance metrics, or model information.

As an example, the following figure displays the ROC curve plot, which shows the true positive rate vs. the false positive rate resulting from different cut-offs in the predictive model. The larger the area under the curve, the better the model’s performance.
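To make the construction of a ROC curve concrete, here is a minimal sketch that computes the (false positive rate, true positive rate) point for each cut-off and then the area under the curve with the trapezoidal rule. The labels and scores are hypothetical, and the helper names `roc_points` and `auc` are introduced only for illustration; they are not part of Dataiku DSS.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct cut-off, strictest first."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # cut-off above every score: nothing predicted positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under the piecewise-linear curve through the points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical true labels (1 = positive class) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
area = auc(points)
print(points, area)
```

A model that ranks every positive example above every negative one would reach an area of 1.0, while random ranking gives about 0.5.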

Note

You can automatically generate the trained model’s documentation by clicking the **Actions** button in the top right-hand corner of the page. Here, you can select the option to **Export model documentation**. This feature can help you easily document the model’s design choices and results for better information sharing with your team.

Note

You can visit the course, Machine Learning Basics, to learn more about visual machine learning and to get hands-on practice with lessons like Hands-On: Evaluate the Model.

We could iterate further on the model’s design by clicking **Models** in the breadcrumb at the top of the model’s Report page, going to the **Design** tab to change the specifications, and then retraining the models.

## Deploy the ML Model and Score a Test Set

Next, we’ll deploy the model with the highest F1 score (the Random Forest model) to the Flow, and use it to generate predictions for new data that the model has not seen.

### Deploy the Model to the Flow

Let’s deploy the Random Forest model to the Flow.

* From the Report page of the Random Forest model, click **Deploy** near the top right corner, and then click **Create**.

Dataiku DSS deploys the model to the Flow. Notice that there is now a training recipe (green circle), and a deployed model (green diamond) in the Flow.

In the next section, we’ll use the deployed model to score our unknown transactions.

Note

To learn more, visit the Machine Learning course on the Dataiku Academy (registration required).

### Score the Unknown Transactions Dataset

Now that we have a deployed model in the Flow, we can use it to generate predictions on new, unseen data. To do this, we’ll use the **Score** recipe. The Score recipe requires two inputs: a deployed model and new, unseen data (*transactions\_unknown*).

Let’s use the model to predict if the unknown transactions are fraudulent or not.

* From the Flow, select the deployed model, and add a **Score** recipe from the right panel.

* Choose *transactions\_unknown* as the input dataset.

* Name the output `transactions_unknown_scored`.

* Store the output dataset into the **PC\_DATAIKU\_DB** Snowflake connection (or your configured SQL connection).

* Click **Create Recipe**.

* Click **Run** and wait for the job to complete.

* Return to the Flow.

### Inspect the Scored Data

Let’s look at the scored data and review the predictions.

* Open the *transactions\_unknown\_scored* dataset, and observe the three new columns appended to the end.

+ *proba\_0* is the probability that a transaction is fraudulent.

+ *proba\_1* is the probability that a transaction is authorized.

+ *prediction* is the model’s prediction of whether the transaction is fraudulent (0) or authorized (1).
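The relationship between these three columns can be sketched as follows. The rows are hypothetical, and the fixed 0.5 threshold is an assumption made for illustration; the deployed model may apply a different cut-off to *proba\_1* when producing *prediction*.

```python
# Hypothetical scored rows; the two class probabilities sum to 1.
rows = [
    {"proba_0": 0.85, "proba_1": 0.15},
    {"proba_0": 0.30, "proba_1": 0.70},
]

THRESHOLD = 0.5  # assumption: the real cut-off used by the model may differ


def predict(row, threshold=THRESHOLD):
    """Return 1 (authorized) if proba_1 reaches the threshold, else 0 (fraudulent)."""
    return 1 if row["proba_1"] >= threshold else 0


for row in rows:
    # Sanity check that the two probabilities are complementary.
    assert abs(row["proba_0"] + row["proba_1"] - 1.0) < 1e-9
    row["prediction"] = predict(row)

predictions = [row["prediction"] for row in rows]
print(predictions)
```

With these values, the first row is flagged as fraudulent (0) and the second as authorized (1).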

## Next Steps

Congratulations! You have completed the tutorial! In a short amount of time, you were able to:

* import a CSV file and sync your datasets to a SQL database;

* perform exploratory data analysis;

* transform data and perform computations using the in-database execution engine;

* create a chart, publish it to a dashboard, and collaborate with team members;

* train customized machine learning models in a Jupyter notebook and the visual machine learning interface; and

* deploy a model to the Flow for use in predicting unseen data.

Tip

This quick start tutorial is only the tip of the iceberg when it comes to the capabilities of Dataiku DSS. To learn more, please visit the Academy, where you can find more courses, learning paths, and certifications to test your knowledge.

You now have a Dataiku Academy project on your Dataiku DSS instance. To control where it is stored and make it easy to locate later, you can move the project into a project folder. Anyone with the right permissions on the DSS instance can also find the project using the Global Search, which searches across the whole instance.

### Go Further

#### Package Your Project as a Reusable Application

By saving your project as a Dataiku application, you can package parts of your Flow into a recipe or package the entire project with a user interface, thereby providing an easy way for others to replicate processes performed in your project. To get hands-on practice with creating your own Dataiku application, visit the Dataiku Applications Tutorial.

#### Operationalization

In this tutorial, we worked on a project that is in development on the Design node. When you are ready to put a project into production, you use the Automation node. To find out more about operationalization, see the articles on the Operationalization page of the Knowledge Base.
