# Hands-On Tutorial: Discover the Lab

Having all of your work in the **Flow** can lead to overcrowding. The **Lab** is the place for experimentation and preliminary work. You can always decide to deploy work you do in the Lab to the Flow.

* In this tutorial, you are going to see how you can perform preliminary work in a dedicated environment called the Lab.

* More specifically, you will create a **Visual analysis** in the Lab.

* Of the three main tabs in a Visual analysis (Script, Charts, and Models), we’ll cover the first two. We’ll discuss the Models tab in the ML Practitioner learning path.

**Prerequisites**

* Hands-On Tutorial: Join Datasets

## Create a Visual Analysis

Returning to our Basics 103 project where we joined datasets, let’s create a visual analysis for the *customers\_orders\_joined* dataset.

* Select or open the *customers\_orders\_joined* dataset.

* From the Actions sidebar, click the blue **Lab** button.

* Click **New Analysis**.

Dataiku prompts you to specify a name for your analysis.

* Leave the default name *Analyze customers\_orders\_joined* and create the analysis.

## Interactively Prepare Your Data

Dataiku displays the Script tab for our visual analysis. The Script works the same as the Script in a Prepare recipe.

Hint

Screencasts at the end of sections mirror the actions described.

First, let’s parse the *birthdate* column.

* Open the *birthdate* column dropdown and select **Parse date**.

Based on the sample, Dataiku suggests possible date formats. It’s up to you to validate which one is correct.

* Choose the “yyyy/MM/dd” format.

* Leave the Output column in the Script step empty so the parsed date replaces the original *birthdate* column.

Using the customer’s birth date and the date they made their first order, we can compute the customer’s age when they made their first order.

* From the *birthdate* column dropdown, choose **Compute time since**.

* Choose “until” to be **Another date column**.

* Choose *first\_order\_date* to be the “Other column”.

* Change “Output time unit” to **Years**.

* Then edit the **Output column** name to `age_first_order`.
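As a rough illustration of what these two steps compute, here is a minimal Python sketch. It is not Dataiku's implementation: the function names and sample values are hypothetical, and the 365.25-days-per-year conversion is an assumption chosen for simplicity.

```python
from datetime import date, datetime


def parse_birthdate(text):
    """Parse a 'yyyy/MM/dd' string (Python pattern '%Y/%m/%d') into a date."""
    return datetime.strptime(text, "%Y/%m/%d").date()


def years_between(start, end):
    """Approximate elapsed years, assuming 365.25 days per year."""
    return (end - start).days / 365.25


# Hypothetical sample values, not taken from the tutorial dataset.
birthdate = parse_birthdate("1980/06/15")
first_order_date = date(2015, 6, 15)
age_first_order = years_between(birthdate, first_order_date)
print(round(age_first_order, 1))
```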

From the new *age\_first\_order* column header dropdown, select **Analyze** to check whether the distribution of ages looks reasonable. As it turns out, a number of outliers have ages well over 120. Since these outliers represent bad data, we’ll remove them.

* Within the Analyze dialog, click the **Actions** button.

* Choose to clear rows outside 1.5 IQR (interquartile range). This will set those values to missing.

Now the distribution looks more reasonable, but there are still a few suspicious values over 100. We can remove these by setting an upper bound limit. Close the Analyze dialog.

* In the Script, click the new step, **Clear values outside [x,y] in age\_first\_order**, to expand it.

* Set the upper bound to `100`.
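The logic behind these steps can be sketched in Python as follows. This is a simplified illustration, not Dataiku's code: the helper names are hypothetical, and the quartiles come from `statistics.quantiles`, whose interpolation method may differ from the one Dataiku uses.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds at k * IQR beyond the first and third quartiles."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clear_outside_range(values, lower=None, upper=None):
    """Replace values outside [lower, upper] with None (missing); rows are kept."""
    def in_range(v):
        return (lower is None or v >= lower) and (upper is None or v <= upper)

    return [v if v is not None and in_range(v) else None for v in values]


# Hypothetical ages; 150 stands in for a bad record.
ages = [20, 25, 30, 35, 40, 45, 150]
low, high = iqr_bounds(ages)
# Tighten the computed upper bound to 100, as in the last step above.
cleaned = clear_outside_range(ages, lower=low, upper=min(high, 100))
print(cleaned)
```

Note that the values are cleared rather than the rows removed, so every customer stays in the dataset with a missing age.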

Lastly, now that we’ve computed *age\_first\_order*, we won’t need *birthdate* or *first\_order\_date* anymore, so let’s remove them from the script.

* For both columns, open the column dropdown and select **Delete**.

Dataiku adds a **Remove** step to the script.

*The following video goes through what we just covered.*

### Leveraging the User Agent

Let’s now enrich the data by processing the *user\_agent* and *ip\_address* columns.

The *user\_agent* column contains information about the browser and operating system, and we want to pull this information out into separate columns so that it’s possible to use it in further analyses.

* Note that Dataiku DSS has inferred the meaning of the *user\_agent* column to be User-Agent.

* Accordingly, its column dropdown is able to suggest specific actions.

* From the *user\_agent* column dropdown, choose **Classify User-Agent**.

This adds a new step to the script and seven new columns to the dataset.
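To give a feel for the kind of information a user-agent string carries, here is a deliberately simplified Python sketch. It is not the classifier Dataiku uses: the substring rules, the function name, and the returned labels are assumptions for illustration, and real user-agent parsing handles many more cases.

```python
def classify_user_agent(user_agent):
    """Return a (brand, os) pair using naive substring rules."""
    # Order matters: Chrome also advertises "Safari/", and Edge also advertises "Chrome/".
    browser_rules = [
        ("Edg/", "Edge"),
        ("Edge/", "Edge"),
        ("MSIE", "IE"),
        ("Trident/", "IE"),
        ("Firefox/", "Firefox"),
        ("Chrome/", "Chrome"),
        ("Safari/", "Safari"),
    ]
    os_rules = [("Windows", "Windows"), ("Mac OS X", "MacOS"), ("Linux", "Linux")]

    brand = next((name for token, name in browser_rules if token in user_agent), "Other")
    os_name = next((name for token, name in os_rules if token in user_agent), "Other")
    return brand, os_name


chrome_on_windows = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
)
print(classify_user_agent(chrome_on_windows))
```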

For this tutorial, we are only interested in the browser and the operating system, so we will remove the columns we don’t need.

* To do this quickly, change from the Table view to the Columns view using the icon near the top right of the screen.

* Select all of the columns beginning with *user\_agent\_*, except *user\_agent\_brand* and *user\_agent\_os*.

* Click the **Actions** button and select **Delete**.

* Switch back to Table view.

*The following video goes through what we just covered.*

### Leveraging the IP Address

Dataiku DSS has inferred the meaning of the *ip\_address* column to be an IP address. Just like with *user\_agent*, we’ll have meaning-specific actions in the column dropdown.

* Open the column header dropdown for the *ip\_address* column.

* Select **Resolve GeoIP**.

This adds a new step to the script and seven new columns to the dataset that tell us about the general geographic location of each IP address.

For this tutorial, we are only interested in the country and GeoPoint (approximate longitude and latitude of the IP address).

* In the Script step, deselect *Extract country code*, *Extract region*, and *Extract city*.

* Finally, delete the *ip\_address* column because we won’t need it anymore.
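Conceptually, resolving an IP address means looking it up in a database that maps address ranges to locations. The Python sketch below shows that idea with a toy table. Everything in it is hypothetical: the ranges are documentation-only addresses, the country names and coordinates are made up, and the `POINT(longitude latitude)` text format for the GeoPoint is an assumption.

```python
import ipaddress

# Hypothetical lookup table on documentation-only ranges; locations are invented.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "Exampleland", (48.85, 2.35)),
    (ipaddress.ip_network("198.51.100.0/24"), "Samplestan", (40.71, -74.0)),
]


def resolve_geoip(ip_text):
    """Return the country and geopoint for an IP address, or None values if unknown."""
    address = ipaddress.ip_address(ip_text)
    for network, country, (lat, lon) in GEO_TABLE:
        if address in network:
            return {"country": country, "geopoint": f"POINT({lon} {lat})"}
    return {"country": None, "geopoint": None}


print(resolve_geoip("192.0.2.10"))
```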

*The following video goes through what we just covered.*

### Using Formulas

We can use the same Formulas that we use in a Prepare recipe.

Let’s create a new column to act as a label on the customers generating a lot of revenue. We’ll consider customers with a value of total orders in excess of 300 as “high revenue” customers.

* Click **+Add a New Step** in the script and select **Formula**.

* Type `high_revenue` as the output column name.

* Click **Open Editor Panel** to open the expression editor.

* Type `if(total_sum > 300, "True", "False")` as the expression.

* Select **Apply**.
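For readers more familiar with Python, the formula above behaves roughly like the following sketch. The function name and sample values are hypothetical, and handling of missing values is left out.

```python
def high_revenue(total_sum, threshold=300):
    """Mirror if(total_sum > 300, "True", "False"): returns a string label, not a boolean."""
    return "True" if total_sum > threshold else "False"


# Hypothetical order totals.
for total_sum in (120, 300, 450):
    print(total_sum, high_revenue(total_sum))
```

Because the comparison is strict, a customer with a total of exactly 300 is labelled "False".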

Note

The syntax for the **Formula Processor** can be found in the reference documentation.

## Visualize Your Data with Charts

Now let’s move from the Script tab to the **Charts** tab. At the end of this section, you’ll find a handy screencast that walks through all the steps in case you want to double-check your work.

### Popular User Agents

Since we extracted the browsers used by customers from *user\_agent*, it’s natural to want to know which browsers are most popular. A common way to visualize parts of a whole is with a pie or donut chart.

* On the Charts tab, select the chart type tool and choose **Donut**.

* Click and drag *user\_agent\_brand* to the **By** box, and **Count of records** to the **Show** box.

This shows that nearly 3/4 of customers who have placed orders use the Chrome browser. The donut displays the relative share of each browser to the total, but we’d like to include the OS in the visualization.

* Click the chart type tool again.

* Select **Vertical stacked bars**.

* Click and drag *user\_agent\_os* to the **And** box.

* Click on *user\_agent\_brand* to adjust the sorting from “Natural ordering” to “Count of records, descending”.

Adding *user\_agent\_os* gives us further insight into the data. As expected, IE and Edge appear only on Windows, and Safari appears only on MacOS. More revealing is that roughly twice as many customers use Chrome on MacOS as use Safari and Firefox combined. Chrome holds a similar lead over Firefox on Linux.
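The aggregations behind these two charts can be sketched in plain Python. This is only an illustration of the grouping logic, not how Dataiku computes charts. The records and function names are hypothetical; only the column names come from the tutorial.

```python
from collections import Counter

# Hypothetical records for illustration.
records = [
    {"user_agent_brand": "Chrome", "user_agent_os": "Windows"},
    {"user_agent_brand": "Chrome", "user_agent_os": "MacOS"},
    {"user_agent_brand": "Chrome", "user_agent_os": "Windows"},
    {"user_agent_brand": "Safari", "user_agent_os": "MacOS"},
]


def brand_shares(rows):
    """Donut chart: share of records per browser brand."""
    counts = Counter(r["user_agent_brand"] for r in rows)
    total = sum(counts.values())
    return {brand: count / total for brand, count in counts.items()}


def stacked_counts(rows):
    """Stacked bars: per-OS counts for each brand, brands sorted by count, descending."""
    brand_totals = Counter(r["user_agent_brand"] for r in rows)
    pairs = Counter((r["user_agent_brand"], r["user_agent_os"]) for r in rows)
    result = []
    for brand, _ in brand_totals.most_common():
        per_os = {os_name: n for (b, os_name), n in pairs.items() if b == brand}
        result.append((brand, per_os))
    return result


print(brand_shares(records))
print(stacked_counts(records))
```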

### Sales by Age and Campaign

There are a number of insights we can glean from the combined Haiku T-shirts data that we couldn’t from the individual datasets. For a start, let’s see if there is a relationship between a customer’s age, whether that customer is part of a Haiku T-Shirt campaign, and how much they spend.

* Click **+Chart** at the bottom center of the screen.

* From the chart type tool, choose the **Scatter plot** chart.

* Select *age\_first\_order* as the X axis, and *total\_sum* as the Y axis.

* Select *campaign* as the column to color bubbles in the plot.

* Select *count* as the column to set the size of bubbles.

* From the size dropdown to the left of the *count* field, change the Base radius from 5 to 1 to reduce overlapping bubbles.

The scatter plot shows that older customers, and those who are part of the campaign, tend to spend the most. The bubble sizes show that some of the moderately valued customers are those who have made a lot of small purchases, while others have made a few larger purchases.

### Sales by Geography

Since we extracted locations from *ip\_address*, it’s also natural to want to know where Haiku T-Shirt’s customers come from. We can visualize this with a map.

* Click **+Chart** to create a third chart.

* Choose the **Scatter map** plot.

* Select *ip\_address\_geopoint* as the **Geo** field.

* Select *campaign* as the column to color bubbles by.

* Select *total\_sum* as the column to set the size of bubbles.

* From the size dropdown, change the base radius from 5 to 2 to reduce overlapping bubbles.

This looks much better, and you can quickly get a feel for which customers are located where. If we then want to focus on the largest sales:

* Drag *total\_sum* to the **Filters** box.

* Click the number for the lower bound to edit it, and type `300` as the lower bound. This filters out of the map all customers who have spent less than 300.
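In code terms, this filter simply keeps the rows at or above the lower bound. A minimal Python sketch, with hypothetical sample rows and the assumption that the bound itself is inclusive:

```python
def filter_by_lower_bound(rows, column, lower):
    """Keep only rows whose value in `column` is at least `lower` (inclusive, by assumption)."""
    return [r for r in rows if r[column] is not None and r[column] >= lower]


# Hypothetical customers.
customers = [{"total_sum": 80}, {"total_sum": 300}, {"total_sum": 725}, {"total_sum": None}]
print(filter_by_lower_bound(customers, "total_sum", 300))
```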

*The following video goes through what we just covered.*

## Deploy work in the Lab to the Flow

When working on charts in a visual analysis, you are building charts with a **sample** of your data. You can change the sample in the **Sampling and Engine** tab in the left panel, but since Dataiku DSS has to re-apply the latest preparation each time, it will not be very efficient for very large datasets.
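The following Python sketch illustrates why a chart built on a sample can differ from one built on the whole data. It assumes a "first N rows" sample purely for illustration. The rows and helper names are hypothetical, and the sampling method actually in use depends on your settings.

```python
def head_sample(rows, n):
    """Take the first n rows, one possible sampling strategy."""
    return rows[:n]


def share_of(rows, column, value):
    """Fraction of rows where `column` equals `value`."""
    return sum(1 for r in rows if r[column] == value) / len(rows)


# Hypothetical data: campaign customers happen to be clustered at the top.
rows = [{"campaign": True}] * 8 + [{"campaign": False}] * 12

sample_share = share_of(head_sample(rows, 10), "campaign", True)
whole_share = share_of(rows, "campaign", True)
print(sample_share, whole_share)
```

If the data is ordered in some meaningful way, a chart on the sample can overstate or understate a category, which is one reason to view charts on the full output dataset.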

In addition, if you want to share these charts with your team on a dashboard, you will first need to **deploy your script**. Let’s deploy our script now.

* From any tab in the visual analysis, go to the top right corner of the screen and click on **Deploy Script**. A dialog appears to deploy the script as a Prepare recipe.

* Note that, by default, charts created in the Lab will be carried over to the new dataset so that you can view them on the whole output data, rather than a sample.

* Rename the output dataset `customers_labelled`.

* Click **Deploy** to create the recipe.

* Save the recipe and go to the Flow.

The white square with a dashed line means the instructions for building a dataset are now available in the Flow. The dataset is not yet built. Let’s build it.

* Open the dataset and see that it is empty. This is because we have not yet run the recipe to build the full output dataset.

* Click **Build**.

This opens a dialog that asks whether you want to build just this dataset (non-recursive) or reconstruct datasets leading to this dataset (recursive). Since the input dataset is up-to-date, a non-recursive build is sufficient.

* Click **Build Dataset** (leaving non-recursive selected).
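The difference between the two build modes can be pictured with a small dependency graph. The Python sketch below is conceptual only and does not reflect Dataiku's build engine: the `FLOW` mapping is a hypothetical simplification of this project, and the function name is invented.

```python
# Hypothetical Flow: each recipe output maps to the datasets its recipe reads.
# Source datasets (here "customers" and "orders") have no recipe, so they are absent as keys.
FLOW = {
    "customers_orders_joined": ["customers", "orders"],
    "customers_labelled": ["customers_orders_joined"],
}


def datasets_to_build(target, recursive=False, flow=FLOW):
    """List the datasets a build would run, upstream first."""
    if not recursive:
        return [target]
    ordered = []

    def visit(name):
        # Skip source datasets and anything already scheduled.
        if name not in flow or name in ordered:
            return
        for upstream in flow[name]:
            visit(upstream)
        ordered.append(name)

    visit(target)
    return ordered


print(datasets_to_build("customers_labelled"))
print(datasets_to_build("customers_labelled", recursive=True))
```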

While the job executes, you are taken to the detailed activity log.

* When the job completes, click **Output dataset** to view a sample of the output dataset.

Let’s configure the stacked bar chart to use the entire dataset.

* Go to the **Charts** tab of the *customers\_labelled* dataset.

* Click **Sampling & Engine** from the left panel.

* Deselect **Use same sample as explore**.

* Select **No sampling (whole data)** as the sampling method.

* Click **Save and Refresh Sample**.

*The following video goes through what we just covered.*

## What’s Next?

Congratulations! You successfully deployed a visual analysis script from the Lab to the Flow. Be sure to review the concept materials for a fuller discussion of the differences between the two.

Now that the *orders* and *customers* datasets are joined, cleaned, and prepared, you are ready to build a model to predict customer value. This is a task for the ML Practitioner learning path!
