# Hands-On Tutorial: Prepare Your Data

To complete this lesson, you can continue with the project you created in the Basics 101 course.

Alternatively, you can create a new starter project that picks up where Basics 101 left off. From the Dataiku homepage, select **+New Project > DSS Tutorials > Core Designer / Basics > Basics 102**.

Note

You can also download the starter project from this website and import it as a zip file.

At the end of the last hands-on lesson, we realized our categories of t-shirts needed to be consistently named. We can make this happen with a **Prepare** recipe.

## First Steps

Hint

In addition to the written instructions and screenshots, you’ll also find several short screencasts recording the actions described in each section.

Open the *orders* dataset. Next, click on the **Actions** button or the plus sign at the top-right of the screen to expand the right sidebar.

* In the **Actions** sidebar, choose **Prepare** from the section of Visual recipes.

When creating a recipe, you must provide an input dataset and the name of the output dataset that the recipe will produce.

* Accept the default output dataset name of *orders\_prepared*. Click **Create Recipe**.

Note

You can also set the value of “Store into” to decide where the output data will live. In this example, the output is written to the local filesystem. Alternatively, the output could be written to a relational database or a distributed filesystem if the infrastructure exists.

The **Prepare** recipe allows you to define a series of steps, or actions, to take on the dataset. The types of steps you can add to a Prepare recipe are wide-ranging and powerful. One example is reordering columns.

* Drag the *order\_id* column in front of the *pages\_visited* column. Note how a step describing this action is added to the recipe’s **Script**.
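For readers who think in code, the reordering step can be sketched in plain Python. This is only an illustration, not how Dataiku implements the step: the helper `move_before` and the starting column order are assumptions made for the example.

```python
def move_before(columns, name, target):
    """Return a new column order with `name` placed immediately before `target`."""
    # Take the column out of its current position...
    reordered = [c for c in columns if c != name]
    # ...and re-insert it just in front of the target column.
    reordered.insert(reordered.index(target), name)
    return reordered


# Hypothetical starting order; the real dataset's column order may differ.
columns = ["pages_visited", "order_id", "tshirt_category"]
new_order = move_before(columns, "order_id", "pages_visited")
print(new_order)
```

Like the recipe step, the sketch leaves the data untouched and changes only the order in which columns are presented.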

To standardize the categories of *tshirt\_category*, let’s recode the values.

* Click on the column name *tshirt\_category* and select **Analyze** from the dropdown menu.

The Analyze window provides a quick summary of the column based on a sample of the data. It also allows you to perform various data cleansing actions.

* Select **White T-Shirt M** and **Wh Tshirt M**. From the “Mass Actions” dropdown, choose **Merge selected**.

* Choose to replace the values with **White T-Shirt M** and click **Merge**.

* Repeat this process for other categories until six remain.

When all necessary replacements have been made, close the Analyze window and see that a “Replace” step has been added to the Prepare script.

Replacing the four values in this step affects 517 rows in the sample. We could have created this step explicitly in the script, but the **Analyze** dialog provides a quick and intuitive shortcut to build the step.
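Conceptually, a “Replace” step is a lookup from variant spellings to one canonical value. The sketch below shows the idea in plain Python, assuming rows are represented as dictionaries. Only the `Wh Tshirt M` pair comes from this tutorial; the second mapping entry, the sample rows, and the helper `standardize` are hypothetical.

```python
# Variant spelling -> canonical category name.
REPLACEMENTS = {
    "Wh Tshirt M": "White T-Shirt M",  # pair taken from the tutorial
    "Bl Tshirt M": "Black T-Shirt M",  # hypothetical pair, for illustration only
}


def standardize(rows, column, replacements):
    """Return recoded copies of `rows` and the number of rows that changed."""
    cleaned = []
    changed = 0
    for row in rows:
        value = row[column]
        # Values without a mapping are kept as they are.
        new_value = replacements.get(value, value)
        if new_value != value:
            changed += 1
        cleaned.append({**row, column: new_value})
    return cleaned, changed


# Hypothetical sample rows.
rows = [
    {"order_id": 1, "tshirt_category": "Wh Tshirt M"},
    {"order_id": 2, "tshirt_category": "White T-Shirt M"},
    {"order_id": 3, "tshirt_category": "Bl Tshirt M"},
]
cleaned, n_changed = standardize(rows, "tshirt_category", REPLACEMENTS)
print(n_changed, sorted({r["tshirt_category"] for r in cleaned}))
```

Counting the changed rows mirrors the figure Dataiku reports for the step, which is a quick way to check that a replacement touched about as many rows as expected.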

Notice that we are now in **Step preview** mode, which highlights any changes made by the selected script step. Here, the values shown in blue were modified by the “Replace” step.

After reviewing these changes, click on the **Disable preview** button in the top bar. You will now see your data as it will appear after processing.


## More Preparation

Now, let’s deal with the *order\_date* column. At this point, its storage type is “string”, but the meaning Dataiku infers for it is an unparsed date. Let’s parse it so that we can treat it as a proper date.

* Open the *order\_date* column dropdown and select **Parse date**.

* The **Smart Date** dialog shows the most likely formats for our dates and what the dates would look like once parsed, with some sample values from the dataset. In our case, the dates appear to be in `yyyy/MM/dd` format. Select this format, and see that a “Parse date” step has been added to the script.

By default, this step creates a new column *order\_date\_parsed*. Note how both its storage type and meaning are a date. We could leave the name of the output column empty in order to parse the column in place. Instead, let’s delete the original date column and rename the new column.

* Click on the *order\_date* column header dropdown. Choose **Delete**.

* Click on the *order\_date\_parsed* column header dropdown. Choose **Rename**. Give the name `order_date`.
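The parse, delete, and rename sequence above can be sketched in plain Python. The `yyyy/MM/dd` pattern corresponds to `%Y/%m/%d` in Python's `datetime.strptime`. The function name `parse_order_date` and the sample date are assumptions made for illustration; they are not part of the tutorial dataset.

```python
from datetime import datetime


def parse_order_date(row, fmt="%Y/%m/%d"):
    """Mimic the three steps: parse into a new column, delete the original, rename."""
    step = dict(row)  # work on a copy so the input row is left unchanged
    # 1. Parse the string into a new column, as the "Parse date" step does by default.
    step["order_date_parsed"] = datetime.strptime(step["order_date"], fmt).date()
    # 2. Delete the original string column.
    del step["order_date"]
    # 3. Rename the parsed column back to order_date.
    step["order_date"] = step.pop("order_date_parsed")
    return step


# Hypothetical sample row.
row = {"order_id": 1, "order_date": "2017/03/15"}
parsed_row = parse_order_date(row)
print(parsed_row["order_date"].isoformat())
```

A value that does not match the format makes `strptime` raise a `ValueError`, so in this sketch invalid dates fail loudly instead of passing through unparsed.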


Finally, let’s use a **Formula** step to compute the dollar value of each t-shirt order. Dataiku formulas are a very powerful expression language to perform calculations, manipulate strings, and much more.

This time, we will not add the step by clicking on a column header, but instead use the **processors library** which references all 100+ data preparation processors.

* Click the yellow **+Add a New Step** button near the bottom left of the page.

* Select **Formula** (you can search for it).

* Type `total` as the name of the new column.

* In the expression, type `tshirt_price * tshirt_quantity`. You can also select **Open Editor Panel** to use the advanced formula editor, which autocompletes column names.

* Click anywhere and see the new *total* column appear.

* Remove the columns *tshirt\_price* and *tshirt\_quantity* by clicking on the column header and choosing **Delete**.
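As a rough equivalent of the Formula step and the two deletions, here is a minimal Python sketch that computes `total` for one row and drops the input columns. The helper `add_total` and the sample values are hypothetical; the expression itself is the one used in the step.

```python
def add_total(row):
    """Return a copy of `row` with `total` added and the two input columns removed."""
    # Same expression as in the Formula step: tshirt_price * tshirt_quantity.
    total = row["tshirt_price"] * row["tshirt_quantity"]
    new_row = {
        key: value
        for key, value in row.items()
        if key not in ("tshirt_price", "tshirt_quantity")
    }
    new_row["total"] = total
    return new_row


# Hypothetical sample row.
order = {"order_id": 1, "tshirt_price": 20.0, "tshirt_quantity": 3}
order_with_total = add_total(order)
print(order_with_total)
```

The order of operations matters here just as it does in the recipe script: the total has to be computed before the columns it depends on are deleted.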

Recall that the data visible in the recipe is a preview of what your output dataset will look like based on a sample of data. With our data preparation finished, we must now run the recipe on the whole input dataset.

* Click **Run** in the lower-left corner of the page. Dataiku runs this recipe with its own engine, but depending on your infrastructure and the type of recipe, you can choose where the computation takes place.

When the job completes, click **Explore dataset orders\_prepared** to view the output dataset. You can also return to the Flow and see your progress.

Note

We can always change the name of our dataset later if we decide on a more descriptive name or want to apply a dataset naming convention. To rename a dataset, right-click the dataset to view the context menu or select the dataset in the Flow, and then choose **Rename** in the **Actions** sidebar. Dataiku will let you know if any manual adjustments are needed elsewhere in the Flow.

## Learn More

Congratulations on completing your first Prepare recipe! However, there’s much more data exploration and cleaning to be done. To continue this hands-on tutorial, see Hands-On Tutorial: Interactive Visual Statistics.
