# Hands-On Tutorial: Use SQL from a Python Recipe in Dataiku

SQL is the most pervasive way to make data analysis queries. However, advanced logic such as loops and conditions is often difficult to express in SQL. There are some options like stored procedures, but they require learning new languages.

Dataiku lets you run SQL queries directly from a Python recipe. This lets you:

* sequence several SQL queries

* dynamically generate new SQL queries to execute in the database

* use SQL to obtain some aggregated data for further numerical processing in Python

* and much more!

In this tutorial, we are going to use this ability to analyze a dataset from the San Francisco airport.

## Prerequisites

* You’ll need a SQL database configured in Dataiku. We are not going to use very advanced SQL, so any supported SQL database will do.

## Getting Started

To get started, create the initial starter project with the data already uploaded.

* From the Dataiku homepage, click **+New Project > DSS Tutorials > Code > SQL in Python (Advanced Tutorial)**.

Note

You can also download the starter project from this website and import it as a zip file.

## The problem at hand

The dataset records the number of cargo landings and total landed cargo weight at the SFO airport. The dataset contains one record for each month, airline, type and details of aircraft. Each record includes the number of aircraft of this type from this airline that landed during the month, and the total landed weight.

The dataset contains data from 2005 to 2015.

What we would like to do is write a recipe to obtain, for each month and airline, a breakdown in several columns of the total landing weight by aircraft manufacturer. In essence, that’s a kind of crosstab / pivot table.

To make things a bit more complex, we are not interested in the small aircraft manufacturers (there are more than 15 in the original data). We only want the top 5 manufacturers. And by top, we mean, “with the highest count of landings”.

With these constraints, doing that in SQL only would be fairly complex. Let’s do it with a little bit of Python calling SQL.

### Input data

| Month | Airline | Aircraft type | Landings | Total weight |
| --- | --- | --- | --- | --- |

### Output data

| Month | Airline | Boeing | Airbus | McDonnell Douglas |
| --- | --- | --- | --- | --- |
| 201501 | United | Total Boeing weight | Total Airbus weight | Total McDonnell Douglas weight |

## Getting the data ready

Your Dataiku project has the source data in a Filesystem (or Uploaded) dataset. As this is an import from a CSV file, Dataiku has not automatically typed the input columns: if you go to the settings of the dataset, you’ll see the columns declared as string.

We could set the types manually in the dataset, but we could also let the Prepare recipe do it. Since we need to copy our dataset to a SQL database anyway, using a Prepare recipe instead of a Sync recipe gives us type inference for free.

* Go to the Flow view and select your source dataset.

* Create a Prepare recipe from the source dataset.

* Choose to store the output in your SQL database connection.

* Let’s name the output dataset `sfo_prepared`.

The dataset is actually fairly clean, so we won’t need any actual preparation step.

Note

**About invalid values**

If you look at your preparation recipe data table, you’ll see that Dataiku has colored some cells in red. That’s because Dataiku thinks that the *meaning* of the IATA columns is a Country (since many values are valid countries). Dataiku is wrong on this, since not all IATA codes are valid countries.

We could click on the column header and click **Change meaning** to tell Dataiku that this is only text. However, note that Dataiku has already selected *string* as the storage type (since a Country name can only be stored as string anyway). Fixing the meaning makes for a cleaner recipe but is not strictly mandatory.

Let’s **Run** our preparation recipe (select “Update Schema” when prompted).

Everything is going well, so we now have our properly typed dataset in our SQL database. Let’s create a Python recipe with `sfo_prepared` as input, and an output dataset `sfo_pivot` in the same SQL connection.

## What we are going to do

We want to do our processing in three steps:

* First, we’ll issue a SQL query to retrieve the top 5 manufacturers by total landing count

* We’ll use that knowledge and a small Python loop to generate the actual pivot query

* We’ll execute the pivot query in SQL and store the results in the output dataset

In order to create the column per manufacturer, we’re going to use a small trick: “CASE WHEN”.

This SQL construct allows you to create conditional operations. To make a column with the total landing weights of Airbus planes only, here is how we would use it:

§ SELECT
§     SUM(CASE WHEN "Aircraft Manufacturer" = 'Airbus'
§         THEN "Total Landed Weight" ELSE 0 END) AS airbus_weight

For each row, if it is Airbus, we sum the weight, else we sum 0.
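To see the conditional sum in isolation, here is a minimal sketch that runs the same construct against an in-memory SQLite database. SQLite and the `landings` table with its four made-up rows are assumptions used only for illustration; they stand in for the tutorial's SQL connection and data.

```python
import sqlite3

# In-memory SQLite database standing in for the real SQL connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE landings ("Aircraft Manufacturer" TEXT, "Total Landed Weight" INTEGER)'
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO landings VALUES (?, ?)",
    [("Airbus", 100), ("Boeing", 250), ("Airbus", 50), ("Embraer", 30)],
)

# The CASE WHEN contributes the weight for Airbus rows and 0 for all others,
# so the first SUM only accumulates Airbus weight.
airbus_weight, total_weight = conn.execute(
    """
    SELECT
        SUM(CASE WHEN "Aircraft Manufacturer" = 'Airbus'
                 THEN "Total Landed Weight" ELSE 0 END) AS airbus_weight,
        SUM("Total Landed Weight") AS total_weight
    FROM landings
    """
).fetchone()

print(airbus_weight, total_weight)  # 150 430
```

Placing the unconditional `SUM` next to the conditional one shows that the rows are scanned once and that each column simply decides which rows count.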

Warning

The code below refers to the table as `sfo_prepared`. If you created the project through the method described above, your table will have the project key prefixed to it. You’ll also need to wrap it in double quotation marks, such as `"DKU_TUTORIAL_SQLINPYTHON_sfo_prepared"`.

You can also create an SQL notebook and click on the table name to see how it should be referenced if you have trouble.
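If you prefer not to hard-code the prefixed name in every query, a small helper can build it once. This is only a sketch: `quoted_table_name` is a hypothetical helper, and it assumes the table is named with the project key, an underscore, then the dataset name, as in the example above. Your connection's naming settings may differ, so check the actual name in a SQL notebook.

```python
def quoted_table_name(project_key, dataset_name):
    """Return the double-quoted, project-prefixed table name (hypothetical helper)."""
    # Assumption: the connection names tables "<PROJECT_KEY>_<dataset name>".
    raw = "%s_%s" % (project_key, dataset_name)
    # Double any embedded double quote, as SQL requires inside quoted identifiers.
    return '"%s"' % raw.replace('"', '""')


table = quoted_table_name("DKU_TUTORIAL_SQLINPYTHON", "sfo_prepared")
query = "select count(*) from %s" % table
print(query)
```

The resulting string can then be substituted wherever the queries in this tutorial say `from sfo_prepared`.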

§ # -*- coding: utf-8 -*-
§ import dataiku
§ import pandas as pd, numpy as np
§ from dataiku import pandasutils as pdu
§ # Import the class that allows us to execute SQL on the Dataiku connections
§ from dataiku.core.sql import SQLExecutor2
§ # Get a handle on the input dataset
§ sfo_prepared = dataiku.Dataset("sfo_prepared")
§ # We create an executor. We pass to it the dataset instance. This way, the
§ # executor knows which SQL database should be targeted
§ executor = SQLExecutor2(dataset=sfo_prepared)
§ # Get the 5 most frequent manufacturers by total landing count
§ # (over the whole period)
§ mf_manufacturers = executor.query_to_df(
§     """
§     select "Aircraft Manufacturer" as manufacturer,
§            sum("Landing Count") as count
§     from sfo_prepared
§     group by "Aircraft Manufacturer"
§     order by count desc limit 5
§     """)
§ # The "query_to_df" method returns a Pandas dataframe that
§ # contains the manufacturers

We now have a dataframe with the manufacturers. Let’s use a small Python loop to generate these pesky CASE WHEN expressions:

§ cases = []
§ for (row_index, manufacturer, count) in mf_manufacturers.itertuples():
§     cases.append(
§         """SUM (case when "Aircraft Manufacturer" = '%s'
§                 then "Total Landed Weight" else 0 end)
§             as "weight_%s"
§         """ % (manufacturer, manufacturer))
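Because the manufacturer names are pasted into the SQL text, a name containing a quote character would break the generated query. The sketch below shows one way to guard against that by doubling quote characters, which is the standard SQL escaping rule. `build_case` and the sample name `O'Hypothetical` are made up for illustration and are not part of the tutorial's code.

```python
def build_case(manufacturer):
    """Build one SUM(CASE WHEN ...) column for a manufacturer (hypothetical helper)."""
    # Double single quotes so the name stays a valid SQL string literal.
    literal = manufacturer.replace("'", "''")
    # Double double quotes so the alias stays a valid quoted identifier.
    alias = ("weight_%s" % manufacturer).replace('"', '""')
    return (
        'SUM(case when "Aircraft Manufacturer" = \'%s\' '
        'then "Total Landed Weight" else 0 end) as "%s"' % (literal, alias)
    )


# Sample names; the last one is invented to exercise the escaping.
manufacturers = ["Boeing", "Airbus", "O'Hypothetical"]
cases = [build_case(m) for m in manufacturers]
print(",\n".join(cases))
```

For the names in this dataset the plain `%s` substitution is enough; the escaping only matters if the values come from less predictable data.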

To finish, we only need to build the final query, execute it, get a dataframe, and store the result in the output dataset:

§ final_query = """select "Activity Period", "Operating Airline",
§     COUNT(*) as airline_count, %s
§     from sfo_prepared
§     group by "Activity Period", "Operating Airline"
§     """ % (",".join(cases))
§ # Print the generated SQL so that it can be inspected in the job log
§ print(final_query)
§ result = executor.query_to_df(final_query)
§ output_dataset = dataiku.Dataset("sfo_pivot")
§ output_dataset.write_with_schema(result)

We can now run and have a look at our output dataset!
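If you want to experiment with the same logic outside Dataiku, the whole flow can be reproduced on a tiny in-memory SQLite table. This is only a sketch under stated assumptions: SQLite replaces the tutorial's SQL connection and `SQLExecutor2`, the four rows are invented, and `TOP_N` is set to 2 so that the small sample still drops one manufacturer.

```python
import sqlite3

# In-memory SQLite database standing in for the Dataiku SQL connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE sfo_prepared (
           "Activity Period" INTEGER,
           "Operating Airline" TEXT,
           "Aircraft Manufacturer" TEXT,
           "Landing Count" INTEGER,
           "Total Landed Weight" INTEGER)"""
)
# Made-up rows, for illustration only.
rows = [
    (201501, "United", "Boeing", 10, 1000),
    (201501, "United", "Airbus", 4, 300),
    (201501, "Tiny Air", "Embraer", 1, 20),
    (201502, "United", "Boeing", 8, 900),
]
conn.executemany("INSERT INTO sfo_prepared VALUES (?, ?, ?, ?, ?)", rows)

# Step 1: find the top manufacturers by total landing count.
TOP_N = 2
top = [
    r[0]
    for r in conn.execute(
        """select "Aircraft Manufacturer", sum("Landing Count") as landing_count
           from sfo_prepared
           group by "Aircraft Manufacturer"
           order by landing_count desc limit ?""",
        (TOP_N,),
    )
]

# Step 2: generate one conditional SUM column per top manufacturer.
case_template = (
    'SUM(case when "Aircraft Manufacturer" = \'{m}\' '
    'then "Total Landed Weight" else 0 end) as "weight_{m}"'
)
cases = [case_template.format(m=m) for m in top]

# Step 3: build and run the pivot query.
final_query = (
    'select "Activity Period", "Operating Airline", '
    "COUNT(*) as airline_count, %s "
    "from sfo_prepared "
    'group by "Activity Period", "Operating Airline" '
    'order by "Activity Period", "Operating Airline"'
) % ", ".join(cases)
pivot = conn.execute(final_query).fetchall()

for row in pivot:
    print(row)
# (201501, 'Tiny Air', 1, 0, 0)
# (201501, 'United', 2, 1000, 300)
# (201502, 'United', 1, 900, 0)
```

Note that the airline flying only the excluded manufacturer still appears in the output, with zeros in every weight column, because the pivot query groups all rows and only the conditional sums filter by manufacturer.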

## Look at the output

The output is exactly what we wanted.

Let’s not resist making a chart. Let’s build a **Stacked columns** chart.

In “Tooltip”, add the `airline_count (SUM)` column, then click on `Operating Airline` and choose to sort by descending `airline_count`.

We obtain the following chart.

Fairly unsurprisingly, Boeing is very dominant, and most airlines fly aircraft from a single manufacturer. However, United Airlines has a non-negligible Airbus cargo fleet.

## Going further: execute in-database

In this first version, we executed both queries using `query_to_df`, meaning that the Python code actually received all of the result data in memory, and then sent it back to the database to store the output dataset.

It would be better (and critical in the case of big datasets) that the “final” query be performed fully in-database.

Fortunately, Dataiku makes that easy. Executing SQL queries in-database and handling the work of dropping and creating the table is what the SQL query recipe does. The `SQLExecutor2` class lets you run a query as if it were a SQL query recipe.

Let’s just replace the final 3 lines of the code with:

§ output_dataset = dataiku.Dataset("sfo_pivot")
§ SQLExecutor2.exec_recipe_fragment(output_dataset, final_query)

And re-run. Everything works the same, but now, the data has not been streamed to Python. Everything stayed in the database.
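To make the difference concrete, the sketch below mimics the in-database pattern with SQLite: the output table is dropped and recreated from the query inside the database, and Python only reads back a row count. This is an analogy under stated assumptions, not Dataiku's implementation of `exec_recipe_fragment`; SQLite, the simplified two-column table, and the sample rows are all invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE sfo_prepared ("Operating Airline" TEXT, "Total Landed Weight" INTEGER)'
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO sfo_prepared VALUES (?, ?)",
    [("United", 1000), ("United", 900), ("Tiny Air", 20)],
)

# A simplified stand-in for the generated pivot query.
final_query = (
    'select "Operating Airline", SUM("Total Landed Weight") as total_weight '
    'from sfo_prepared group by "Operating Airline"'
)

# Rebuild the output table entirely inside the database:
# the query results are never fetched into Python.
conn.execute("DROP TABLE IF EXISTS sfo_pivot")
conn.execute("CREATE TABLE sfo_pivot AS " + final_query)

# Only this small sanity check crosses into Python.
row_count = conn.execute("SELECT COUNT(*) FROM sfo_pivot").fetchone()[0]
print(row_count)  # 2
```

With `query_to_df` followed by `write_with_schema`, every result row makes a round trip through the Python process; here, as with `exec_recipe_fragment`, only the SQL text is sent to the database.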
