# Leveraging SQL in Python & R

Prerequisites

* A DSS connection to a SQL database (or the possibility of creating one)

* A DSS user profile with the following global permissions:

  + “Create projects”

  + “Write unisolated code”

  + [optional] “Create active Web content”

* [optional] R on DSS

* [optional] A code environment with Python version >= 3.6 to install the dash package

* [optional] Basic knowledge of Dash

## Introduction

Structured Query Language (SQL) is a family of languages used to manage data held in relational databases. With SQL, data practitioners or applications can efficiently insert, transform, and retrieve data.

DSS can translate visual recipes into the SQL syntax of the database that holds the data. This feature lets DSS users easily contribute to the development of efficient ETL pipelines without having to write a single line of SQL. Users can also choose to write SQL query or script recipes for more specific data processing.

Lesser known is the possibility for coders to run SQL statements through the `SQLExecutor2` class in Python, and the `dkuSQLQueryToData` and `dkuSQLExecRecipeFragment` functions in R. These are part of the internal `dataiku` package for Python and R.

Using SQL in Python or R has three main advantages:

* The flexibility of a programming language to generate SQL statements (e.g. dynamic queries).

* The possibility of further processing the result of a query directly in Python or R.

* The use of the database’s engine to run the queries.

In this tutorial, you’ll see how you can use Python or R to:

* Generate a dataframe from a SQL query in a Python or R notebook.

* Execute a SQL statement in a Python or R recipe.

* [Python only] Use SQL in a Dash web app.

In this tutorial, you will get familiar with using SQL in Python or R by working with a modified sample of the well-known New York City Yellow Cab data and the New York City zones data. These datasets contain records from thousands of NYC yellow cab trips in 2020 and a lookup table for New York City neighborhoods, respectively. See the Initial Set Up section below for downloading instructions.

## Initial Set Up

In this section, you will take the necessary steps to have the two above-mentioned datasets ready in a SQL database.

### SQL connection on DSS

For this tutorial, you will need a DSS connection to a SQL database. The examples in this tutorial were run on a PostgreSQL 11.14 connection named **pg**. However, any supported database should do (after adapting the SQL syntax shown in the code snippets where needed).

If you have to create a new connection, please refer to our documentation or this Academy course. Note that you will either need to have admin access or the right to create personal connections.

### Create your project and upload the datasets

* Create a new project and name it *NYC Yellow Cabs*.

* Download the NYC\_taxi\_trips and the NYC\_taxi\_zones datasets.

* Upload those two datasets to your project. There is no need to decompress the datasets prior to uploading them.

### Build the two SQL datasets

Now use sync recipes to write both datasets to your SQL connection. Name the output datasets *NYC\_trips* and *NYC\_zones*.

## Notebook examples

It’s time to explore both datasets in either a Python or R notebook.

* In the top navigation bar, go to *</> -> Notebooks*

* Create either a Python or an R notebook (if R is available on your instance).

Copy and paste the code for the language of your notebook: the Python version comes first, followed by the R version. This code sends a query string to the database and collects the result as a dataframe:

§ import dataiku

§ from dataiku.core.sql import SQLExecutor2

§ dataset\_trips = dataiku.Dataset("NYC\_trips")

§ # Instantiate the SQLExecutor2 class, passing the NYC\_trips dataset object so that the connection details are retrieved from the dataset. An alternative is to pass the connection name directly with `connection='pg'`.

§ e = SQLExecutor2(dataset=dataset\_trips)

§ # Get the name of the SQL table underlying the NYC\_trips dataset.

§ table\_name = dataset\_trips.get\_location\_info().get('info', {}).get('table')

§ # Send the query to the database and return the result as a pandas dataframe.

§ query = f"""SELECT \* FROM "{table\_name}" LIMIT 10"""

§ df = e.query\_to\_df(query)

§ library(dataiku)

§ # Get the name of the SQL table underlying the NYC\_trips dataset.

§ table\_name\_trips <- dkuGetDatasetLocationInfo("NYC\_trips")$info$table

§ query = sprintf('

§ SELECT \* FROM "%s" LIMIT 10

§ ', table\_name\_trips)

§ # Returns the result of the query as a dataframe. Also needs the connection name.

§ df <- dkuSQLQueryToData(connection='pg', query=query)

The above query may be simple, but it shows how Python or R gives you the flexibility to generate any query string you want, potentially leveraging the results of other operations. Additionally, you can now analyze or further process the results of the query using one of the two programming languages.
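As a minimal sketch of what generating a query string can look like, the hypothetical helpers below assemble a `SELECT` statement from a table name, an optional column list, and an optional row limit. `quote_ident` and `build_select` are illustrative names and not part of the `dataiku` package, and the double-quote identifier quoting assumes a PostgreSQL-style database.

```python
def quote_ident(name):
    """Quote a SQL identifier PostgreSQL-style, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(table, columns=None, limit=None):
    """Assemble a SELECT statement from Python values (illustrative helper)."""
    column_sql = ", ".join(quote_ident(c) for c in columns) if columns else "*"
    query = f"SELECT {column_sql} FROM {quote_ident(table)}"
    if limit is not None:
        # int() keeps non-numeric input out of the query text.
        query += f" LIMIT {int(limit)}"
    return query


# Hypothetical table name; in DSS it would come from get_location_info().
example_query = build_select(
    "NYC_trips", columns=["trip_distance", "tip_amount"], limit=10
)
print(example_query)
```

The resulting string can then be passed to `query_to_df()` in the same way as the hand-written query above.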

Each row represents a trip. Trip durations are expressed in minutes and all amounts in USD. The columns **PULocationID** and **DOLocationID** refer to the pick-up and drop-off zone identifiers in the *NYC\_zones* dataset. Running a similar query on the *NYC\_zones* dataset returns the zone lookup table.

## Code recipe examples

You can also use SQL inside a Python or R recipe. If you don’t need to further transform the results of the query, there is no reason to load them as a dataframe first: it is more efficient to run everything in-database. You can rely on the `exec_recipe_fragment()` method of `SQLExecutor2` in Python or the `dkuSQLExecRecipeFragment` function in R to store the result of the query directly into a SQL dataset on the *pg* connection. For this to work, the output dataset must be a table within the same SQL connection.

Create a Python or R recipe that takes the *NYC\_trips* and *NYC\_zones* datasets as inputs, name the output dataset *NYC\_trips\_zones* and create it within the *pg* connection. Copy and paste the code below (Python version first, then R) and run the recipe.

§ import dataiku

§ from dataiku.core.sql import SQLExecutor2

§ e = SQLExecutor2(connection='pg')

§ zones\_dataset = dataiku.Dataset("NYC\_zones")

§ table\_name\_zones = zones\_dataset.get\_location\_info().get('info', {}).get('table')

§ taxi\_dataset = dataiku.Dataset("NYC\_trips")

§ table\_name\_trips = taxi\_dataset.get\_location\_info().get('info', {}).get('table')

§ query = f"""

§ SELECT

§ "trips"."tpep\_pickup\_datetime" AS "tpep\_pickup\_datetime",

§ "trips"."trip\_duration" AS "trip\_duration",

§ "trips"."tpep\_dropoff\_datetime" AS "tpep\_dropoff\_datetime",

§ "trips"."trip\_distance" AS "trip\_distance",

§ "trips"."fare\_amount" AS "fare\_amount",

§ "trips"."tip\_amount" AS "tip\_amount",

§ "trips"."total\_amount" AS "total\_amount",

§ "PUzones"."zone" AS "pu\_zone",

§ "DOzones"."zone" AS "do\_zone"

§ FROM "{table\_name\_trips}" "trips"

§ LEFT JOIN "{table\_name\_zones}" "PUzones"

§ ON "trips"."PULocationID" = "PUzones"."OBJECTID"

§ LEFT JOIN "{table\_name\_zones}" "DOzones"

§ ON "trips"."DOLocationID" = "DOzones"."OBJECTID"

§ """

§ nyc\_trips\_zones = dataiku.Dataset("NYC\_trips\_zones")

§ # Pass the output dataset object in which to store the result of the query.

§ e.exec\_recipe\_fragment(nyc\_trips\_zones, query)

§ library(dataiku)

§ table\_name\_zones <- dkuGetDatasetLocationInfo("NYC\_zones")$info$table

§ table\_name\_trips <- dkuGetDatasetLocationInfo("NYC\_trips")$info$table

§ query = sprintf('

§ SELECT

§ "trips"."tpep\_pickup\_datetime" AS "tpep\_pickup\_datetime",

§ "trips"."trip\_duration" AS "trip\_duration",

§ "trips"."tpep\_dropoff\_datetime" AS "tpep\_dropoff\_datetime",

§ "trips"."trip\_distance" AS "trip\_distance",

§ "trips"."fare\_amount" AS "fare\_amount",

§ "trips"."tip\_amount" AS "tip\_amount",

§ "trips"."total\_amount" AS "total\_amount",

§ "PUzones"."zone" AS "pu\_zone",

§ "DOzones"."zone" AS "do\_zone"

§ FROM "%s" "trips"

§ LEFT JOIN "%s" "PUzones"

§ ON "trips"."PULocationID" = "PUzones"."OBJECTID"

§ LEFT JOIN "%s" "DOzones"

§ ON "trips"."DOLocationID" = "DOzones"."OBJECTID"

§ ', table\_name\_trips, table\_name\_zones, table\_name\_zones)

§ # Pass the output dataset name in which to store the result of the query.

§ dkuSQLExecRecipeFragment("NYC\_trips\_zones", query)

Note

You could have performed this simple query using a visual join or a SQL recipe. The point here is to show you that you can use code to generate more complex SQL queries. In some instances, code is the most convenient way to write logic beyond what SQL constructs or visual recipes are capable of.
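As one illustration of that point, the sketch below rebuilds the join query from plain Python data structures instead of typing each line by hand. The helper name `build_join_query` and the placeholder table names are hypothetical; the column names and aliases are the ones used in the recipe above.

```python
TRIP_COLUMNS = [
    "tpep_pickup_datetime", "trip_duration", "tpep_dropoff_datetime",
    "trip_distance", "fare_amount", "tip_amount", "total_amount",
]

# (alias of the zones table, join key in the trips table, output column name)
ZONE_JOINS = [
    ("PUzones", "PULocationID", "pu_zone"),
    ("DOzones", "DOLocationID", "do_zone"),
]


def build_join_query(trips_table, zones_table):
    """Generate the trips/zones join query (illustrative helper)."""
    select_items = [f'"trips"."{col}" AS "{col}"' for col in TRIP_COLUMNS]
    select_items += [f'"{alias}"."zone" AS "{out}"' for alias, _, out in ZONE_JOINS]
    joins = [
        f'LEFT JOIN "{zones_table}" "{alias}"\n'
        f'  ON "trips"."{key}" = "{alias}"."OBJECTID"'
        for alias, key, _ in ZONE_JOINS
    ]
    return (
        "SELECT\n  " + ",\n  ".join(select_items) + "\n"
        + f'FROM "{trips_table}" "trips"\n'
        + "\n".join(joins)
    )


# Placeholder table names; in the recipe they come from get_location_info().
join_query = build_join_query("trips_table", "zones_table")
print(join_query)
```

Adding a column to the output then only requires a new entry in `TRIP_COLUMNS`, and the generated string can be passed to `exec_recipe_fragment()` as in the recipe above.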

## Dash webapp example

Most websites (or web applications) have to store and serve content. Whether it is to store customer login information, inventory lists or any data to be sent to the users, databases have become an essential part of a web application architecture.

You will now see how to use `SQLExecutor2` in a Dash webapp within DSS to visualize the count of trips over time in 2020 from and to user-specified locations.

### Create a Code Environment for your Dash app

* In the top navigation bar, click the *Applications* grid icon.

* Click *Administration->Code Envs->NEW PYTHON ENV*

* Select python>=3.6 (from PATH) and leave all other fields as is. Click create.

* In *Packages to install*, type **dash** under *Requested packages (Pip)*. Click *SAVE AND UPDATE*.

Note

If you need a refresher on code environment creation and packages install, please refer to our documentation.

### Create a Dash Webapp

* Head back to your *NYC Yellow Cabs* project.

* In the top navigation bar, go to *</> -> Webapps*.

* Click on *+ NEW WEBAPP* on the top right, then select *Code Webapp > Dash*.

* Select the *An empty Dash app* template and give a name to your newly-created Webapp.

* Once created, go to *Settings* and in the *code env* dropdown, click *Select an environment*.

* In the *Environment* dropdown below, select your newly-created code environment.

### Build your Webapp

Copy and paste the code below:

§ import dataiku

§ import pandas as pd

§ from dataiku.core.sql import SQLExecutor2

§ import plotly.express as px

§ from dash import dcc, html, Input, Output, State

§ # Collect the zones to populate the dropdowns

§ dataset\_zones = dataiku.Dataset('NYC\_zones')

§ zones = dataset\_zones.get\_dataframe(columns=['zone']).values.ravel()

§ zones.sort()

§ dataset\_trips = dataiku.Dataset("NYC\_trips\_zones")

§ table\_name\_trips = dataset\_trips.get\_location\_info().get('info', {}).get('table')

§ e = SQLExecutor2(connection='pg')

§ # This is the query template. There are two placeholders for the pick-up and drop-off locations in the WHERE clause. These will be populated from the dropdowns' values.

§ query = """

§ SELECT

§ "pickup\_time",

§ COUNT(\*) AS "trip\_count"

§ FROM (

§ SELECT

§ date\_trunc('day', "tpep\_pickup\_datetime") AS "pickup\_time"

§ FROM "{}" WHERE "pu\_zone" IN {} AND "do\_zone" IN {}

§ ) "dku\_\_beforegrouping"

§ GROUP BY "pickup\_time"

§ ORDER BY "pickup\_time" """

§ app.layout = html.Div([

§ html.H1("Taxi Data"),

§ html.P("From"),

§ # Dropdown for the input location(s).

§ dcc.Dropdown(

§ id="pu\_zone",

§ options=zones,

§ value=None,

§ multi=True,

§ placeholder="Select pick-up location(s)..."),

§ html.P("To"),

§ # Dropdown for the output location(s).

§ dcc.Dropdown(

§ id="do\_zone",

§ options=zones,

§ value=None,

§ placeholder="Select drop-off location(s)...",

§ multi=True),

§ html.Br(),

§ # "Query Trips" button.

§ html.Button('Query Trips', id='submit', n\_clicks=0),

§ html.Br(),

§ dcc.Graph(id="output")

§ ])

§ @app.callback(

§ output=Output('output', 'figure'),

§ inputs=dict(n\_clicks=Input('submit', 'n\_clicks')),

§ state=dict(pu=State('pu\_zone', 'value'),

§ do=State('do\_zone', 'value'))

§ )

§ def get\_query(n\_clicks, pu, do):

§     """This function runs when the user clicks the "Query Trips" button."""

§     # Return an empty figure until the button is clicked and both dropdowns hold a selection.

§     if n\_clicks == 0 or not pu or not do:

§         return {}

§     # Turn each Python list into a comma-separated list of quoted strings.

§     pu = str(pu)[1:-1]

§     do = str(do)[1:-1]

§     # The pick-up and drop-off location(s) are fed into the query placeholders.

§     q = query.format(table\_name\_trips, f'({pu})', f'({do})')

§     df = e.query\_to\_df(q)

§     fig = px.line(

§         df, x='pickup\_time', y="trip\_count",

§         labels={"pickup\_time": "date", "trip\_count": "trip count"})

§     return fig

### What does this webapp do?

* This webapp lets users input one or more pick-up locations and one or more dropoff locations from two dropdown menus.

* Once the *Query Trips* button is clicked, a SQL query is generated with a dynamic WHERE clause to filter on those pick-up and drop-off locations.

* The query is then run against the underlying table of the *NYC\_trips\_zones* dataset and returns, as a pandas dataframe, the count of trips from those pick-up locations to those drop-off locations aggregated at the day level.

* The dataframe is then fed to a *plotly.express* line plot, allowing the users to visualize the results. Keep in mind that this is a 10% random sample of the original dataset, so the true daily trip count is roughly 10 times greater than the one reported in the plot. Also, don’t be puzzled by the drastic fall in trip count starting in March 2020: that is the effect of the pandemic.
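The callback above builds the `IN (...)` lists with `str(pu)[1:-1]`, which depends on how Python prints a list. As a hedged alternative, the sketch below quotes each value explicitly as a SQL string literal and doubles any embedded single quote, so that a zone name containing an apostrophe still yields valid SQL. `sql_in_list` is a hypothetical helper, not part of the `dataiku` package; the zone names are illustrative values only, and the escaping assumes standard SQL string literals as used by PostgreSQL by default.

```python
def sql_in_list(values):
    """Render a non-empty list of strings as a parenthesized SQL IN list."""
    if not values:
        raise ValueError("IN () is not valid SQL; at least one value is required")
    # Wrap each value in single quotes and double any embedded single quote.
    literals = ["'" + str(v).replace("'", "''") + "'" for v in values]
    return "(" + ", ".join(literals) + ")"


# Same shape as the WHERE clause of the webapp's query template.
where_template = 'WHERE "pu_zone" IN {} AND "do_zone" IN {}'

# Illustrative zone names; the second one contains an apostrophe.
where_clause = where_template.format(
    sql_in_list(["Midtown Center", "Governor's Island"]),
    sql_in_list(["JFK Airport"]),
)
print(where_clause)
```

In the callback, the two `str(...)[1:-1]` lines and the surrounding `f'({pu})'` wrappers could be replaced by calls to such a helper, since it already returns the parentheses.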
