# Data quality assessments (SQL Datasets)

Pre-requisites

* A working Connection to a SQL database

Data quality is fundamental to the success of a business: it leads to better decision-making, which translates into better service and potentially higher revenue. In practice, making sure that key datasets abide by your project’s quality rules helps keep the project robust. For example, you will probably want to refrain from retraining a model if specific columns of your training data have too many missing values.

In Dataiku, you can ensure data quality visually with a combination of Metrics and Checks. You first use Metrics to take measurements on your data, then use Checks to verify that those measurements meet your expectations about the data.
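As a rough illustration of the split between a metric and a check, independent of any Dataiku API, the sketch below computes one measurement (the share of missing values in a column) and then compares it to an expectation. The function names, the sample rows and the 20% threshold are hypothetical and chosen only for this example.

```python
def missing_ratio(rows, column):
    """Metric: fraction of rows whose value for `column` is None."""
    if not rows:
        return None
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def check_max_ratio(ratio, maximum):
    """Check: 'OK' if the ratio does not exceed `maximum`, else 'ERROR'."""
    if ratio is None:
        return 'EMPTY'
    return 'OK' if ratio <= maximum else 'ERROR'


# Hypothetical sample data: one of four rows has no value for pages_visited
sample_rows = [
    {'pages_visited': 3},
    {'pages_visited': None},
    {'pages_visited': 7},
    {'pages_visited': 1},
]

ratio = missing_ratio(sample_rows, 'pages_visited')
status = check_max_ratio(ratio, maximum=0.2)
print(ratio, status)
```

The metric only measures and the check only judges, so the same measurement can be reused with different thresholds.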

While it is possible to implement custom metrics or checks, those still rely on the visual features of Dataiku. For a fully programmatic usage, it is more convenient to implement the same logic using plain Python Recipes. The resulting Flow can then be orchestrated and automated using Scenarios.

In this tutorial, you will implement an example of fully programmatic data quality assessment in a Dataiku Project. You can think of it as a light and custom version of the Metrics and Checks visual features.

## Setting up your project

This tutorial is based on the “Dataiku TShirts” sample Project available directly on your Dataiku platform. This project features a simple data processing pipeline on sales data that eventually builds an ML model predicting sales revenue from new visitors on the website.

* Create the “Dataiku TShirts” sample project.

* Go to the Flow and change the connection of all non-input datasets to your SQL connection.

Note

This tutorial assumes that you are using Snowflake; however, any compatible SQL database can also work. You may have to slightly modify the SQL/Python code and use the correct data types in the relevant Datasets to comply with the syntax of your SQL flavor.

## Creating metrics and checks

The `web_last_month_enriched` dataset serves as the training dataset for our model. As such, feeding quality data to the model is of paramount importance. You will now create a Python recipe to compute metrics and checks on that dataset.

### Create a Python recipe on the training dataset.

Create a Python recipe from the `web_last_month_enriched` dataset. Name the output dataset `checks_web_last_month_enriched`. That output dataset will contain all the results of your checks, which in turn will govern whether or not the model should be trained.

### Importing libraries, and defining handles and check functions.

Replace the recipe’s template code with the following snippet, which imports the necessary libraries, creates handles on the input and output datasets, and defines the check functions. The two functions implement a numeric range check and a value-in-set check.

§ import numbers
§
§ import dataiku
§ from dataiku import SQLExecutor2
§
§ # Handles on the recipe's input and output datasets
§ input_dataset_name = 'web_last_month_enriched'
§ input_dataset = dataiku.Dataset(input_dataset_name)
§ output_dataset_name = 'checks_web_last_month_enriched'
§ output_dataset = dataiku.Dataset(output_dataset_name)
§
§
§ def metric_in_numeric_range(metric_val=None, maximum=None,
§                             soft_maximum=None, minimum=None, soft_minimum=None):
§     """
§     Returns OK if a metric value falls within the minimum - maximum range, otherwise ERROR
§     Returns WARNING if a metric value falls outside a soft_minimum - soft_maximum range
§     Returns EMPTY if there is no metric value or no bound to compare it to
§     """
§     if metric_val is None:
§         return 'EMPTY'
§     if maximum is None and soft_maximum is None and minimum is None and soft_minimum is None:
§         return 'EMPTY'
§     if not isinstance(metric_val, numbers.Number):
§         # A non-numeric value cannot be compared to the bounds
§         return 'WARNING'
§     # Hard bounds take precedence over soft bounds
§     if minimum is not None and metric_val < minimum:
§         return 'ERROR'
§     if maximum is not None and metric_val > maximum:
§         return 'ERROR'
§     if soft_minimum is not None and metric_val < soft_minimum:
§         return 'WARNING'
§     if soft_maximum is not None and metric_val > soft_maximum:
§         return 'WARNING'
§     return 'OK'
§
§
§ def metric_in_set_of_values(metric_vals=None, admissible_values=None):
§     """
§     Returns OK if the set of metric values is in
§     the set of allowed values, otherwise ERROR
§     Returns EMPTY if either argument is not a set or is empty
§     """
§     if not isinstance(metric_vals, set) or not isinstance(admissible_values, set):
§         return 'EMPTY'
§     if not metric_vals or not admissible_values:
§         return 'EMPTY'
§     # Any value outside the admissible set makes the check fail
§     if metric_vals - admissible_values:
§         return 'ERROR'
§     return 'OK'

### Querying the metrics and checking them.

In the recipe, you will use the `SQLExecutor2` class to run a SQL query against the `web_last_month_enriched` dataset and collect statistics (your metrics) in the form of a pandas DataFrame. Besides computing column statistics, the query also timestamps each run for bookkeeping purposes.

Add the following code to the bottom of your recipe:

§ # Compute all metrics in a single query against the input dataset's table
§ query_stats = f"""
§ SELECT
§     current_timestamp(2) AS "date",
§     MIN("pages_visited") AS "metric_pages_visited_min",
§     LISTAGG(DISTINCT("campain"), ', ') AS "metric_unique_campain",
§     COUNT(*) AS "metric_rec_count"
§ FROM "{input_dataset.get_location_info().get('info', {}).get('table')}"
§ """
§ executor = SQLExecutor2(dataset=input_dataset)
§ df = executor.query_to_df(query_stats)
§
§ # Check that metric_pages_visited_min is at least 1
§ df['in_num_range_pages_visited_min'] = metric_in_numeric_range(
§     df['metric_pages_visited_min'][0], minimum=1)
§
§ # Check that metric_rec_count is at least 1000
§ df['in_num_range_rec_count'] = metric_in_numeric_range(
§     df['metric_rec_count'][0], minimum=1000)
§
§ # Check that "metric_unique_campain" only contains true or false
§ metric_values = set(df['metric_unique_campain'][0].split(', '))
§ admissible_values = {'true', 'false'}
§ df['in_set_unique_campain'] = metric_in_set_of_values(
§     metric_values, admissible_values)
§
§ # Write the results of the query to your output dataset
§ output_dataset.write_with_schema(df)

In the above code, note that `LISTAGG` works for Snowflake, Oracle and Db2. For PostgreSQL or SQL Server, use `STRING_AGG`. For MySQL, use `GROUP_CONCAT()`.
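If you need to run the same recipe against several databases, one option is to build the aggregation expression from the target dialect. The helper below is a minimal sketch and not part of the Dataiku API: the function name and the dialect labels are hypothetical, and only three dialects are covered. SQL Server is left out because de-duplicating values before `STRING_AGG` there typically requires a subquery.

```python
def distinct_list_expr(column, dialect, separator=', '):
    """Return a SQL expression concatenating the distinct values of `column`."""
    if dialect == 'snowflake':
        return f"""LISTAGG(DISTINCT "{column}", '{separator}')"""
    if dialect == 'postgresql':
        # STRING_AGG expects text, so the column is cast explicitly
        return f"""STRING_AGG(DISTINCT "{column}"::text, '{separator}')"""
    if dialect == 'mysql':
        # MySQL quotes identifiers with backticks by default
        return f"""GROUP_CONCAT(DISTINCT `{column}` SEPARATOR '{separator}')"""
    raise ValueError(f'Unsupported dialect: {dialect}')


# Example: the expression for the "campain" column, one line per dialect
for name in ('snowflake', 'postgresql', 'mysql'):
    print(distinct_list_expr('campain', name))
```

The returned string can be interpolated into the query in place of the hard-coded `LISTAGG` call. Only pass trusted column names, since the value is inserted into the SQL text as-is.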

You should be all set! Run the recipe to make sure everything works. The output dataset should contain a single row holding the timestamp, the three metrics, and the three check results.

## Persisting check results

When building datasets, the default behavior in DSS is to overwrite their content. If you want to persist the results from each build, go to the Python recipe, then click **Inputs/Outputs** and check the **Append instead of overwrite** box. Save.

## Using test results

Suppose that every week, you wish to re-train your revenue-prediction model on newer data. Before doing so, it’s important to check that, after every update, your training dataset (`web_last_month_enriched`) meets your data quality requirements, and to re-train the model only if none of the checks fails.

You can build a scenario to automate this process. After setting a weekly trigger, you would use a Python script (or Python step) to:

* Re-build the `checks_web_last_month_enriched` dataset recursively

* Retrieve the results of the checks

* Re-train the model only if all the checks passed.
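The three steps above could be sketched as follows in a custom Python scenario step. This is a minimal sketch under several assumptions: the model ID is a placeholder for your own saved model ID, check columns are recognized by their `in_` prefix (as produced by the recipe in this tutorial), and any status other than `OK`, including `WARNING`, blocks the re-training. The helper names are hypothetical.

```python
def all_checks_passed(check_row):
    """Return True only if every check column (prefixed 'in_') equals 'OK'."""
    statuses = [value for name, value in check_row.items()
                if name.startswith('in_')]
    return bool(statuses) and all(status == 'OK' for status in statuses)


def run_weekly_retrain(model_id):
    """Body of the scenario step; only meaningful inside a Dataiku scenario."""
    # Imported here so that all_checks_passed stays usable outside of Dataiku
    import dataiku
    from dataiku.scenario import Scenario

    scenario = Scenario()

    # 1. Re-build the checks dataset and its upstream datasets
    scenario.build_dataset('checks_web_last_month_enriched',
                           build_mode='RECURSIVE_BUILD')

    # 2. Retrieve the most recent row of check results
    checks_df = dataiku.Dataset('checks_web_last_month_enriched').get_dataframe()
    latest = checks_df.sort_values('date').iloc[-1].to_dict()

    # 3. Re-train the model only if all the checks passed
    if all_checks_passed(latest):
        scenario.train_model(model_id)
    else:
        print(f'Skipping training, some checks did not pass: {latest}')


# In the scenario step, call the function with your own saved model ID:
# run_weekly_retrain('YOUR_MODEL_ID')  # placeholder, not a real ID
```

Sorting on `date` matters when the output dataset is in append mode, since it then holds one row per run. If you prefer the scenario to be marked as failed when a check does not pass, raise an exception instead of printing.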
