# Datasets (other operations)

Please see Datasets (introduction) for an introduction to interacting with datasets through the Dataiku Python API.

This page gives usage examples for the dataset operations listed below. It is an overview of the main capabilities, not exhaustive documentation.

For the exhaustive reference documentation, please see Datasets (reference).

* Basic operations
  + Listing datasets
  + Deleting a dataset
  + Modifying tags for a dataset
  + Reading and modifying the schema of a dataset
  + Building a dataset
* Programmatic creation and setup (external datasets)
  + SQL dataset: Programmatic creation
  + SQL dataset: Modifying settings
  + Files-based dataset: Programmatic creation
    - Generic method for most connections
    - Quick helpers for some connections
  + Uploaded datasets: programmatic creation and upload
  + Manual creation
* Programmatic creation and setup (managed datasets)
  + Creating a new SQL managed dataset
  + Creating a new Files-based managed dataset with a specific schema
  + Creating a new partitioned managed dataset
* Flow handling
  + Creating recipes from a dataset
* ML & Statistics
  + Creating ML models
  + Creating statistics worksheets
* Misc operations
  + Listing partitions
  + Clearing data
  + Hive operations

In all examples, `project` is a `dataikuapi.dss.project.DSSProject` handle, obtained using `client.get_project()` or `client.get_default_project()`.

## Basic operations

### Listing datasets

§ datasets = project.list\_datasets()

§ # Returns a list of DSSDatasetListItem

§ for dataset in datasets:

§     # Quick access to main information in the dataset list item

§     print("Name: %s" % dataset.name)

§     print("Type: %s" % dataset.type)

§     print("Connection: %s" % dataset.connection)

§     print("Tags: %s" % dataset.tags) # Returns a list of strings

§     # You can also use the list item as a dict of all available dataset information

§     print("Raw: %s" % dataset)

This outputs:

§ Name: train\_set

§ Type: Filesystem

§ Connection: filesystem\_managed

§ Tags: ["creator\_admin"]

§ Raw: {  'checklists': {   'checklists': []},

§ 'customMeta': {   'kv': {   }},

§ 'flowOptions': {   'crossProjectBuildBehavior': 'DEFAULT',

§ 'rebuildBehavior': 'NORMAL'},

§ 'formatParams': {  /\* Parameters specific to each format type \*/ },

§ 'formatType': 'csv',

§ 'managed': False,

§ 'name': 'train\_set',

§ 'params': { /\* Parameters specific to each dataset type \*/ "connection" : "filesystem\_managed" },

§ 'partitioning': {   'dimensions': [], 'ignoreNonMatchingFile': False},

§ 'projectKey': 'TEST\_PROJECT',

§ 'schema': {   'columns': [   {     'name': 'col0',

§ 'type': 'string'},

§ {   'name': 'col1',

§ 'type': 'string'},

§ /\* Other columns ... \*/

§ ],

§ 'userModified': False},

§ 'tags': ['creator\_admin'],

§ 'type': 'Filesystem'}

§ ...
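
Because each list item can also be used as a dict, ordinary Python tools work on the result. The following is a minimal sketch, assuming each item exposes the `name` and `type` keys shown in the raw output above. `group_names_by_type` is a hypothetical helper, not part of the Dataiku API, and the sample items are made up for illustration.

```python
from collections import defaultdict

def group_names_by_type(items):
    """Map each dataset type to the sorted names of the datasets of that type.

    `items` is any iterable of dict-like objects with "name" and "type" keys.
    """
    grouped = defaultdict(list)
    for item in items:
        grouped[item["type"]].append(item["name"])
    return {dataset_type: sorted(names) for dataset_type, names in grouped.items()}

# Illustrative data only; with a live project you would pass project.list_datasets()
sample_items = [
    {"name": "train_set", "type": "Filesystem"},
    {"name": "orders", "type": "PostgreSQL"},
    {"name": "test_set", "type": "Filesystem"},
]
by_type = group_names_by_type(sample_items)
```

With a live project, `group_names_by_type(project.list_datasets())` would give a quick inventory of the project by dataset type.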

### Deleting a dataset

§ dataset = project.get\_dataset('TEST\_DATASET')

§ dataset.delete(drop\_data=True)

### Modifying tags for a dataset

§ dataset = project.get\_dataset("mydataset")

§ settings = dataset.get\_settings()

§ print("Current tags are %s" % settings.tags)

§ # Change the tags

§ settings.tags = ["newtag1", "newtag2"]

§ # If we changed the settings, we must save

§ settings.save()
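
The example above replaces the whole tag list. To add tags while keeping the existing ones, the two lists can be merged first. This is a small sketch; `merge_tags` is a hypothetical helper, not part of the Dataiku API, and it only manipulates plain lists of strings.

```python
def merge_tags(current_tags, new_tags):
    """Return current_tags followed by any new_tags not already present, preserving order."""
    merged = list(current_tags)
    for tag in new_tags:
        if tag not in merged:
            merged.append(tag)
    return merged

# With a live dataset, this would typically be used as:
#   settings = dataset.get_settings()
#   settings.tags = merge_tags(settings.tags, ["newtag1", "newtag2"])
#   settings.save()
merged = merge_tags(["creator_admin", "newtag1"], ["newtag1", "newtag2"])
```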

### Reading and modifying the schema of a dataset

Warning

Modifying the schema or settings of a dataset from within a DSS job should be done using dataiku.Dataset, NOT using DSSDataset.

Changes made to the schema with DSSDataset from within a DSS job are not taken into account by subsequent activities in the job.

§ dataset = project.get\_dataset("mydataset")

§ settings = dataset.get\_settings()

§ for column in settings.schema\_columns:

§     print("Have column name=%s type=%s" % (column["name"], column["type"]))

§ # Now, let's add a new column in the schema

§ settings.add\_raw\_schema\_column({"name" : "test", "type": "string"})

§ # If we changed the settings, we must save

§ settings.save()
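
Before adding columns, it can be useful to check which ones are actually missing. The sketch below works on plain lists of column dicts with the `name` and `type` keys used above. `missing_columns` is a hypothetical helper, not part of the Dataiku API, and the sample columns are illustrative.

```python
def missing_columns(schema_columns, wanted):
    """Return the column dicts from `wanted` whose names are not yet in `schema_columns`."""
    existing = {column["name"] for column in schema_columns}
    return [column for column in wanted if column["name"] not in existing]

# With a live dataset, this would typically be used as:
#   settings = dataset.get_settings()
#   for column in missing_columns(settings.schema_columns, wanted):
#       settings.add_raw_schema_column(column)
#   settings.save()
current = [{"name": "col0", "type": "string"}, {"name": "col1", "type": "string"}]
wanted = [{"name": "col1", "type": "string"}, {"name": "test", "type": "string"}]
to_add = missing_columns(current, wanted)
```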

### Building a dataset

You can start a job in order to build the dataset:

§ dataset = project.get\_dataset("mydataset")

§ # Build the dataset non-recursively and wait for the build to complete.

§ # Returns a dataikuapi.dss.job.DSSJob

§ job = dataset.build()

§ # Builds the dataset recursively

§ dataset.build(job\_type="RECURSIVE\_BUILD")

§ # Build a partition (for partitioned datasets)

§ dataset.build(partitions="partition1")
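
To build several partitions one after another, the call above can be wrapped in a loop. This is a sketch under the assumption that `dataset` provides `build(job_type=..., partitions=...)` as in the examples above. `build_partitions` is a hypothetical helper, not part of the Dataiku API.

```python
def build_partitions(dataset, partitions, job_type=None):
    """Call dataset.build() once per partition and return the results in order.

    `job_type` is only passed to build() when explicitly given, so the
    default build mode is used otherwise.
    """
    jobs = []
    for partition in partitions:
        kwargs = {"partitions": partition}
        if job_type is not None:
            kwargs["job_type"] = job_type
        jobs.append(dataset.build(**kwargs))
    return jobs

# With a live dataset, this would typically be used as:
#   jobs = build_partitions(dataset, ["partition1", "partition2"])
```

Since `build()` waits for completion in the example above, the partitions are built sequentially, which keeps failures easy to attribute to a single partition.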

## Programmatic creation and setup (external datasets)

The API allows you to leverage Dataiku’s automatic detection and configuration capabilities in order to programmatically create datasets or programmatically “autocomplete” the settings of a dataset.

### SQL dataset: Programmatic creation

§ dataset = project.create\_sql\_table\_dataset("mydataset", "PostgreSQL", "my\_sql\_connection", "mytable", "myschema")

§ # At this point, the dataset object has been initialized, but the schema of the underlying table

§ # has not yet been fetched, so the schema of the table and the schema of the dataset are not yet consistent

§ # We run autodetection

§ settings = dataset.autodetect\_settings()

§ # settings is now an object containing the "suggested" new dataset settings, including the completed schema

§ # We can just save the new settings in order to "accept the suggestion"

§ settings.save()

### SQL dataset: Modifying settings

The object returned by `dataikuapi.dss.dataset.DSSDataset.get_settings()` depends on the kind of dataset.

For a SQL dataset, it will be a `dataikuapi.dss.dataset.SQLDatasetSettings`.

§ dataset = project.get\_dataset("mydataset")

§ settings = dataset.get\_settings()

§ # Set the table targeted by this SQL dataset

§ settings.set\_table(connection="myconnection", schema="myschema", table="mytable")

§ settings.save()

§ # If we have changed the table, there is a good chance that the schema is not good anymore, so we must

§ # have DSS redetect it. `autodetect\_settings` will however only detect if the schema is empty, so let's clear it.

§ del settings.schema\_columns[:]

§ settings.save()

§ # Redetect and save the suggestion

§ settings = dataset.autodetect\_settings()

§ settings.save()
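
The sequence above can be packaged into a single function when several datasets need to be retargeted. This is a sketch that only uses the calls shown above (`get_settings`, `set_table`, `schema_columns`, `save`, `autodetect_settings`). `retarget_sql_dataset` is a hypothetical helper, not part of the Dataiku API.

```python
def retarget_sql_dataset(dataset, connection, schema, table):
    """Point a SQL dataset at another table and let DSS redetect its schema.

    Mirrors the steps above: set the table and save, clear the schema and
    save, then run autodetection and save the suggested settings.
    """
    settings = dataset.get_settings()
    settings.set_table(connection=connection, schema=schema, table=table)
    settings.save()

    # Autodetection only runs on an empty schema, so clear it first
    del settings.schema_columns[:]
    settings.save()

    suggested = dataset.autodetect_settings()
    suggested.save()
    return suggested

# With a live dataset, this would typically be used as:
#   retarget_sql_dataset(project.get_dataset("mydataset"),
#                        "myconnection", "myschema", "mytable")
```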

### Files-based dataset: Programmatic creation

#### Generic method for most connections

This applies to all files-based datasets, but may require additional setup.

§ dataset = project.create\_fslike\_dataset("mydataset", "HDFS", "name\_of\_connection", "path\_in\_connection")

§ # At this point, the dataset object has been initialized, but the format is still unknown, and the

§ # schema is empty, so the dataset is not yet usable

§ # We run autodetection

§ settings = dataset.autodetect\_settings()

§ # settings is now an object containing the "suggested" new dataset settings, including the detected format

§ # and completed schema

§ # We can just save the new settings in order to "accept the suggestion"

§ settings.save()

#### Quick helpers for some connections

§ # For S3: allows you to specify the bucket (if the connection does not already force a bucket)

§ dataset = project.create\_s3\_dataset(dataset\_name, connection, path\_in\_connection, bucket=None)

### Uploaded datasets: programmatic creation and upload

§ dataset = project.create\_upload\_dataset("mydataset") # you can add connection= for the target connection

§ with open("localfiletoupload.csv", "rb") as f:

§     dataset.uploaded\_add\_file(f, "localfiletoupload.csv")

§ # At this point, the dataset object has been initialized, but the format is still unknown, and the

§ # schema is empty, so the dataset is not yet usable

§ # We run autodetection

§ settings = dataset.autodetect\_settings()

§ # settings is now an object containing the "suggested" new dataset settings, including the detected format

§ # and completed schema

§ # We can just save the new settings in order to "accept the suggestion"

§ settings.save()
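
When several local files belong to the same uploaded dataset, the upload step can be looped before running autodetection. This is a sketch that assumes the files share one format and that `dataset` provides `uploaded_add_file(f, name)` as above. `upload_files` is a hypothetical helper, not part of the Dataiku API.

```python
import os

def upload_files(dataset, paths):
    """Upload each local file to an uploaded dataset, using its base name as the target name.

    Returns the list of names that were uploaded, in order.
    """
    uploaded = []
    for path in paths:
        name = os.path.basename(path)
        with open(path, "rb") as f:
            dataset.uploaded_add_file(f, name)
        uploaded.append(name)
    return uploaded

# With a live dataset, this would typically be used as:
#   upload_files(dataset, ["data/part1.csv", "data/part2.csv"])
#   dataset.autodetect_settings().save()
```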

### Manual creation

You can create a dataset and set up all of its parameters yourself. We do not recommend using this method.

For example, to load the CSV files of a folder:

§ from os import listdir

§ import pandas

§ project = client.get\_project('TEST\_PROJECT')

§ folder\_path = 'path/to/folder/'

§ for file in listdir(folder\_path):

§     if not file.endswith('.csv'):

§         continue

§     dataset = project.create\_dataset(

§         file[:-4],  # strip the ".csv" extension: dots are not allowed in dataset names

§         'Filesystem',

§         params={

§             'connection': 'filesystem\_root',

§             'path': folder\_path + file

§         },

§         formatType='csv',

§         formatParams={

§             'separator': ',',

§             'style': 'excel',  # excel-style quoting

§             'parseHeaderRow': True

§         })

§     # Read the file only to get the column names; every column is declared as a string

§     df = pandas.read\_csv(folder\_path + file)

§     dataset.set\_schema({'columns': [{'name': column, 'type': 'string'} for column in df.columns]})
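
The example above loads each file with pandas only to obtain the column names. As a lighter alternative, the header row can be read with the standard `csv` module. This is a sketch; `schema_from_csv_header` is a hypothetical helper, not part of the Dataiku API, and the in-memory CSV is illustrative.

```python
import csv
import io

def schema_from_csv_header(text_stream, separator=","):
    """Build a schema dict with one string column per header field of a CSV stream."""
    header = next(csv.reader(text_stream, delimiter=separator))
    return {"columns": [{"name": name, "type": "string"} for name in header]}

# Illustrative in-memory CSV; with a real file, pass open(path, newline="")
sample = io.StringIO("id,name,price\n1,apple,0.5\n")
schema = schema_from_csv_header(sample)
```

The resulting dict has the same shape as the argument passed to `dataset.set_schema(...)` in the example above.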

## Programmatic creation and setup (managed datasets)

Managed datasets are much easier to create because they are managed by DSS.

### Creating a new SQL managed dataset

§ builder = project.new\_managed\_dataset("mydatasetname")

§ builder.with\_store\_into("mysqlconnection")

§ dataset = builder.create()

### Creating a new Files-based managed dataset with a specific schema

§ builder = project.new\_managed\_dataset("mydatasetname")

§ builder.with\_store\_into("myhdfsconnection", format\_option\_id="PARQUET\_HIVE")

§ dataset = builder.create()

### Creating a new partitioned managed dataset

This dataset copies its partitioning from an existing dataset:

§ builder = project.new\_managed\_dataset("mydatasetname")

§ builder.with\_store\_into("myhdfsconnection")

§ builder.with\_copy\_partitioning\_from("source\_dataset")

§ dataset = builder.create()

## Flow handling

For more details on programmatic flow building, please see Flow creation and management.

### Creating recipes from a dataset

This example creates a sync recipe to sync a dataset to another:

§ recipe\_builder = dataset.new\_recipe("sync")

§ recipe\_builder.with\_new\_output("target\_dataset", "target\_connection\_name")

§ recipe = recipe\_builder.create()

§ # recipe is now a dataikuapi.dss.recipe.DSSRecipe, and you can run it

§ recipe.run()

This example creates a code recipe from this dataset:

§ recipe\_builder = dataset.new\_recipe("python")

§ recipe\_builder.with\_script("""

§ import dataiku

§ from dataiku import recipe

§ input\_dataset = recipe.get\_inputs\_as\_datasets()[0]

§ output\_dataset = recipe.get\_outputs\_as\_datasets()[0]

§ df = input\_dataset.get\_dataframe()

§ df = df.groupby("mycol").count()

§ output\_dataset.write\_with\_schema(df)

§ """)

§ recipe\_builder.with\_new\_output\_dataset("target\_dataset", "target\_connection\_name")

§ recipe = recipe\_builder.create()

§ # recipe is now a dataikuapi.dss.recipe.DSSRecipe, and you can run it

§ recipe.run()

## ML & Statistics

### Creating ML models

You can create an ML task in order to train models on a dataset. See Machine learning for more details.

§ dataset = project.get\_dataset('mydataset')

§ mltask = dataset.create\_prediction\_ml\_task("variable\_to\_predict")

§ mltask.train()

### Creating statistics worksheets

For more details, please see Statistics worksheets.

§ dataset = project.get\_dataset('mydataset')

§ ws = dataset.create\_statistics\_worksheet(name="New worksheet")

## Misc operations

### Listing partitions

For partitioned datasets, the list of partitions is retrieved with list\_partitions():

§ partitions = dataset.list\_partitions()

§ # partitions is a list of strings

### Clearing data

The rows of the dataset can be cleared, entirely or on a per-partition basis, with the clear() method.

§ dataset = project.get\_dataset('SOME\_DATASET')

§ dataset.clear(['partition\_spec\_1', 'partition\_spec\_2'])  # clears the specified partitions

§ dataset.clear()  # clears all partitions
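
Listing and clearing can be combined to clear only the partitions that match a condition. This is a sketch that assumes `dataset` provides `list_partitions()` and `clear(partitions)` as shown above. `clear_partitions_where` is a hypothetical helper, not part of the Dataiku API.

```python
def clear_partitions_where(dataset, predicate):
    """Clear the partitions of `dataset` whose identifier satisfies `predicate`.

    Returns the list of cleared partition identifiers. When nothing matches,
    clear() is not called at all, so the whole dataset is never cleared by accident.
    """
    selected = [p for p in dataset.list_partitions() if predicate(p)]
    if selected:
        dataset.clear(selected)
    return selected

# With a live dataset, this would typically be used as:
#   clear_partitions_where(dataset, lambda p: p.startswith("2020-"))
```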

### Hive operations

For datasets associated with a table in the Hive metastore, the table definition in the metastore must be synchronized with the dataset’s schema in DSS before the table is visible to Hive and usable by Impala queries.

§ dataset = project.get\_dataset('SOME\_HDFS\_DATASET')

§ dataset.synchronize\_hive\_metastore()

In the other direction, to update the dataset’s information from Hive:

§ dataset = project.get\_dataset('SOME\_HDFS\_DATASET')

§ dataset.update\_from\_hive()

§ # This will have the updated settings

§ settings = dataset.get\_settings()
