Dataiku DSS - Reference documentation
Welcome to the reference documentation for Dataiku DSS. This documentation contains:
- The main documentation of the concepts, interfaces and features of Dataiku DSS
- Information on how to install and configure Dataiku DSS
- Information for administrators on how to operate Dataiku DSS
You might also find these other resources useful:
- The Knowledge Base covers a variety of topics that can help you learn more about Dataiku DSS or find solutions to problems without having to ask for help.
- The Developer Guide contains all information for developers using Dataiku: how to code in Dataiku, how to create applications, how to operate Dataiku through its APIs, numerous code samples and examples, and reference API documentation.
- The Dataiku Academy provides guided learning paths for you to follow, upskill, and gain certifications on Dataiku DSS.
- The Dataiku Community is a place where you can join the discussion, get support, share best practices and engage with other Dataiku users.
- DSS concepts
-
Connecting to data
- Supported connections
- SQL databases
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
- Upload your files
- HDFS
- Cassandra
- MongoDB
- Elasticsearch
- File formats
- Managed folders
- “Files in folder” dataset
- Metrics dataset
- Internal stats dataset
- “Editable” dataset
- kdb+
- FTP
- SCP / SFTP (aka SSH)
- HTTP
- HTTP (with cache)
- Server filesystem
- Dataset plugins
- Making relocatable managed datasets
- Clearing non-managed Datasets
- Data ordering
- PI System / PIWebAPI server
- Google Sheets
- Data transfer on Dataiku Cloud
- Exploring your data
- Schemas, storage types and meanings
- Data preparation
- Charts
- Interactive statistics
-
Machine learning
- Prediction (Supervised ML)
- Clustering (Unsupervised ML)
- Automated machine learning
- Model Settings Reusability
- Features handling
- Algorithms reference
- Advanced models optimization
- Models ensembling
- Model Document Generator
- Time Series Forecasting
- Causal Prediction
- Deep Learning
- Models lifecycle
- Scoring engines
- Writing custom models
- Exporting models
- Partitioned Models
- ML Diagnostics
- Computer vision
- Image labeling
- The Flow
-
Visual recipes
- Prepare: Cleanse, Normalize, and Enrich
- Sync: copying datasets
- Grouping: aggregating data
- Window: analytics functions
- Distinct: get unique rows
- Join: joining datasets
- Fuzzy join: joining two datasets
- Geo join: joining datasets based on geospatial features
- Splitting datasets
- Top N: retrieve first N rows
- Stacking datasets
- Sampling datasets
- Sort: order values
- Pivot recipe
- Generate features
- Push to editable recipe
- Download recipe
- List Folder Contents
- Recipes based on code
- Code notebooks
- MLOps
- Webapps
- Code Studios
- Code reports
- Dashboards
- Workspaces
- Data Catalog
- Dataiku Applications
- Working with partitions
- DSS and SQL
- DSS and Python
- DSS and R
- DSS and Spark
- Code environments
- Collaboration
- Time Series
- Geographic data
- Generative AI and LLM Mesh
-
Text & Natural Language Processing
- Language Detection
- Named Entities Extraction
- Sentiment Analysis
- Translation
- Text summarization
- Key phrase extraction
- Ontology Tagging
- Spell checking
- OpenAI GPT
- Machine Learning with Text features
- Text extraction
- OCR (Optical Character Recognition)
- Speech-to-Text
- Text cleaning
- Text Embedding
- NLP using AWS APIs
- NLP using Azure APIs
- NLP with Crowlingo API
- NLP using Deepl API
- NLP using Google APIs
- NLP with MeaningCloud API
- Images
- Audio
- Video
- Automation scenarios, metrics, and checks
- Production deployments and bundles
-
API Node & API Deployer: Real-time APIs
- Introduction
- Concepts
- Installing API nodes
- Setting up the API Deployer and deployment infrastructures
- First API (with API Deployer)
- First API (without API Deployer)
- Deploying to an external platform
- Types of Endpoints
- Enriching prediction queries
- Documenting your API endpoints
- Security
- Managing versions of your endpoint
- Deploying on Kubernetes
- APINode APIs reference
- Operations reference
- Governance
- Python APIs
- R API
- Public REST API
- Additional APIs
- Installing and setting up
-
Elastic AI computation
- Concepts
- Initial setup
- Managed Kubernetes clusters
- Using Amazon Elastic Kubernetes Service (EKS)
- Using Microsoft Azure Kubernetes Service (AKS)
- Using Google Kubernetes Engine (GKE)
- Using code envs with containerized execution
- Dynamic namespace management
- Customization of base images
- Unmanaged Kubernetes clusters
- Using Openshift
- Using NVIDIA DGX Systems
- Troubleshooting
- Using Docker instead of Kubernetes
- DSS in the cloud
-
DSS and Hadoop
- Setting up Hadoop integration
- Connecting to secure clusters
- Hadoop filesystems connections (HDFS, S3, EMRFS, WASB, ADLS, GS)
- Hive
- Impala
- Spark
- Hive datasets
- Hadoop user isolation
- Distribution-specific notes
- Teradata Connector For Hadoop
- Multiple Hadoop clusters
- Dynamic AWS EMR clusters
- Dynamic Google Dataproc clusters
- Metastore catalog
-
Operating DSS
- dsscli tool
- The data directory
- Backing up
- Audit trail
- The runtime databases
- Logging in DSS
- DSS Macros
- Managing DSS disk usage
- Understanding and tracking DSS processes
- Tuning and controlling memory usage
- Using cgroups for resource control
- Monitoring DSS
- HTTP proxies
- DSS license
- Compute resource usage reporting
- Security
- User Isolation
- Plugins
- Streaming data
-
Formula language
- Basic usage
- Reading column values
- Variables typing and autotyping
- Boolean values
- Operators
- Array and object operations
- Object notations
- DSS variables
- Array functions
- Boolean functions
- Date functions
- Math functions
- Object functions
- String functions
- Geometry functions
- Value access functions
- Control structures
- Tests
- Custom variables expansion
- Sampling methods
- Accessibility
- Troubleshooting
-
Release notes
- DSS 12 Release notes
- DSS 11 Release notes
- DSS 10.0 Release notes
- DSS 9.0 Release notes
- DSS 8.0 Release notes
- DSS 7.0 Release notes
- DSS 6.0 Release notes
- DSS 5.1 Release notes
- DSS 5.0 Release notes
- DSS 4.3 Release notes
- DSS 4.2 Release notes
- DSS 4.1 Release notes
- DSS 4.0 Release notes
- DSS 3.1 Release notes
- DSS 3.0 Release notes
- DSS 2.3 Release notes
- DSS 2.2 Release notes
- DSS 2.1 Release notes
- DSS 2.0 Release notes
- DSS 1.4 Release notes
- DSS 1.3 Release notes
- DSS 1.2 Release notes
- DSS 1.1 Release notes
- DSS 1.0 Release notes
- Pre versions
- Other Documentation
- Third-party acknowledgements