# DSS and Hadoop

* Setting up Hadoop integration

+ Prerequisites

- Supported distributions

- Distributions not officially supported

- Software install

- HDFS

- Hive

+ Testing Hadoop connectivity prior to installation

- hive binary

+ Setting up DSS Hadoop integration

- Test HDFS connection

- Standalone Hadoop integration

- Configure Hive connectivity

- Configure Impala connectivity

+ Pig support

+ Secure Hadoop connectivity

* Connecting to secure clusters

+ Set up the DSS Kerberos account

+ Configure DSS for Hadoop security

- Test HDFS connection

- Configure Hive connectivity

- Configure Impala connectivity

+ Modification of principal or keytab

+ Advanced settings (optional)

- Configuring Kerberos credentials periodic renewal

* Hadoop filesystems connections (HDFS, S3, EMRFS, WASB, ADLS, GS)

+ HDFS connections in DSS

- Managed datasets setup

+ Connecting to the “default” FS

+ Connecting to the HDFS of other clusters

+ Connecting to S3

- Using S3A

- Using EMRFS

- Using VPC Endpoints

+ Connecting to Azure Blob Storage

+ Connecting to Google Cloud Storage

+ Connecting to Azure Data Lake Store (gen1)

+ Connecting to Azure Data Lake Store (gen2)

+ Additional details

- Cloud storage credentials

- Checking access to a Hadoop filesystem

- Relation to the Hive metastore

* Hive

+ Interaction with the Hive global metastore

+ Synchronisation to the Hive metastore

- For external datasets

+ Importing from the Hive metastore

+ Hive execution engines

- Notebooks and metrics

- Recipes

* Hiveserver 2

* Hive CLI (global metastore)

* Hive CLI (isolated metastore)

* Choosing the mode

* Configuring the mode

+ Support for Hive authentication modes

+ Support for Hive authorization modes

- No Hive security (No DSS User Isolation)

- Sentry (No DSS User Isolation)

* With ACL synchronization (recommended)

* With permissions inheritance

- Sentry (DSS User Isolation enabled)

- Ranger (No DSS User Isolation)

- Ranger (DSS User Isolation enabled)

- Storage-based security (No DSS User Isolation)

- Cloudera-specific note

+ Supported file formats

- Limitations

+ Internal details

* Impala

+ Impala connectivity

+ Metastore synchronization

+ Supported formats and limitations

+ Configuring connection to Impala servers

- Kerberos authentication (secure clusters)

- LDAP authentication

+ Using Impala to write outputs

- No Hive authorization (DSS regular security)

- Sentry (DSS regular security)

* With ACL synchronization (recommended)

* With permissions inheritance

- Sentry (DSS User Isolation Framework)

- Switching from write-through-DSS to write-through-Impala

* Spark

+ Spark provided by your Hadoop distribution

- Verify the installation

+ Manual Spark setup

- Prepare your Spark environment

- Set up Spark integration with DSS

- Verify the installation

+ Additional topics

- Caveat for RedHat / CentOS 6.x clusters

- Metastore security

- Configure Spark logging

* Hive datasets

+ Use cases

- Hive views

- No read access on source files

- ACID tables (ORC)

- DATE and DECIMAL data types

+ Creating a Hive dataset

- New dataset

- Import

+ Using a Hive dataset

- Hive recipes

- Visual recipes with Hive as execution engine

- Spark recipes

- Visual recipes with Spark as execution engine

- Limitations

* Hadoop user isolation

* Distribution-specific notes

+ Cloudera CDP

- Spark support

- Security

+ Cloudera CDH

- Security

* DSS regular security and Sentry

- Scala notebook

- S3 datasets and Spark 2

- Impala

+ Hortonworks HDP

- HDP 3.1 support

- Limitations

- Security

* DSS regular security and Ranger

* DSS User Isolation Framework and Ranger

- Migrating to HDP 3.X

+ Amazon Elastic MapReduce

- Supported versions

- Security

- Deployment scenarios

* Let DSS dynamically manage one or several EMR clusters

* Connect DSS to an existing EMR cluster

+ DSS running on one of the cluster nodes

+ DSS outside of the cluster

* Connect DSS to multiple existing EMR clusters

- Using EMRFS

* EMRFS credentials

+ Google Cloud Dataproc

- Security

- Known limitations

- Connecting DSS to Cloud Dataproc

* DSS running on one of the cluster nodes

* DSS outside of the cluster

* Teradata Connector For Hadoop

+ Installation and configuration

+ Usage and Guidelines

+ Limitations

* Multiple Hadoop clusters

+ Concepts

- Builtin cluster

- Additional clusters

- Managed dynamic clusters

- Use an additional cluster

* Per-scenario additional clusters

+ Restrictions

+ Define an additional static cluster

- Hadoop

- Hive

- Impala

- Spark

+ Add a dynamic additional cluster

+ Use a specific or dynamic cluster for scenarios

- Use a specific static cluster

- Use a dynamic cluster

+ Permissions

* Dynamic AWS EMR clusters

+ Prerequisites and limitations

+ Create your first cluster

- Machine setup

- AWS credentials

- Install the plugin

- Define EMRFS connections

- Create the cluster and configure it

- Use your cluster

- Stop the cluster

+ Using dynamic EMR clusters for scenarios

+ Cluster actions

- Manual run

- As part of a scenario

+ Advanced settings

- Security settings

- Metastore

- Tags

- Misc

* Dynamic Google Dataproc clusters

+ Prerequisites and limitations

+ Create your first cluster

- Before Running

- Install the plugin

- Define GCS connections

- Create the cluster and configure it

- Use your cluster

- Stop the cluster

+ Using dynamic Dataproc clusters for scenarios

+ Cluster actions

- Manual run

- As part of a scenario

+ Advanced settings

- Hive metastore

- Labels
