Reference architecture: manage compute on AKS and storage on ADLS gen2
Overview
This architecture document explains how to deploy:
- A DSS instance running on an Azure virtual machine
- Dynamically-spawned Azure Kubernetes Service (AKS) clusters for computation (Python and R recipes/notebooks, in-memory visual ML, visual and code Spark recipes, Spark notebooks)
- The ability to store data in Azure Data Lake Storage (ADLS) gen2
Security
We assume that all operations described here are done within a common Azure Resource Group (RG), in which the Service Principal (SP) you are using has sufficient permissions to:
- Manage AKS clusters
- Push Docker images to Azure Container Registry (ACR)
- Read from and write to ADLS gen2
Main steps
Prepare the instance
- Set up a CentOS 7 Azure VM in your target RG
- Install and configure Docker CE
- Install kubectl
- Set up a non-root user account, dssuser, to run DSS
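The instance preparation steps above can be sketched as shell commands. This is a sketch, not an exact procedure: the Docker repository URL, the kubectl download method, and the dssuser name are assumptions to adapt to your environment.

```shell
# Install Docker CE from the upstream CentOS repository (assumed repo URL)
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce
sudo systemctl enable --now docker

# Install kubectl (assumed download method; pin the version you need)
curl -LO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Create the non-root dssuser and allow it to use Docker
sudo useradd dssuser
sudo usermod -aG docker dssuser
```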
Install DSS
- Download DSS, together with the “generic-hadoop3” standalone Hadoop libraries and the standalone Spark binaries
- Install DSS, see Installing and setting up
- Build base container-exec and Spark images, see Initial setup
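As dssuser, the download-and-install steps can be sketched as follows. Version strings, ports, and paths are placeholders, not the exact names of your downloaded archives.

```shell
# Unpack and install DSS into a data directory (version, path, and port are placeholders)
tar xzf dataiku-dss-VERSION.tar.gz
dataiku-dss-VERSION/installer.sh -d /home/dssuser/dss_data -p 11000

# Install the standalone Hadoop and Spark integrations from the downloaded archives
/home/dssuser/dss_data/bin/dssadmin install-hadoop-integration \
    -standaloneArchive /path/to/dataiku-dss-hadoop3-standalone-libs-generic-VERSION.tar.gz
/home/dssuser/dss_data/bin/dssadmin install-spark-integration \
    -standaloneArchive /path/to/dataiku-dss-spark-standalone-VERSION.tar.gz

/home/dssuser/dss_data/bin/dss start
```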
Setup containerized execution configuration in DSS
- Create a new “Kubernetes” containerized execution configuration
- Set your-cr.azurecr.io as the “Image registry URL”
- Push base images
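Before pushing, the base image for containerized execution must exist locally. A minimal sketch, run from the DSS data directory (the path is an assumption):

```shell
# From the DSS data directory: build the base image for containerized execution
cd /home/dssuser/dss_data
./bin/dssadmin build-base-image --type container-exec
```

Pushing the built image to your-cr.azurecr.io is then triggered from the DSS UI via the “Push base images” action.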
Setup Spark and metastore in DSS
- Create a new Spark configuration and enable “Managed Spark-on-K8S”
- Set your-cr.azurecr.io as the “Image registry URL”
- Push base images
- Set the metastore catalog to “Internal DSS catalog”
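As with containerized execution, the Spark base image must be built before it can be pushed. A sketch, again assuming the data directory path used above:

```shell
# From the DSS data directory: build the base image for Spark-on-K8S
cd /home/dssuser/dss_data
./bin/dssadmin build-base-image --type spark
```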
Setup ADLS gen2 connections
- Set up as many Azure Blob Storage connections as required, with appropriate credentials and permissions
- Make sure that “ABFS” is selected as the HDFS interface
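If you still need to create the underlying storage account, it must have the hierarchical namespace enabled to qualify as ADLS gen2. A sketch using the Azure CLI (account name, RG, location, and SKU are placeholders):

```shell
# Create a StorageV2 account with the hierarchical namespace (ADLS gen2) enabled
az storage account create \
  --name yourstorageaccount \
  --resource-group your-rg \
  --location westeurope \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true
```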
Install AKS plugin
- Install the AKS plugin
- Create a new “AKS connection” preset and fill in:
  - the Azure subscription ID
  - the tenant ID
  - the client ID
  - the password (client secret)
- Create a new “Node pools” preset and fill in:
  - the machine type
  - the default number of nodes
Create your first cluster
- Create a new cluster, select “Create AKS cluster” and enter the desired name
- Select the previously created presets
- In the “Advanced options” section, enter an IP range for the Service CIDR (e.g. 10.0.0.0/16) and an IP address for the DNS IP (e.g. 10.0.0.10)
- Click “Start/attach”. Cluster creation takes between 5 and 10 minutes.
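Once creation completes, you can verify the cluster from the instance. A sketch, assuming the RG and cluster names used when creating it:

```shell
# Fetch kubeconfig credentials for the new cluster, then check node health
az aks get-credentials --resource-group your-rg --name your-cluster
kubectl get nodes
kubectl get pods --all-namespaces
```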
Use your cluster
- Create a new DSS project and configure it to use your newly created cluster
- You can now perform all Spark operations over Kubernetes
- ADLS gen2 datasets that are built are synchronized to the local DSS metastore