Reference architecture: managed compute on GKE and storage on GCS
Overview
This architecture document explains how to deploy:
- A DSS instance running on a Google Compute Engine (GCE) virtual machine
- Dynamically-spawned Google Kubernetes Engine (GKE) clusters for computation (Python and R recipes/notebooks, in-memory visual ML, visual and code Spark recipes, Spark notebooks)
- Ability to store data in Google Cloud Storage (GCS)
Security
The dssuser needs to be authenticated on the GCE machine hosting DSS with a GCP Service Account that has sufficient permissions to:

- manage GKE clusters
- push Docker images to Google Container Registry (GCR)
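To confirm which Service Account the machine runs under, you can query it directly on the instance; a minimal check, assuming the Cloud SDK is available (it ships with standard GCE images):

```
# Show the credentials gcloud is using on the instance
gcloud auth list

# Query the metadata server for the Service Account attached to this VM
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
```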
Main steps
Prepare the instance
- Set up a CentOS 7 GCE machine (an example gcloud invocation follows this list) and make sure that:
  - you select the right Service Account
  - you set the access scope to “read + write” for the Storage API
- Install and configure Docker CE
- Install kubectl
- Set up a non-root account for the dssuser
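A sketch of these preparation steps from the command line; the instance name, zone, machine type and Service Account are placeholders, and the Docker CE yum repository is the standard upstream one:

```
# Create the CentOS 7 host with the Service Account and the
# Storage read/write scope selected up front (all names are placeholders)
gcloud compute instances create dss-host \
    --zone us-central1-a \
    --machine-type n1-standard-8 \
    --image-family centos-7 --image-project centos-cloud \
    --service-account dss-sa@your-gcp-project.iam.gserviceaccount.com \
    --scopes cloud-platform,storage-rw

# On the instance: install and start Docker CE
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce
sudo systemctl enable --now docker

# Install kubectl (here from the Kubernetes yum repository, configured beforehand)
sudo yum install -y kubectl

# Create the non-root dssuser and allow it to drive Docker without sudo
sudo useradd dssuser
sudo usermod -aG docker dssuser
```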
Install DSS
- Download DSS, together with the “generic-hadoop3” standalone Hadoop libraries and the standalone Spark binaries
- Install DSS, see Installing and setting up
- Build the base container-exec and Spark images, see Initial setup (a sketch of these commands follows this list)
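The installation typically boils down to the following commands, run as the dssuser; VERSION, the data directory and the archive paths are placeholders to adapt:

```
# Unpack and install DSS
tar xzf dataiku-dss-VERSION.tar.gz
dataiku-dss-VERSION/installer.sh -d /home/dssuser/dss_data -p 11000

# Add the standalone Hadoop libraries and Spark binaries
/home/dssuser/dss_data/bin/dssadmin install-hadoop-integration \
    -standaloneArchive /path/to/dataiku-dss-hadoop3-standalone-libs-generic-VERSION.tar.gz
/home/dssuser/dss_data/bin/dssadmin install-spark-integration \
    -standaloneArchive /path/to/dataiku-dss-spark-standalone-VERSION.tar.gz

# Build the base images used for containerized execution and Spark-on-K8S
/home/dssuser/dss_data/bin/dssadmin build-base-image --type container-exec
/home/dssuser/dss_data/bin/dssadmin build-base-image --type spark
```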
Set up the containerized execution configuration in DSS
- Create a new “Kubernetes” containerized execution configuration
- Set gcr.io/your-gcp-project as the “Image registry URL”
- Push base images
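Pushing the base images requires the local Docker daemon to be able to authenticate against GCR; with the Cloud SDK this is a single command:

```
# Register gcloud as a Docker credential helper for *.gcr.io registries
gcloud auth configure-docker
```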
Set up Spark and the metastore in DSS
- Create a new Spark configuration and enable “Managed Spark-on-K8S”
- Set gcr.io/your-gcp-project as the “Image registry URL”
- Push base images
- Set the metastore catalog to “Internal DSS catalog”
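For reference, “Managed Spark-on-K8S” ultimately drives the standard Spark-on-Kubernetes properties; the values below are purely illustrative (the endpoint and image name are hypothetical, DSS fills in the real ones for you):

```
spark.master=k8s://https://<cluster-endpoint>
spark.kubernetes.container.image=gcr.io/your-gcp-project/dku-spark-base:latest
spark.kubernetes.authenticate.driver.serviceAccountName=default
```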
Set up GCS connections
- Set up as many GCS connections as required, with appropriate credentials and permissions
- Make sure that “GS” is selected as the HDFS interface
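A quick sanity check that the Service Account attached to the instance can actually reach the storage (the bucket name is a placeholder):

```
# List the bucket using the instance's Service Account credentials
gsutil ls gs://your-dss-bucket
```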
Install GKE plugin
- Install the GKE plugin
- Create a new “GKE connections” preset and fill in:
  - the GCP project key
  - the GCP zone
- Create a new “Node pools” preset and fill in:
  - the machine type
  - the number of nodes
Create your first cluster
- Create a new cluster, select “create GKE cluster” and enter the desired name
- Select the previously created presets and create the cluster
- Cluster creation takes around 5 minutes
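Once provisioning finishes, you can verify the result from the DSS host; the cluster name is whatever you entered above:

```
# The new cluster should be listed as RUNNING
gcloud container clusters list

# If kubectl is not yet pointed at it, fetch credentials first:
#   gcloud container clusters get-credentials <cluster-name> --zone <zone>
kubectl get nodes
```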
Use your cluster
- Create a new DSS project and configure it to use your newly created cluster
- You can now perform all Spark operations over Kubernetes
- GCS datasets that are built are synced to the local DSS metastore
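While a Spark job from the project is running, you can watch the driver and executor pods appear on the cluster:

```
# Driver and executor pods show up for the duration of the job
kubectl get pods -w
```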