Spark on Kubernetes – An overview

What is Spark on Kubernetes?

Kubernetes (also known as Kube or k8s) is an open-source container orchestration system initially developed at Google, open-sourced in 2014 and maintained by the Cloud Native Computing Foundation. Kubernetes is used to automate deployment, scaling and management of containerized apps – most commonly Docker containers.

Spark on Kubernetes takes advantage of these Kubernetes features, with Kubernetes replacing Yarn, Mesos, etc. as the resource management and scheduling mechanism.

Motivations behind Spark on Kubernetes:

  • Containerized Spark that shares resources across all data engineering and machine learning jobs
  • Out-of-the-box support for multiple Spark versions, Python versions, and version-controlled containers on shared K8s clusters
  • A single, unified infrastructure for both the majority of batch workloads and microservices
  • Fine-grained access controls on shared clusters

Challenges in Spark on Yarn solved by Kubernetes:

Based on the above diagram, Kubernetes addresses the following challenges seen with Yarn:

  • Providing consistency from run to run
  • Isolating performance between different tenants
  • Avoiding the OS file cache penalty incurred when different executors run under a common node manager

Spark on Kubernetes Architecture

The above architecture flow can be broken down into the following steps:

  • The client runs spark-submit with arguments
  • The arguments are converted into a pod spec for the driver
  • The scheduler backend on Kubernetes requests executor pods in batches
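
The steps above can be sketched in a few lines of Python. This is only an illustration: the API server address, container image, and jar path are placeholder values, and the function names are made up for this example. The `--master k8s://...`, `--deploy-mode` and `spark.kubernetes.container.image` options are standard Spark settings. The second helper merely mimics how executor requests could be split into batches (Spark exposes the batch size as `spark.kubernetes.allocation.batch.size`); it is not Spark's actual scheduler backend code.

```python
from typing import Dict, List, Optional


def build_spark_submit(master_url: str, image: str, app_jar: str,
                       executors: int,
                       extra_conf: Optional[Dict[str, str]] = None) -> List[str]:
    """Assemble a spark-submit argument list for cluster mode on Kubernetes."""
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
    }
    conf.update(extra_conf or {})

    args = ["spark-submit",
            "--master", f"k8s://{master_url}",
            "--deploy-mode", "cluster"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_jar)
    return args


def executor_batches(total: int, batch_size: int = 5) -> List[int]:
    """Split a total executor count into request batches (illustrative only)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = []
    remaining = total
    while remaining > 0:
        size = min(batch_size, remaining)
        batches.append(size)
        remaining -= size
    return batches


if __name__ == "__main__":
    # All values below are placeholders, not real endpoints or images.
    cmd = build_spark_submit(
        master_url="https://k8s.example.com:6443",
        image="example/spark:3.0.1",
        app_jar="local:///opt/spark/app/app.jar",
        executors=12,
    )
    print(" ".join(cmd))
    print(executor_batches(12))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.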

Below is a Kubernetes terminology view of the flow.

However, the problem with this approach is that pods are launched one by one, so an application can get stuck when executor pods are unavailable. Various techniques are used to resolve this, as below:

  • Gang scheduling – Schedule pods application by application instead of scheduling individual driver/executor pods in a FIFO manner
  • YuniKorn – Maintains job ordering, resource capacity management and fairness
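
The difference between pod-by-pod placement and gang scheduling can be illustrated with a toy model. This sketch is not the algorithm used by Kubernetes or YuniKorn; it assumes a cluster with a fixed number of pod slots and, as a simplification, treats an application as able to run only once all of its pods (driver plus executors) are placed. The application names and sizes are invented for the example.

```python
from typing import Dict, List, Tuple

# (application name, pods needed = 1 driver + N executors); hypothetical values
App = Tuple[str, int]


def schedule_pod_by_pod(apps: List[App], capacity: int) -> Dict[str, int]:
    """Place one pod at a time, cycling through the apps, until slots run out."""
    granted = {name: 0 for name, _ in apps}
    free = capacity
    progress = True
    while free > 0 and progress:
        progress = False
        for name, needed in apps:
            if free > 0 and granted[name] < needed:
                granted[name] += 1
                free -= 1
                progress = True
    return granted


def schedule_gang(apps: List[App], capacity: int) -> Dict[str, int]:
    """Place an app only if all of its pods fit; keep submission order."""
    granted = {name: 0 for name, _ in apps}
    free = capacity
    for name, needed in apps:
        if needed > free:
            break  # stop here so later apps do not jump the queue
        granted[name] = needed
        free -= needed
    return granted


def fully_placed(apps: List[App], granted: Dict[str, int]) -> List[str]:
    """Names of the apps that received every pod they asked for."""
    return [name for name, needed in apps if granted[name] == needed]


if __name__ == "__main__":
    apps = [("etl", 4), ("ml", 4)]
    one_by_one = schedule_pod_by_pod(apps, capacity=6)
    gang = schedule_gang(apps, capacity=6)
    print(one_by_one, fully_placed(apps, one_by_one))
    print(gang, fully_placed(apps, gang))
```

With six slots and two applications needing four pods each, the pod-by-pod variant leaves both applications with three pods and neither fully placed, while the gang variant places the first application completely and makes the second wait.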

Spark 3 Support for Kubernetes

Kubernetes support began in v2.3.x. The new features in Spark 3 are as below:

  • Pod templates
  • GPU-aware scheduling (isolation at the GPU pod level)
  • Improved behaviour with dynamic allocation
  • The Kerberos authentication protocol is now supported in the Kubernetes resource manager
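
As a rough illustration of how these features surface in configuration, the sketch below collects a few of the related Spark 3 properties into a dictionary and renders them as `--conf` flags. The property names come from the Spark configuration documentation; the function names, template directory, file names, and Kerberos principal are hypothetical placeholders, and a real job would typically need further settings (for example a GPU discovery script and a krb5 configuration).

```python
from typing import Dict, List


def spark3_k8s_conf(template_dir: str, gpus_per_executor: int,
                    principal: str, keytab: str) -> Dict[str, str]:
    """Collect example Spark 3 settings for the features listed above."""
    return {
        # Pod templates: YAML files used as the base for driver/executor pods
        "spark.kubernetes.driver.podTemplateFile": f"{template_dir}/driver.yaml",
        "spark.kubernetes.executor.podTemplateFile": f"{template_dir}/executor.yaml",
        # GPU-aware scheduling: request GPUs per executor from a given vendor
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.executor.resource.gpu.vendor": "nvidia.com",
        # Dynamic allocation, with shuffle tracking in place of an external shuffle service
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        # Kerberos: principal and keytab used to obtain credentials
        "spark.kerberos.principal": principal,
        "spark.kerberos.keytab": keytab,
    }


def to_conf_flags(conf: Dict[str, str]) -> List[str]:
    """Render a configuration dictionary as spark-submit --conf arguments."""
    flags: List[str] = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    # Placeholder paths and principal, for illustration only.
    conf = spark3_k8s_conf("/opt/spark/templates", 1,
                           "spark@EXAMPLE.COM", "/etc/security/spark.keytab")
    print(" ".join(to_conf_flags(conf)))
```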

For step-by-step commands to run Spark on Kubernetes, see the Apache documentation here.

P.S.: Some of the content above draws on blogs from Cloudera, Palantir, Data Mechanics, etc.

October 18, 2020