How Spark Works

Cluster Overview

Architecture Components

Driver Program

The process running the main() function of the application and creating the SparkContext.

Cluster Manager

An external service for acquiring resources on the cluster (e.g., standalone manager, YARN, Kubernetes).

Worker Node

Any node that can run application code in the cluster.

Executor

A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.

See Cluster Mode Overview for more details on the architecture components.
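
To make these roles concrete, here is a minimal PySpark sketch of a driver program. The master URL spark://host:7077 is a placeholder for whichever cluster manager you use; the rest is standard DataFrame code.

```python
from pyspark.sql import SparkSession

# The process running this script is the driver program. Creating the
# SparkSession also creates the SparkContext, which asks the cluster
# manager for executors on the worker nodes.
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("spark://host:7077")  # placeholder cluster manager URL
    .getOrCreate()
)

# Transformations only build a plan on the driver; nothing runs yet.
df = spark.range(1_000_000).selectExpr("id % 10 AS bucket")

# The action triggers a job: the driver splits it into tasks that the
# executors run, keeping partitions in memory or on disk between stages.
print(df.groupBy("bucket").count().collect())

spark.stop()
```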

Kubeflow Spark Operator

Spark 2.3 introduced Kubernetes as an official scheduler backend, but managing Spark application lifecycles on Kubernetes remained challenging. Unlike traditional workloads such as Deployments and StatefulSets, Spark applications require different approaches for submission, execution, and monitoring, which makes them difficult to manage idiomatically within Kubernetes environments.

The Kubeflow Spark Operator addresses these challenges by implementing the operator pattern to manage Spark applications declaratively. It allows users to specify Spark applications in YAML files without dealing with complex spark-submit processes, while providing native Kubernetes-style status tracking and monitoring capabilities that align with other Kubernetes workloads.
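
As a sketch of what that declarative workflow looks like in practice, the snippet below uses the official kubernetes Python client to create a SparkApplication custom resource (the sparkoperator.k8s.io/v1beta2 API the operator serves). The image, application jar, namespace, and resource sizes are illustrative placeholders.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

# The same fields you would write in a SparkApplication YAML manifest,
# expressed as a dict; the values below are placeholders.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-pi", "namespace": "default"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "spark:3.5.0",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples.jar",
        "sparkVersion": "3.5.0",
        "driver": {"cores": 1, "memory": "512m", "serviceAccount": "spark"},
        "executor": {"cores": 1, "instances": 2, "memory": "512m"},
    },
}

# Submitting the application is just creating the resource; the operator
# watches SparkApplication objects, turns them into driver and executor
# pods, and reports status back on the object.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="default",
    plural="sparkapplications",
    body=spark_app,
)
```

Applying the equivalent YAML with kubectl achieves the same result; either way, the application's lifecycle becomes a Kubernetes object whose status can be watched like any other workload.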

[Figure: Architecture diagram of the Kubeflow Spark Operator]

Spark Connect

Introduced in Spark 3.4, Spark Connect is a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API, with unresolved logical plans as the wire protocol. Because the protocol is language-agnostic, the Spark Connect API can be implemented by clients in any language.
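
A minimal client sketch, assuming pyspark 3.4+ with the Spark Connect extras installed and a Spark Connect server listening on the default port 15002 (localhost here is a placeholder):

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server instead of embedding a
# driver in this process; "sc://" is the Spark Connect URI scheme.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# DataFrame operations are encoded as unresolved logical plans and sent
# to the server, which analyzes, optimizes, and executes them on the cluster.
df = spark.range(10).filter("id % 2 = 0")

# Only the results travel back to this thin client.
print(df.collect())
```

Because the protocol carries plans rather than JVM internals, the client can be a thin library embedded in application servers, notebooks, or IDEs.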

How Spark Connect Works