🇬🇧 Understanding Spark: Unveiling the Behind-the-Scenes of Distributed Processing

🇧🇷 – to read this article in Portuguese, click here

If you’re here, you’ve probably had some contact with Spark or at least heard about its use in processing large volumes of data. This article is not a beginner’s tutorial, nor will it teach you basic commands. My goal is to explore what really happens behind the scenes in Spark, unraveling the concepts that make this tool so powerful. We’ll discuss the role of workers, cores, partitions, how distributed processing works, the difference between narrow and wide transformations, and how to optimize resources to get the best performance.

First of all, it’s worth understanding what a Spark cluster is, as it is the heart of any distributed operation. A Spark cluster is essentially a set of machines working together to process data in parallel. It consists of different roles: the driver, which coordinates execution, and the executors, which actually process the data. Each executor runs on a worker node and is responsible for a part of the work, utilizing resources like CPU (cores) and memory. This division of tasks is what allows Spark to scale and handle massive datasets.

Now that you know where everything happens, let’s dive into the inner workings. How do workers receive and process data? How does Spark decide how many partitions to create and how to distribute the work? Understanding this is crucial for adjusting configurations and avoiding bottlenecks, whether in resource allocation or in using efficient transformations.

Driver

In a Spark cluster, the driver is the brain of the operation. It coordinates everything: sends tasks to the workers (we’ll talk more about them), tracks progress, and collects results. When you write Spark code, the driver is the one that interprets your instructions and transforms the work into distributed stages.

The driver runs on the machine you use to submit the job (client mode) or on a dedicated node of the cluster (cluster mode), depending on the deploy mode. It maintains the SparkContext, which acts as a command center, and it builds the DAG (Directed Acyclic Graph) that represents the order of operations to be executed.

An important detail: the driver needs sufficient resources. If too much data flows back to it, for example through large collect() calls, and it doesn’t have enough memory, it becomes a bottleneck or fails outright. It is also the driver that decides how the data is split into partitions (we’ll talk more about them) and which tasks go to which workers. Therefore, a well-sized driver is just as crucial as the executors in your cluster.
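To make this concrete, here is a minimal sketch of where driver-side settings usually live (the values and app name are hypothetical; note that the driver’s heap size itself normally has to be fixed at submit time, e.g. spark-submit --driver-memory 8g, before the driver JVM starts):

from pyspark.sql import SparkSession

# Hypothetical values, for illustration only.
spark = (
    SparkSession.builder
    .appName("driver-sizing-example")
    # Cap on how much data actions like collect() may bring back to the driver.
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate()
)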

Workers

The workers are the machines that do the heavy lifting in Spark. Each worker executes specific tasks, processing parts of the data distributed by the driver. Inside a worker, you have the executors (we’ll talk more about them), which are responsible for running the tasks and storing the data necessary for processing.

Think of the workers as “arms” that receive orders from the driver and get to work. They use CPU (cores) and available memory to process partitions of data. The more workers and resources you have, the greater the processing power.

What happens behind the scenes is that the driver breaks the work into many small tasks, distributes these tasks to the workers, and monitors progress. If a worker fails, Spark can redistribute the task to another, ensuring the system’s resilience.

Cores

The cores are the processing units within each worker. They represent how many tasks can be executed simultaneously in an executor. Each executor gets a number of cores and uses them to process the data partitions.

In Spark, more cores mean greater parallelism. If an executor has 4 cores, it can run up to 4 tasks simultaneously. However, it’s no use overdoing it: the number of cores needs to be balanced with the number of partitions and the cluster size. If you give too many cores to too few executors, you could end up wasting resources.

An important detail: tasks that depend on complex operations, like wide transformations (we’ll talk more about them), can consume more CPU. Therefore, adjusting the number of cores per executor is essential to avoid bottlenecks and maximize performance.

Calculation:

Calculating the ideal number of workers and cores in Spark is crucial to make the most of the cluster’s resources. The basic formula is:

Total available cores = Number of workers Ă— Cores per worker.

If you configure your job with 5 workers and 8 cores per worker, you will have a total of 5 × 8 = 40 cores available to process tasks in parallel.

When configuring, remember:

  • Leave 1 core per worker for the operating system, i.e., use 7 cores instead of 8.
  • The number of tasks should be greater than the number of available cores to ensure that all partitions are processed without bottlenecks.
  • This balanced distribution avoids resource wastage and improves the overall performance of the job.
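Putting those numbers into practice, here is a minimal sketch of the arithmetic and the roughly equivalent submit-time flags (the cluster size is hypothetical, and the flags assume one executor per worker on a YARN-style cluster):

# Back-of-the-envelope sizing, mirroring the formula above.
workers = 5
cores_per_worker = 8
usable_cores_per_worker = cores_per_worker - 1   # leave 1 core for the OS

total_cores = workers * usable_cores_per_worker  # 5 x 7 = 35 cores working in parallel
min_partitions = total_cores * 2                 # lower bound of the 2-3x rule
max_partitions = total_cores * 3

print(total_cores, min_partitions, max_partitions)  # 35 70 105

# Roughly equivalent flags, illustrative only:
# spark-submit --num-executors 5 --executor-cores 7 --executor-memory 14g my_job.py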

Partitions:

Partitions are the smallest data units Spark works with. When you load a dataset, Spark splits it into smaller parts, called partitions, to distribute the processing across workers. Each task in Spark processes one partition at a time.

The number of partitions directly affects performance. Too many partitions can overload the cluster, while too few partitions can leave resources underutilized. The general rule is that the number of partitions should be greater than the total number of available cores, so that all cores have work to do at all times. Typically, you should have at least 2 to 3 times the total number of cores.

Example: With 40 cores (5 workers with 8 cores each), that means roughly 80 to 120 partitions.

You can adjust the number of partitions using functions like repartition(), or by configuring spark.sql.shuffle.partitions for operations that trigger a shuffle, such as joins and aggregations. A good partition adjustment can make all the difference in the execution time of your jobs.
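As a minimal sketch of both adjustments (the path and the numbers are illustrative):

# Align shuffle output with the 2-3x-cores rule before joins and aggregations.
spark.conf.set("spark.sql.shuffle.partitions", "120")

df = spark.read.parquet("/data/events")   # hypothetical path
df = df.repartition(120)                  # full shuffle into 120 evenly sized partitions
print(df.rdd.getNumPartitions())          # confirm the partition count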

By default, Spark reads about 128 MB of data per partition (this value can be changed via spark.sql.files.maxPartitionBytes). Therefore, it is worth sizing the partitions of a file according to its total size.

Example: If a Parquet file is 100 GB, the calculation would be:
Partitions = (100 GB Ă— 1024 MB/GB) / 128 MB = 800

So, ideally, the file would be read into close to 800 partitions. If a previous step left you with more than that, you can use coalesce(800) (or a similar value) to manually decrease the number of partitions; use repartition() if you need to increase it.

This calculation helps prevent overmemory (we’ll talk more about it), ensuring that partitions are balanced against available resources.
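A minimal sketch of that adjustment, assuming a hypothetical path for the 100 GB file:

# Spark splits the 100 GB file into roughly 800 input partitions (~128 MB each) on read.
df = spark.read.parquet("/data/large_file.parquet")  # hypothetical path
print(df.rdd.getNumPartitions())                     # typically close to 800 here

# If an earlier step inflated the partition count, coalesce() shrinks it back
# without a full shuffle; use repartition() instead if you need to increase it.
df = df.coalesce(800)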

Memory:

Memory in Spark is divided between data storage and task execution. It needs to be well-managed to avoid issues like overmemory, which happens when tasks require more memory than the executor has available. This can cause failures or slowdowns, as Spark starts using disk as a backup (spill).

Another important point is the overhead, which is the memory reserved for Spark’s own management (metadata, task information, and other internal operations). By default, Spark reserves about 10% of the executor memory (with a minimum of 384 MB) as overhead, on top of spark.executor.memory, and this can be adjusted with the spark.executor.memoryOverhead configuration.

If your application is consuming too much memory, consider:

  • Increasing the number of partitions to reduce the size of each one.
  • Adjusting the executor memory size (spark.executor.memory).
  • Reviewing the code to avoid unnecessary actions, such as excessive collections to the driver.

This helps minimize problems and ensures efficient use of the cluster’s resources.
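As a minimal sketch of the memory-related settings (values are illustrative; like driver memory, executor memory is usually fixed at submit time because executors are launched with it):

# Illustrative executor memory settings, commonly passed at submit time:
#   spark-submit --executor-memory 14g --conf spark.executor.memoryOverhead=1500m my_job.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-tuning-example")
    .config("spark.executor.memory", "14g")            # heap shared by execution and storage
    .config("spark.executor.memoryOverhead", "1500m")  # ~10% on top for off-heap and internals
    .getOrCreate()
)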

Connecting the Pieces:

Let’s build a practical example connecting all the concepts we’ve discussed.

Example: Loading a 100 GB Parquet file into Spark

Suppose you need to load a 100 GB Parquet file into Spark, and you have a cluster with 5 workers. Let’s follow the configuration steps to optimize processing.

Partitions:
Since Spark processes 128 MB per partition, the calculation for the number of partitions would be:
Partitions = (100 GB Ă— 1024 MB/GB) / 128 MB = 800 partitions

Resource Allocation:
Let’s assume each worker has 4 cores and 14 GB of memory available for execution. If you have 5 workers, the total memory for processing would be 5 × 14 GB = 70 GB of memory available for Spark.

Execution:
The workload will be distributed across executors in the workers. To avoid issues like overmemory (where allocated memory is insufficient), the number of partitions should be appropriate for the number of cores and memory available, ensuring that each task can be executed without overloading any single executor’s memory.
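Pulling the pieces together, here is one way this job could look end to end (paths, column names, and values are illustrative, and the resource flags would normally go on spark-submit, e.g. --num-executors 5 --executor-cores 4 --executor-memory 14g):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("load-100gb-parquet")
    # Keep shuffle output close to the input partition count for this example.
    .config("spark.sql.shuffle.partitions", "800")
    .getOrCreate()
)

# ~800 input partitions (100 GB / 128 MB) spread across 5 workers x 4 cores = 20 parallel tasks;
# each core works through about 40 partitions (800 / 20), one task per partition at a time.
df = spark.read.parquet("/data/100gb_dataset.parquet")   # hypothetical path

result = df.groupBy("some_column").count()               # hypothetical aggregation
result.write.mode("overwrite").parquet("/data/output")   # hypothetical output path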

Cluster Structure

Here’s how the cluster structure and resource management would look:

Spark Cluster
|
├── Worker 1 (Node 1)
│   ├── Executor 1
│   │   ├── Memory: 14 GB
│   │   └── Cores: 4
│   └── System: 2 GB
|
├── Worker 2 (Node 2)
│   ├── Executor 2
│   │   ├── Memory: 14 GB
│   │   └── Cores: 4
│   └── System: 2 GB
|
├── Worker 3 (Node 3)
│   ├── Executor 3
│   │   ├── Memory: 14 GB
│   │   └── Cores: 4
│   └── System: 2 GB
|
├── Worker 4 (Node 4)
│   ├── Executor 4
│   │   ├── Memory: 14 GB
│   │   └── Cores: 4
│   └── System: 2 GB
|
└── Worker 5 (Node 5)
    ├── Executor 5
    │   ├── Memory: 14 GB
    │   └── Cores: 4
    └── System: 2 GB

Details:

  • Partitions: The 100 GB file will be split into 800 partitions, processed in parallel across the 5 workers, with each executor processing its respective share of partitions.
  • Memory: Each executor is allocated 14 GB of memory, with roughly 10% reserved on top of that as overhead.
  • Cores: Each worker has 4 cores, allowing up to 4 tasks to be processed simultaneously in each executor.

Conclusion:

A well-configured Spark job requires understanding the relationships between workers, executors, cores, and partitions. By balancing these factors, you can significantly improve performance and resource utilization, preventing bottlenecks and memory issues. By calculating partitions and cores properly, adjusting memory allocation, and using the appropriate transformations, you can efficiently manage large-scale data processing.