March 09, 2025

Spark Practices

About Spark

What is Spark?

Spark is an open-source unified computing engine with a set of libraries for parallel data processing on a computer cluster.

It supports widely used programming languages such as:

·         Scala

·         Python

·         Java

·         R

It processes data in memory (RAM), which can make it up to 100 times faster than traditional Hadoop MapReduce for many workloads.

 

The Spark Components

The following represents the Spark components at a high level:

·         Low-Level API — RDDs & distributed variables

·         Structured API — DataFrames, Datasets and SQL

·         Libraries & Ecosystem — Structured Streaming and Advanced Analytics
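To make the layers concrete, here is a minimal PySpark sketch that touches the low-level RDD API, the DataFrame API, and SQL; the names, columns, and values are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-layers-demo").getOrCreate()

# Low-level API: an RDD of plain Python tuples, manipulated with functions.
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
print(rdd.filter(lambda pair: pair[1] >= 30).collect())

# Structured API: the same data as a DataFrame with named columns.
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.filter(df.age >= 30).show()

# SQL on top of the same DataFrame.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age >= 30").show()

spark.stop()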


The way that Spark works

Jobs, Stages & Tasks

A job is divided into stages, and each stage is divided into tasks:

Job ➡️ Stage 1 ➡️ Task 1, Task 2

Job ➡️ Stage 2 ➡️ Task 3

 

Driver :

·         Heart of the Spark Application : The driver is the central component of a Spark application. It is responsible for coordinating and managing the overall execution of the application.

·         Manages Executor Information and State : The driver keeps track of the status and details of all executors, ensuring efficient resource utilization and task allocation.

·         Analyzes, Distributes, and Schedules Work : The driver analyzes the job, breaks it down into smaller units of work (stages and tasks), and schedules these tasks across the available executors.

Executor :

·         Executes the Code : Executors are responsible for running the tasks assigned to them by the driver. They execute the actual computation on the data.

·         Reports Execution Status to the Driver : Executors continuously communicate with the driver, providing updates on the status of task execution and any issues encountered.

Workflow :

·         A user submits a job to the driver.

·         The driver analyzes the job, divides it into stages and tasks, and assigns these tasks to the executors.

·         Executors are JVM processes running on cluster machines. They host cores, which are responsible for executing tasks.

·         Each executor can run multiple tasks in parallel, depending on the number of cores available.
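As a rough illustration of how executors and cores are sized, here is a hedged sketch of a SparkSession configuration; the application name and numbers are illustrative only, and on a real cluster these settings are often passed to spark-submit instead.

from pyspark.sql import SparkSession

# Each executor is a JVM process on a worker machine; its cores run tasks in parallel.
# Illustrative values only, not a recommendation.
spark = (
    SparkSession.builder
    .appName("executor-sizing-demo")          # hypothetical application name
    .config("spark.executor.instances", "4")  # four executor JVMs
    .config("spark.executor.cores", "2")      # two cores per executor -> up to 8 tasks at once
    .config("spark.executor.memory", "4g")    # memory available to each executor
    .getOrCreate()
)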

Key Notes :

·         Task and Partition Relationship : Each task can process only one partition of data at a time. This ensures efficient and parallel data processing.

·         Parallel Execution : Tasks can be executed in parallel, allowing Spark to handle large-scale data processing efficiently.

·         Executors as JVM Processes : Executors are JVM processes running on cluster machines. They are responsible for executing tasks and managing resources.

·         Core and Task Execution : Each executor hosts multiple cores, and each core can run one task at a time. This design enables high concurrency and optimal resource utilization.
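The task-per-partition idea can be seen in a small sketch: a DataFrame with a fixed number of partitions, a wide transformation that adds a shuffle stage, and an action that starts the job. The dataset size and names are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

# Eight partitions -> the first stage of the job below runs eight tasks, one per partition.
df = spark.range(0, 1_000_000, numPartitions=8)
print(df.rdd.getNumPartitions())  # 8

# groupBy is a wide transformation: it shuffles data and adds a second stage.
counts = df.groupBy((F.col("id") % 10).alias("bucket")).count()

# The action triggers the job; the Spark UI shows its stages and tasks.
counts.show()

spark.stop()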


 

·         Partition : To enable parallel processing across executors, Spark divides the data into smaller chunks called partitions. Each partition is processed independently, allowing multiple executors to work simultaneously.

·         Transformation : A transformation is an operation or instruction that modifies or transforms the data. Transformations are used to build the logical execution plan in Spark. Examples include "select", "where", "groupBy", etc.

o   Types of Transformations :

§  Narrow Transformation :

·         In this type, each input partition contributes to at most one output partition.

·         Examples: "map", "filter".

·         These transformations do not require data shuffling across the cluster.

§  Wide Transformation :

·         In this type, a single input partition can contribute to multiple output partitions.

·         Examples: "groupBy", "join", "reduceByKey".

·         These transformations often involve data shuffling, which can be resource intensive.

·         Actions : An action triggers the execution of the logical plan created by transformations. Actions initiate the actual computation and return results to the driver or write data to an output source.

o   Types of Actions :

o   View Data in Console : Actions like "show()" or "display()" allow you to view data in the console.

o   Collect Data to Native Language : Actions like "collect()" or "take()" bring data back to the driver program in the native language (e.g., Python, Scala).

o   Write Data to Output Data Sources : Actions like "write.csv()", "write.parquet()", or "saveAsTable()" save the processed data to external storage systems.

ℹ️         Lazy Evaluation in Spark : Spark employs lazy evaluation, meaning it delays the execution of transformations until an action is called. This allows Spark to:

o   Optimize the execution plan.

o   Combine multiple operations for efficiency.

o   Use cluster resources effectively.

By waiting until the last moment to execute, Spark ensures optimal performance and resource utilization (a combined PySpark sketch of these concepts follows this list).

·         Spark Session :

o   Driver Process and Spark Session : The SparkSession is created in the driver process and serves as the entry point for any Spark application.

o   Entry Point for Execution : It is the starting point for interacting with Spark's functionalities, such as reading data, executing transformations, and triggering actions. 

o   Cluster Execution : The Spark Session instance executes the code on the cluster, coordinating tasks and managing resources. 

o   One-to-One Relationship : Typically there is a one-to-one relationship between a Spark application and a SparkSession; each application creates a single SparkSession instance as its entry point.
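Putting these pieces together, here is a minimal sketch of a SparkSession, a narrow and a wide transformation, lazy evaluation, and the three kinds of actions; the column names, values, and output path are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The SparkSession is created in the driver and is the entry point of the application.
spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.createDataFrame(
    [("books", 12.0), ("books", 5.0), ("toys", 7.5)],
    ["category", "price"],
)

# Narrow transformation: filter works partition by partition, no shuffle.
expensive = df.filter(F.col("price") > 6.0)

# Wide transformation: groupBy shuffles data across partitions.
totals = expensive.groupBy("category").agg(F.sum("price").alias("total"))

# Nothing has executed yet: transformations only build the logical plan (lazy evaluation).

totals.show()                                           # action: view data in the console
rows = totals.collect()                                 # action: bring results back to the driver
totals.write.mode("overwrite").parquet("/tmp/totals")   # action: write to an output source (illustrative path)

spark.stop()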

Here are some basic PySpark (Python) commands that will help you get to know the language. I will keep posting more notes in this section soon.

The Basic Code ⬇️
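A minimal sketch of the kind of basic commands meant here, assuming a hypothetical people.csv file with name and age columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("basics").getOrCreate()

# Read a CSV file into a DataFrame (people.csv is a hypothetical example file).
df = spark.read.csv("people.csv", header=True, inferSchema=True)

df.printSchema()                                     # inspect the inferred column types
df.show(5)                                           # preview the first rows
df.select("name", "age").where(df.age > 30).show()   # transformations plus an action
print(df.count())                                    # an action: count the rows

spark.stop()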