March 09, 2025

Spark Practices

About Spark

What is Spark?

Spark is an open-source unified computing engine with a set of libraries for parallel data processing on a computer cluster.

It supports widely used programming languages such as:

·         Scala

·         Python

·         Java

·         R

It processes data in memory (RAM), which can make it up to 100 times faster than traditional Hadoop MapReduce for many workloads.

 

The Spark Components

The following represents the Spark components at a high level:

·         Low-Level API — RDDs & distributed variables

·         Structured API — DataFrames, Datasets and SQL

·         Libraries & Ecosystem — Structured Streaming and Advanced Analytics
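To make the layers concrete, here is a minimal PySpark sketch that touches the low-level RDD API, the DataFrame API, and SQL; the names, columns, and values are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-layers-demo").getOrCreate()

# Low-level API: an RDD of plain Python tuples, manipulated with functions.
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
print(rdd.filter(lambda pair: pair[1] >= 30).collect())

# Structured API: the same data as a DataFrame with named columns.
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.filter(df.age >= 30).show()

# SQL on top of the same DataFrame.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age >= 30").show()

spark.stop()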


The way that Spark works

Jobs, Stages & Tasks

A job is divided into stages, and each stage is divided into tasks:

Job ➡️ Stage 1 ➡️ Task 1, Task 2

Job ➡️ Stage 2 ➡️ Task 3

 

Driver :

·         Heart of the Spark Application : The driver is the central component of a Spark application. It is responsible for coordinating and managing the overall execution of the application.

·         Manages Executor Information and State : The driver keeps track of the status and details of all executors, ensuring efficient resource utilization and task allocation.

·         Analyzes, Distributes, and Schedules Work : The driver analyzes the job, breaks it down into smaller units of work (stages and tasks), and schedules these tasks across the available executors.

Executor :

·         Executes the Code : Executors are responsible for running the tasks assigned to them by the driver. They execute the actual computation on the data.

·         Reports Execution Status to the Driver : Executors continuously communicate with the driver, providing updates on the status of task execution and any issues encountered.

Workflow :

·         A user submits a job to the driver.

·         The driver analyzes the job, divides it into stages and tasks, and assigns these tasks to the executors.

·         Executors are JVM processes running on cluster machines. They host cores, which are responsible for executing tasks.

·         Each executor can run multiple tasks in parallel, depending on the number of cores available.
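As a rough illustration of how executors and cores are sized, here is a hedged sketch of a SparkSession configuration; the application name and numbers are illustrative only, and on a real cluster these settings are often passed to spark-submit instead.

from pyspark.sql import SparkSession

# Each executor is a JVM process on a worker machine; its cores run tasks in parallel.
# Illustrative values only, not a recommendation.
spark = (
    SparkSession.builder
    .appName("executor-sizing-demo")          # hypothetical application name
    .config("spark.executor.instances", "4")  # four executor JVMs
    .config("spark.executor.cores", "2")      # two cores per executor -> up to 8 tasks at once
    .config("spark.executor.memory", "4g")    # memory available to each executor
    .getOrCreate()
)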

Key Notes :

·         Task and Partition Relationship : Each task can process only one partition of data at a time. This ensures efficient and parallel data processing.

·         Parallel Execution : Tasks can be executed in parallel, allowing Spark to handle large-scale data processing efficiently.

·         Executors as JVM Processes : Executors are JVM processes running on cluster machines. They are responsible for executing tasks and managing resources.

·         Core and Task Execution : Each executor hosts multiple cores, and each core can run one task at a time. This design enables high concurrency and optimal resource utilization.
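The task-per-partition idea can be seen in a small sketch: a DataFrame with a fixed number of partitions, a wide transformation that adds a shuffle stage, and an action that starts the job. The dataset size and names are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

# Eight partitions -> the first stage of the job below runs eight tasks, one per partition.
df = spark.range(0, 1_000_000, numPartitions=8)
print(df.rdd.getNumPartitions())  # 8

# groupBy is a wide transformation: it shuffles data and adds a second stage.
counts = df.groupBy((F.col("id") % 10).alias("bucket")).count()

# The action triggers the job; the Spark UI shows its stages and tasks.
counts.show()

spark.stop()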


 

·         Partition : To enable parallel processing across executors, Spark divides the data into smaller chunks called partitions. Each partition is processed independently, allowing multiple executors to work simultaneously.

·         Transformation : A transformation is an operation or instruction that modifies or transforms the data. Transformations are used to build the logical execution plan in Spark. Examples include "select", "where", "groupBy", etc.

o   Types of Transformations :

§  Narrow Transformation :

·         In this type, each input partition contributes to at most one output partition.

·         Examples: "map", "filter".

·         These transformations do not require data shuffling across the cluster.

§  Wide Transformation :

·         In this type, a single input partition can contribute to multiple output partitions.

·         Examples: "groupBy", "join", "reduceByKey".

·         These transformations often involve data shuffling, which can be resource intensive.

·         Actions : An action triggers the execution of the logical plan created by transformations. Actions initiate the actual computation and return results to the driver or write data to an output source.

o   Types of Actions :

o   View Data in Console : Actions like "show()" or "display()" allow you to view data in the console.

o   Collect Data to Native Language : Actions like "collect()" or "take()" bring data back to the driver program in the native language (e.g., Python, Scala).

o   Write Data to Output Data Sources : Actions like "write.csv()", "write.parquet()", or "saveAsTable()" save the processed data to external storage systems.

ℹ️         Lazy Evaluation in Spark : Spark employs lazy evaluation, meaning it delays the execution of transformations until an action is called. This allows Spark to:

o   Optimize the execution plan.

o   Combine multiple operations for efficiency.

o   Use cluster resources effectively.

By waiting until the last moment to execute, Spark ensures optimal performance and resource utilization (a combined PySpark sketch of these concepts follows this list).

·         Spark Session :

o   Driver Process and Spark Session : The SparkSession is created in the driver process and serves as the entry point for any Spark application.

o   Entry Point for Execution : It is the starting point for interacting with Spark's functionalities, such as reading data, executing transformations, and triggering actions. 

o   Cluster Execution : The Spark Session instance executes the code on the cluster, coordinating tasks and managing resources. 

o   One-to-One Relationship : Typically there is a one-to-one relationship between a Spark application and a SparkSession; each application creates a single SparkSession instance as its entry point.
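Putting these pieces together, here is a minimal sketch of a SparkSession, a narrow and a wide transformation, lazy evaluation, and the three kinds of actions; the column names, values, and output path are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The SparkSession is created in the driver and is the entry point of the application.
spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.createDataFrame(
    [("books", 12.0), ("books", 5.0), ("toys", 7.5)],
    ["category", "price"],
)

# Narrow transformation: filter works partition by partition, no shuffle.
expensive = df.filter(F.col("price") > 6.0)

# Wide transformation: groupBy shuffles data across partitions.
totals = expensive.groupBy("category").agg(F.sum("price").alias("total"))

# Nothing has executed yet: transformations only build the logical plan (lazy evaluation).

totals.show()                                           # action: view data in the console
rows = totals.collect()                                 # action: bring results back to the driver
totals.write.mode("overwrite").parquet("/tmp/totals")   # action: write to an output source (illustrative path)

spark.stop()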

Here are some basic PySpark (Python) commands that will help you get to know the language. I will keep posting more notes in this section soon.

The Basic Code ⬇️
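A minimal sketch of the kind of basic commands meant here, assuming a hypothetical people.csv file with name and age columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("basics").getOrCreate()

# Read a CSV file into a DataFrame (people.csv is a hypothetical example file).
df = spark.read.csv("people.csv", header=True, inferSchema=True)

df.printSchema()                                     # inspect the inferred column types
df.show(5)                                           # preview the first rows
df.select("name", "age").where(df.age > 30).show()   # transformations plus an action
print(df.count())                                    # an action: count the rows

spark.stop()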