What is Spark?
Spark is an open-source, unified computing engine with a set of libraries for parallel data processing on computer clusters.
It supports widely used programming languages such as:
· Scala
· Python
· Java
· R
It processes data in memory (RAM), which can make it up to 100 times faster than traditional Hadoop MapReduce for some workloads.
The Spark Components
The following represents the Spark components at a high level:
· Low Level API — RDD & Distributed Variables
· Structured API — DataFrames, Datasets and SQL
· Libraries & Ecosystem — Structured Streaming and Advanced Analytics
[Diagram: the Spark stack, with Libraries & Ecosystem layered on top of the Structured API, which sits on top of the Low Level API.]
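To make the layers concrete, here is a minimal sketch (the names and values are made up for illustration) that touches both levels: an RDD with a broadcast variable from the Low Level API, and a DataFrame queried with SQL from the Structured API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-layers").getOrCreate()
sc = spark.sparkContext

# Low Level API: RDDs and distributed (broadcast) variables
rdd = sc.parallelize([("alice", 34), ("bob", 29)])
cities = sc.broadcast({"alice": "NY", "bob": "LA"})
print(rdd.map(lambda kv: (kv[0], cities.value[kv[0]])).collect())

# Structured API: DataFrames and SQL over the same data
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```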
How Spark works
Jobs, Stages & Tasks
[Diagram: a Job is split into Stage 1 (Task 1 and Task 2) and Stage 2 (Task 3).]
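The sketch below (with illustrative names and numbers, not from the original post) shows how a single action submits one job, and how the shuffle introduced by a wide operation splits that job into stages of per-partition tasks.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jobs-stages-tasks").getOrCreate()
sc = spark.sparkContext

# 4 partitions -> up to 4 tasks per stage
rdd = sc.parallelize(range(1000), numSlices=4)

# map and filter are narrow, so they stay inside the same stage
pairs = rdd.map(lambda x: (x % 10, 1)).filter(lambda kv: kv[0] > 2)

# reduceByKey needs a shuffle, so Spark places a stage boundary here
counts = pairs.reduceByKey(lambda a, b: a + b)

# collect() is the action: it submits one job, which Spark splits into
# two stages (before and after the shuffle), each made of per-partition tasks
print(counts.collect())

# The lineage marks the stage boundary through indentation; the Spark UI
# shows the same job/stage/task breakdown graphically.
print(counts.toDebugString().decode("utf-8"))
```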
Driver :
· Heart of the Spark Application : The driver is the central component of a Spark application. It is responsible for coordinating and managing the overall execution of the application.
· Manages Executor Information and State : The driver keeps track of the status and details of all executors, ensuring efficient resource utilization and task allocation.
· Analyzes, Distributes, and Schedules Work : The driver analyzes the job, breaks it down into smaller units of work (stages and tasks), and schedules these tasks across the available executors.
Executor :
· Executes the Code : Executors are responsible for running the tasks assigned to them by the driver. They execute the actual computation on the data.
· Reports Execution Status to the Driver : Executors continuously communicate with the driver, providing updates on the status of task execution and any issues encountered.
Workflow :
· A user submits a job to the driver.
· The driver analyzes the job, divides it into stages and tasks, and assigns these tasks to the executors.
· Executors are JVM processes running on cluster machines. They host cores, which are responsible for executing tasks.
· Each executor can run multiple tasks in parallel, depending on the number of cores available.
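As a hedged sketch (the numbers are illustrative and depend on your cluster manager and quotas), the number of executors, cores, and memory can be requested when the session is built:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-sizing")
    .config("spark.executor.instances", "4")  # request 4 executor JVMs
    .config("spark.executor.cores", "2")      # 2 cores each -> up to 8 tasks in parallel
    .config("spark.executor.memory", "4g")    # heap size per executor
    .getOrCreate()
)

# Each core runs one task at a time and each task processes one partition,
# so with this layout up to 8 partitions can be processed simultaneously.
print(spark.sparkContext.defaultParallelism)
```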
Key Notes :
· Task and Partition Relationship : Each task can process only one partition of data at a time. This ensures efficient and parallel data processing.
· Parallel Execution : Tasks can be executed in parallel, allowing Spark to handle large-scale data processing efficiently.
· Executors as JVM Processes : Executors are JVM processes running on cluster machines. They are responsible for executing tasks and managing resources.
· Core and Task Execution : Each executor hosts multiple cores, and each core can run one task at a time. This design enables high concurrency and optimal resource utilization.
· Partition : To enable parallel processing across executors, Spark divides the data into smaller chunks called partitions. Each partition is processed independently, allowing multiple executors to work simultaneously.
· Transformation : A transformation is an instruction that describes how the data should be modified. Transformations are used to build the logical execution plan in Spark. Examples include "select", "where", "groupBy", etc. (a short sketch follows these notes).
o Types of Transformations :
§ Narrow Transformation :
· In this type, each input partition contributes to at most one output partition.
· Examples: "map", "filter".
· These transformations do not require data shuffling across the cluster.
§ Wide Transformation :
· In this type, a single input partition can contribute to multiple output partitions.
· Examples: "groupBy", "join", "reduceByKey".
· These transformations often involve data shuffling, which can be resource intensive.
· Actions : An action triggers the execution of the logical plan created by transformations. Actions initiate the actual computation and return results to the driver or write data to an output source.
o Types of Actions :
o View Data in Console : Actions like "show()" (or "display()" in Databricks notebooks) allow you to view data in the console.
o Collect Data to Native Language : Actions like "collect()" or "take()" bring data back to the driver program in the native language (e.g., Python, Scala).
o Write Data to Output Data Sources : Actions like "write.csv()", "write.parquet()", or "saveAsTable()" save the processed data to external storage systems.
ℹ️ Lazy Evaluation in Spark : Spark employs lazy evaluation, meaning it delays the execution of transformations until an action is called. This allows Spark to:
o Optimize the execution plan.
o Combine multiple operations for efficiency.
o Use cluster resources effectively.
o By waiting until the last moment to execute, Spark can optimize the entire plan rather than each step in isolation.
· Spark Session :
o Driver Process as Spark Session : The Spark Session is the driver process and serves as the entry point for any Spark application.
o Entry Point for Execution : It is the starting point for interacting with Spark's functionalities, such as reading data, executing transformations, and triggering actions.
o Cluster Execution : The Spark Session instance executes the code on the cluster, coordinating tasks and managing resources.
o One-to-One Relationship : There is a one-to-one relationship between a Spark Application and a Spark Session. For every Spark Application, there is exactly one Spark Session instance.
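The following sketch ties these notes together (the DataFrame contents are made up for illustration): a Spark Session is created, narrow and wide transformations build the plan lazily, and the action at the end triggers the job.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The Spark Session is the entry point of the application
spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.createDataFrame(
    [("books", 12.0), ("books", 5.5), ("toys", 7.25)],
    ["category", "price"],
)

# Transformations only build the logical plan; nothing runs yet
filtered = df.where(F.col("price") > 6.0)           # narrow: no shuffle
totals = filtered.groupBy("category").sum("price")  # wide: requires a shuffle

# explain() prints the optimized plan; the Exchange node marks the shuffle
totals.explain()

# The action finally triggers execution on the cluster
totals.show()
```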
Here are some basic PySpark (Python) commands to help you get started. I will keep posting more notes in this section soon.
The Basic Code ⬇️
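A minimal sketch of everyday PySpark commands; the file path and column names ("sales.csv", "region", "amount") are placeholders, not from the original post.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("basics").getOrCreate()

# Read a CSV file into a DataFrame
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Inspect the schema and a few rows
df.printSchema()
df.show(5)

# select / where / groupBy are transformations
summary = (
    df.select("region", "amount")
      .where(F.col("amount") > 0)
      .groupBy("region")
      .agg(F.sum("amount").alias("total_amount"))
)

# show() and write are actions
summary.show()
summary.write.mode("overwrite").parquet("sales_summary.parquet")

spark.stop()
```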