In this part of the series, we will cover some basic PySpark operations. Here we'll see how to work with RDDs, check how many partitions they have, and adjust the partitioning.
An RDD is the fundamental data structure in Apache Spark. It is an immutable, distributed collection of objects that can be processed in parallel across a cluster. RDDs provide fault tolerance through lineage information, allowing Spark to recompute lost data if a node fails.
In Spark, getNumPartitions() is a method used to retrieve the number of partitions of an RDD (Resilient Distributed Dataset). Partitions are the fundamental units of parallelism in Spark, and understanding the number of partitions is crucial for optimizing performance and resource utilization.
| Parallelism | Performance Tuning | Debugging |
|---|---|---|
| More partitions can lead to better parallelism, allowing Spark to utilize more CPU cores and memory across the cluster. More partitions mean more tasks can run in parallel, but too many partitions can lead to overhead. | Understanding the number of partitions helps in tuning the performance of Spark applications. Too few partitions may lead to underutilization of resources, while too many can cause overhead due to task scheduling. Knowing the number of partitions lets you optimize your Spark job accordingly. | When debugging Spark applications, knowing the number of partitions helps identify issues related to data distribution and processing, and shows whether the partitioning strategy is effective. |
The method is called on an RDD, which is typically created with parallelize() or textFile():

```scala
val numPartitions = rdd.getNumPartitions
```
Here is a simple example of how to use the getNumPartitions method in Spark (Scala):
```scala
// Create an RDD from a list of numbers
val data = List(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data, 3) // Create an RDD with 3 partitions

// Get the number of partitions
val numPartitions = rdd.getNumPartitions

// Print the number of partitions
println(s"Number of partitions: $numPartitions")
```
In this example, we create an RDD with 3 partitions and then use getNumPartitions to retrieve and print the partition count. The equivalent PySpark version looks like this:
```python
from pyspark import SparkContext

# Initialize SparkContext
sc = SparkContext("local", "RDD Example")

# Create an RDD from a list of numbers
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = sc.parallelize(data, 4)  # Create an RDD with 4 partitions

# Get the number of partitions
num_partitions = rdd.getNumPartitions()

# Print the number of partitions
print("Number of partitions:", num_partitions)

# Stop the SparkContext
sc.stop()
```
You can change the number of partitions with repartition() or coalesce(). Continuing from the rdd above (before sc.stop() is called):

```python
rdd_repartitioned = rdd.repartition(8)  # Increase partitions to 8
print("New number of partitions:", rdd_repartitioned.getNumPartitions())
```
A note on DataFrames versus RDDs: DataFrames are the preferred way to work with structured data (like CSV files) in Spark because they offer optimizations and a higher-level API. RDDs are more low-level and are typically used for unstructured data or custom transformations.
By default, the number of partitions is determined by the size of the data and the cluster configuration. You can adjust it using the repartition() or coalesce() methods.
Converting a DataFrame to an RDD can be useful for advanced transformations, but it may lose some of the optimizations provided by the DataFrame API.
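As a small, hedged sketch of that idea (the file path below is a placeholder), you can read a CSV into a DataFrame and still inspect its partitioning through the underlying RDD:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrame Partitions").getOrCreate()

# Read structured data with the DataFrame API (placeholder path)
df = spark.read.csv("data/sample.csv", header=True, inferSchema=True)

# A DataFrame is backed by an RDD, so getNumPartitions() works on df.rdd
print("DataFrame partitions:", df.rdd.getNumPartitions())

spark.stop()
```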
Repartitioning can help balance the workload across the cluster, especially if the data is skewed. It can also improve performance by reducing the amount of data shuffled during wide transformations.
The optimal number of partitions depends on the size of the data, the cluster configuration, and the nature of the transformations being performed. A common rule of thumb is to have 2-4 partitions per CPU core in the cluster.
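A rough sketch of applying that rule of thumb, assuming an active SparkContext sc and the rdd from the earlier example (the factor of 3 is an arbitrary pick from the 2-4 range):

```python
# Estimate a target partition count from the parallelism Spark reports
cores = sc.defaultParallelism      # roughly the cores available to this application
target_partitions = cores * 3      # 2-4 partitions per core is a common heuristic

rdd_tuned = rdd.repartition(target_partitions)
print("Tuned number of partitions:", rdd_tuned.getNumPartitions())
```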
The getNumPartitions() method is a useful tool for understanding the distribution of data in an RDD and optimizing Spark applications for better performance.
Spark is an open-source unified computing engine with a set of libraries for parallel data processing on a computer cluster.
It supports widely used programming languages such as:
· Scala
· Python
· Java
· R
It processes data in memory (RAM), which can make it up to 100 times faster than traditional Hadoop MapReduce for some workloads.
The Spark Components
The following represents the Spark components at a high level:
· Low Level API — RDD & Distributed Variables
· Structured API — DataFrames, Datasets and SQL
· Libraries and Ecosystem, Structured Streaming and Advanced Analytics
[Diagram: the Spark stack, from top to bottom: LIBRARIES & ECOSYSTEM, STRUCTURED API, LOW LEVEL API]
How Spark works
Jobs, Stages & Tasks
A Job is divided into Stages, and each Stage into Tasks:
· Job ➡️ Stage 1 ➡️ Task 1, Task 2
· Job ➡️ Stage 2 ➡️ Task 3
Driver :
· Heart of the Spark Application : The driver is the central component of a Spark application. It is responsible for coordinating and managing the overall execution of the application.
· Manages Executor Information and State : The driver keeps track of the status and details of all executors, ensuring efficient resource utilization and task allocation.
· Analyzes, Distributes, and Schedules Work : The driver analyzes the job, breaks it down into smaller units of work (stages and tasks), and schedules these tasks across the available executors.
Executor :
· Executes the Code : Executors are responsible for running the tasks assigned to them by the driver. They execute the actual computation on the data.
· Reports Execution Status to the Driver : Executors continuously communicate with the driver, providing updates on the status of task execution and any issues encountered.
Workflow :
· A user submits a job to the driver.
· The driver analyzes the job, divides it into stages and tasks, and assigns these tasks to the executors.
· Executors are JVM processes running on cluster machines. They host cores, which are responsible for executing tasks.
· Each executor can run multiple tasks in parallel, depending on the number of cores available.
Key Notes :
· Task and Partition Relationship : Each task can process only one partition of data at a time. This ensures efficient and parallel data processing.
· Parallel Execution : Tasks can be executed in parallel, allowing Spark to handle large-scale data processing efficiently.
· Executors as JVM Processes : Executors are JVM processes running on cluster machines. They are responsible for executing tasks and managing resources.
· Core and Task Execution : Each executor hosts multiple cores, and each core can run one task at a time. This design enables high concurrency and optimal resource utilization.
· Partition : To enable parallel processing across executors, Spark divides the data into smaller chunks called partitions. Each partition is processed independently, allowing multiple executors to work simultaneously.
· Transformation : A transformation is an operation or instruction that modifies or transforms the data. Transformations are used to build the logical execution plan in Spark. Examples include "select", "where", "groupBy", etc.
o Types of Transformations :
§ Narrow Transformation :
· In this type, each input partition contributes to at most one output partition.
· Examples: "map", "filter".
· These transformations do not require data shuffling across the cluster.
§ Wide Transformation :
· In this type, a single input partition can contribute to multiple output partitions.
· Examples: "groupBy", "join", "reduceByKey".
· These transformations often involve data shuffling, which can be resource intensive.
· Actions : An action triggers the execution of the logical plan created by transformations. Actions initiate the actual computation and return results to the driver or write data to an output source.
o Types of Actions :
o View Data in Console : Actions like "show()" or "display()" allow you to view data in the console.
o Collect Data to Native Language : Actions like "collect()" or "take()" bring data back to the driver program in the native language (e.g., Python, Scala).
o Write Data to Output Data Sources : Actions like "write.csv()", "write.parquet()", or "saveAsTable()" save the processed data to external storage systems.
ℹ️ Lazy Evaluation in Spark : Spark employs lazy evaluation, meaning it delays the execution of transformations until an action is called. This allows Spark to:
o Optimize the execution plan.
o Combine multiple operations for efficiency.
o Use cluster resources effectively.
o By waiting until the last moment to execute, Spark ensures optimal performance and resource utilization.
· Spark Session :
o Driver Process as Spark Session : The Spark Session is the driver process and serves as the entry point for any Spark application.
o Entry Point for Execution : It is the starting point for interacting with Spark's functionalities, such as reading data, executing transformations, and triggering actions.
o Cluster Execution : The Spark Session instance executes the code on the cluster, coordinating tasks and managing resources.
o One-to-One Relationship : There is a one-to-one relationship between a Spark Application and a Spark Session. For every Spark Application, there is exactly one Spark Session instance.
Here are some basic PySpark commands to help you get to know the language. I will keep posting more notes in this section soon.
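To tie the ideas above together, here is a minimal, illustrative sketch (the column names and values are made up) showing a Spark Session, a narrow transformation, a wide transformation, and the action that finally triggers execution:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# One Spark Session per application: the entry point on the driver
spark = SparkSession.builder.appName("Basics Demo").getOrCreate()

# A tiny in-memory DataFrame (made-up data)
df = spark.createDataFrame(
    [("A", 10), ("B", 20), ("A", 30)],
    ["category", "amount"],
)

# Narrow transformation: filter works on each partition independently
filtered = df.filter(F.col("amount") > 15)

# Wide transformation: groupBy requires a shuffle across partitions
totals = filtered.groupBy("category").agg(F.sum("amount").alias("total"))

# Nothing has executed yet (lazy evaluation); show() is the action that triggers it
totals.show()

spark.stop()
```

Until show() is called, Spark only builds the logical plan for the filter and groupBy; the shuffle for the wide transformation happens only when the action runs.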
Below is the second set of code, using PySpark/Python, that helps transfer data from all CSV files in a folder to a single spreadsheet, with each CSV file placed into a separate sheet. The output is the same as what I produced earlier with Pandas; here I am using PySpark to achieve the same result.
The reason for performing this task is that, in most cases, we need to
import Excel files as datasets for use in reporting or dashboard
tools. If we import each CSV file individually, we may need to
establish multiple connections. Additionally, relationships (ER) often
don’t support interconnections between these multiple datasets. By
using this script, we streamline the process by first converting the
data types and then exporting the data into a single spreadsheet with
multiple sheets, making it easier to work with the data in one unified
format.
The Sample Data

The sample data consists of four Blinkit CSV files: Orders, Order Items, Customers, and Products.
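The script itself is not reproduced here, so the following is only a minimal sketch of the approach, assuming a local folder of CSVs, placeholder paths, and pandas with openpyxl installed for the Excel step (Spark itself does not write multi-sheet Excel files):

```python
import os

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV to Excel").getOrCreate()

input_folder = "data/blinkit_csvs"          # placeholder folder containing the CSV files
output_file = "data/blinkit_combined.xlsx"  # placeholder output workbook

with pd.ExcelWriter(output_file, engine="openpyxl") as writer:
    for file_name in os.listdir(input_folder):
        if not file_name.endswith(".csv"):
            continue
        # Read each CSV with Spark, inferring column types
        sdf = spark.read.csv(os.path.join(input_folder, file_name),
                             header=True, inferSchema=True)
        # Convert to pandas and write to its own sheet (Excel sheet names max 31 chars)
        sheet_name = os.path.splitext(file_name)[0][:31]
        sdf.toPandas().to_excel(writer, sheet_name=sheet_name, index=False)

spark.stop()
```

Note that toPandas() pulls each table onto the driver, so this approach suits modestly sized CSV files.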