March 12, 2025

Spark Practices Chapter 3

In this part of the series, we will cover some basic PySpark operations.

Here we'll see how to (a quick sketch follows the list):

  • Create a Spark Session
  • Create a DataFrame
  • Read a CSV file
  • Filter on records
  • Calculate Average
  • See list of columns and DataTypes
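
Below is a minimal, hedged sketch of these operations in PySpark. The file name sample.csv and the column names category and price are assumptions for illustration only; adjust them to your own data.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Create a Spark Session (the entry point for DataFrame operations)
    spark = SparkSession.builder.appName("Chapter3Basics").getOrCreate()

    # Create a DataFrame from an in-memory list
    df = spark.createDataFrame(
        [("A", 10.0), ("B", 25.5), ("A", 7.25)],
        ["category", "price"],
    )

    # Read a CSV file (path and options are placeholders)
    csv_df = spark.read.csv("sample.csv", header=True, inferSchema=True)

    # Filter on records
    expensive = df.filter(F.col("price") > 10)

    # Calculate an average
    avg_price = df.groupBy("category").agg(F.avg("price").alias("avg_price"))

    # See the list of columns and data types
    print(df.columns)
    print(df.dtypes)

    avg_price.show()
    spark.stop()
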
March 11, 2025

    Spark Practices Chapter 2

    RDD (Resilient Distributed Dataset)

    What is RDD?

    RDD is the fundamental data structure in Apache Spark. It is an immutable, distributed collection of objects that can be processed in parallel across a cluster. RDDs provide fault tolerance through lineage information, allowing Spark to recompute lost data if a node fails.

    In Spark, getNumPartitions() is a method used to retrieve the number of partitions in an RDD (Resilient Distributed Dataset). Partitions are the fundamental units of parallelism in Spark, and understanding the number of partitions is crucial for optimizing performance and resource utilization.

    What Does getNumPartitions() Do?

    • It returns an integer representing the total number of partitions the RDD is divided into.
    • Partitions determine how the data is distributed across the cluster and how many tasks will be executed in parallel.
    • The number of partitions can therefore impact the performance of Spark applications, as it affects the level of parallelism and the distribution of data across the cluster.

    Why is it Important?

    • Parallelism: More partitions can lead to better parallelism, allowing Spark to utilize more CPU cores and memory across the cluster. More partitions mean more tasks can run in parallel, but too many partitions can introduce scheduling overhead.

    • Performance Tuning: Understanding the number of partitions helps in tuning Spark applications. Too few partitions may leave resources underutilized, while too many can cause overhead due to task scheduling. Knowing the number of partitions lets you optimize a job by:
      • Increasing partitions if the data is skewed or if there are too few tasks.
      • Decreasing partitions if there are too many small tasks, which can cause overhead.

    • Debugging: When debugging Spark applications, knowing the number of partitions helps identify issues related to data distribution and processing, and shows whether the partitioning strategy is effective.

    Key Points about getNumPartitions():

    • Return Type: The method returns an integer value representing the number of partitions in the RDD.
    • Usage: You can call this method on any RDD object to determine how many partitions it has. For example:
      val numPartitions = rdd.getNumPartitions  // parameterless method in the Scala API
    • Importance: Knowing the number of partitions helps in understanding how data is distributed across the cluster. It can impact performance, as having too few partitions may lead to underutilization of resources, while too many partitions can cause overhead due to task scheduling.
    • Default Behavior: When an RDD is created, Spark automatically determines the number of partitions based on the input data source and the cluster configuration. However, you can also specify the number of partitions explicitly when creating an RDD using methods like parallelize() or textFile().

    Example Usage of getNumPartitions()

    Here is a simple example of how to use the getNumPartitions() method in Spark:

            
        // Create an RDD from a list of numbers
        val data = List(1, 2, 3, 4, 5)
        val rdd = sc.parallelize(data, 3) // Create RDD with 3 partitions

        // Get the number of partitions
        val numPartitions = rdd.getNumPartitions

        // Print the number of partitions
        println(s"Number of partitions: $numPartitions")

    In this example, we create an RDD with 3 partitions and then use the getNumPartitions method to retrieve and print the number of partitions. The equivalent in PySpark looks like this:

            
        from pyspark import SparkContext

        # Initialize SparkContext
        sc = SparkContext("local", "RDD Example")

        # Create an RDD from a list of numbers
        data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        rdd = sc.parallelize(data, 4)  # Create an RDD with 4 partitions

        # Get the number of partitions
        num_partitions = rdd.getNumPartitions()

        # Print the number of partitions
        print("Number of partitions:", num_partitions)

        # Stop the SparkContext
        sc.stop()

    The output will be:
    Number of partitions: 4
    This indicates that the RDD has been divided into 4 partitions.

    Key points:

    • Default Partitioning: If you don't specify the number of partitions when creating an RDD, Spark uses the default value, which is typically based on the cluster configuration or the size of the input data.
    • Repartitioning: You can change the number of partitions using methods like repartition() or coalesce(). For example:
                          
        rdd_repartitioned = rdd.repartition(8)  # Increase partitions to 8
        print("New number of partitions:", rdd_repartitioned.getNumPartitions())
    • The number of partitions can impact the performance of Spark applications, as it affects the level of parallelism and resource utilization.
    • Understanding the number of partitions helps in tuning the performance of Spark applications by optimizing resource usage and minimizing overhead.
    • Knowing the number of partitions can also aid in debugging Spark applications by identifying issues related to data distribution and processing.
    • Impact on Shuffling: Wide transformations (e.g., groupByKey, reduceByKey) may change the number of partitions due to shuffling.
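
    As a rough illustration of that last point (a sketch, assuming the SparkContext sc from the earlier example), a wide transformation such as reduceByKey() can be given an explicit partition count, so its result may have a different number of partitions than the parent RDD:

        pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 4)
        print(pairs.getNumPartitions())        # 4

        # reduceByKey is a wide transformation; here we request 2 output partitions
        reduced = pairs.reduceByKey(lambda x, y: x + y, 2)
        print(reduced.getNumPartitions())      # 2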

    When to Use getNumPartitions()?

    • When you want to inspect the partitioning of your RDD.
    • When you need to optimize performance by adjusting the number of partitions.
    • When debugging skewed data distribution or task imbalances.

    DataFrame vs RDD

    DataFrames are the preferred way to work with structured data (like CSV files) in Spark because they offer optimizations and a higher-level API. RDDs are more low-level and are typically used for unstructured data or custom transformations.
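
    As a small sketch of the difference (the file name orders.csv and an active spark session are assumptions): reading a CSV through the DataFrame API gives named, typed columns, while the RDD view of the same file is just raw lines that you must parse yourself.

        # DataFrame: Spark parses the CSV into named, typed columns
        df = spark.read.csv("orders.csv", header=True, inferSchema=True)
        df.printSchema()

        # RDD: the same file as raw text lines, with no schema
        lines = spark.sparkContext.textFile("orders.csv")
        print(lines.take(2))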

    Partitioning

    By default, the number of partitions is determined by the size of the data and the cluster configuration. You can adjust it using repartition() or coalesce() methods.

    Performance

    Converting a DataFrame to an RDD can be useful for advanced transformations, but it may lose some of the optimizations provided by the DataFrame API.
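
    A tiny sketch of such a conversion (assuming the DataFrame df from the sketch above; the column name amount is a placeholder): df.rdd exposes the data as an RDD of Row objects for custom logic, but that part of the pipeline no longer benefits from the DataFrame optimizer.

        # Convert the DataFrame to an RDD of Row objects
        row_rdd = df.rdd
        amounts = row_rdd.map(lambda row: row["amount"])  # placeholder column name
        print(amounts.take(5))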

    Why Repartitioning is Important?

    Repartitioning can help balance the workload across the cluster, especially if the data is skewed. It can also improve performance by reducing the amount of data shuffled during wide transformations.

    • Parallelism: More partitions allow Spark to process data in parallel, improving performance.
    • Resource Utilization: Proper partitioning ensures that all cluster resources (e.g., CPU cores) are utilized efficiently.
    • Avoid Skew: Repartitioning can help avoid data skew, where some partitions are much larger than others.
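
    A brief sketch of the two adjustment methods (assuming an existing DataFrame df; the numbers are arbitrary): repartition() performs a full shuffle and can increase or decrease the partition count, while coalesce() avoids a full shuffle and is normally used only to reduce it.

        print(df.rdd.getNumPartitions())

        # Full shuffle: redistribute the data across 8 partitions
        df_more = df.repartition(8)
        print(df_more.rdd.getNumPartitions())   # 8

        # No full shuffle: merge existing partitions down to 2
        df_fewer = df.coalesce(2)
        print(df_fewer.rdd.getNumPartitions())  # 2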

    How to Choose the Number of Partitions?

    The optimal number of partitions depends on the size of the data, the cluster configuration, and the nature of the transformations being performed. A common rule of thumb is to have 2-4 partitions per CPU core in the cluster.

    • Rule of Thumb: A good starting point is to have 2-4 partitions per CPU core in your cluster.
    • Dataset Size: For larger datasets, increase the number of partitions to ensure each partition is of a manageable size (e.g., 128MB-1GB per partition).
    • Cluster Configuration: Consider the total number of cores and memory available in your cluster.
    • Experimentation: It may require some experimentation to find the optimal number of partitions for your specific workload.
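
    The rule of thumb can be turned into a quick back-of-the-envelope calculation. This is only a sketch with made-up numbers (16 cores, 10 GB of data, a 256 MB target partition size); real workloads still need the experimentation mentioned above.

        total_cores = 16                  # assumed total CPU cores in the cluster
        data_size_mb = 10 * 1024          # assumed dataset size: 10 GB
        target_partition_mb = 256         # aim for roughly 256 MB per partition

        by_cores = (2 * total_cores, 4 * total_cores)    # 2-4 partitions per core
        by_size = data_size_mb // target_partition_mb    # keep partitions a manageable size

        print("Suggested range from cores:", by_cores)     # (32, 64)
        print("Suggested count from data size:", by_size)  # 40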

    Conclusion

    The getNumPartitions() method is a useful tool for understanding the distribution of data in an RDD and optimizing Spark applications for better performance.

    The Sample Code

    March 09, 2025

    Spark Practices

    About Spark

    What is Spark?

    Spark is an open-source unified computing engine with a set of libraries for parallel data processing on a computer cluster.

    It supports widely used programming languages such as:

    • Scala
    • Python
    • Java
    • R

    It processes data in memory (RAM), which can make it up to 100 times faster than traditional Hadoop MapReduce for many workloads.

    The Spark Components

    The following represents the Spark components at a high level:

    • Low Level API — RDDs & Distributed Variables
    • Structured API — DataFrames, Datasets and SQL
    • Libraries and Ecosystem — Structured Streaming and Advanced Analytics

    [Diagram: the Spark stack — Libraries & Ecosystem on top of the Structured API, which builds on the Low Level API]

    The way that Spark works

    Jobs, Stages & Tasks

    A job is broken down into stages, and each stage into tasks:

    Job ➡️ Stage 1 (Task 1, Task 2) ➡️ Stage 2 (Task 3)

    Driver:

    • Heart of the Spark Application: The driver is the central component of a Spark application. It is responsible for coordinating and managing the overall execution of the application.

    • Manages Executor Information and State: The driver keeps track of the status and details of all executors, ensuring efficient resource utilization and task allocation.

    • Analyzes, Distributes, and Schedules Work: The driver analyzes the job, breaks it down into smaller units of work (stages and tasks), and schedules these tasks across the available executors.

    Executor:

    • Executes the Code: Executors are responsible for running the tasks assigned to them by the driver. They execute the actual computation on the data.

    • Reports Execution Status to the Driver: Executors continuously communicate with the driver, providing updates on the status of task execution and any issues encountered.

    Workflow:

    • A user submits a job to the driver.

    • The driver analyzes the job, divides it into stages and tasks, and assigns these tasks to the executors.

    • Executors are JVM processes running on cluster machines. They host cores, which are responsible for executing tasks.

    • Each executor can run multiple tasks in parallel, depending on the number of cores available.

    Key Notes:

    • Task and Partition Relationship: Each task can process only one partition of data at a time. This ensures efficient and parallel data processing.

    • Parallel Execution: Tasks can be executed in parallel, allowing Spark to handle large-scale data processing efficiently.

    • Executors as JVM Processes: Executors are JVM processes running on cluster machines. They are responsible for executing tasks and managing resources.

    • Core and Task Execution: Each executor hosts multiple cores, and each core can run one task at a time. This design enables high concurrency and optimal resource utilization (see the configuration sketch below).
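
    As a sketch of how these pieces are typically configured (the YARN master and the specific numbers are assumptions, not recommendations), the number of executors, cores per executor, and executor memory can be set through standard Spark configuration properties when the session is created:

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("ExecutorSketch")
            .master("yarn")                            # assumed cluster manager
            .config("spark.executor.instances", "4")   # 4 executors (JVM processes)
            .config("spark.executor.cores", "2")       # 2 cores per executor
            .config("spark.executor.memory", "4g")     # memory per executor
            .getOrCreate()
        )

    With this setup, at most 4 × 2 = 8 tasks run at the same time, and each task works on exactly one partition.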


     

    • Partition: To enable parallel processing across executors, Spark divides the data into smaller chunks called partitions. Each partition is processed independently, allowing multiple executors to work simultaneously.

    • Transformation: A transformation is an operation or instruction that modifies or transforms the data. Transformations are used to build the logical execution plan in Spark. Examples include "select", "where", "groupBy", etc.

      Types of Transformations:

      • Narrow Transformation: Each input partition contributes to at most one output partition. Examples: "map", "filter". These transformations do not require data shuffling across the cluster.

      • Wide Transformation: A single input partition can contribute to multiple output partitions. Examples: "groupBy", "join", "reduceByKey". These transformations often involve data shuffling, which can be resource intensive.

    • Actions: An action triggers the execution of the logical plan created by transformations. Actions initiate the actual computation and return results to the driver or write data to an output source.

      Types of Actions:

      • View Data in the Console: Actions like "show()" or "display()" allow you to view data in the console.

      • Collect Data to the Native Language: Actions like "collect()" or "take()" bring data back to the driver program in the native language (e.g., Python, Scala).

      • Write Data to Output Data Sources: Actions like "write.csv()", "write.parquet()", or "saveAsTable()" save the processed data to external storage systems.

    ℹ️ Lazy Evaluation in Spark: Spark employs lazy evaluation, meaning it delays the execution of transformations until an action is called. This allows Spark to:

      • Optimize the execution plan.

      • Combine multiple operations for efficiency.

      • Use cluster resources effectively.

    By waiting until the last moment to execute, Spark ensures optimal performance and resource utilization. A short sketch of transformations, actions, and lazy evaluation follows below.
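
    The sketch below (assuming a SparkContext sc is available) ties these ideas together: the map and filter calls are narrow transformations, reduceByKey is a wide one, and nothing actually runs on the cluster until the collect() action at the end.

        rdd = sc.parallelize(range(1, 11), 4)

        # Transformations: nothing executes yet, Spark only records the plan
        squared = rdd.map(lambda x: x * x)                # narrow
        evens = squared.filter(lambda x: x % 2 == 0)      # narrow
        pairs = evens.map(lambda x: (x % 3, x))           # narrow
        summed = pairs.reduceByKey(lambda a, b: a + b)    # wide (shuffle)

        # Action: this triggers the whole chain above to run
        print(summed.collect())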

    • Spark Session:

      • Driver Process as Spark Session: The Spark Session is the driver process and serves as the entry point for any Spark application.

      • Entry Point for Execution: It is the starting point for interacting with Spark's functionalities, such as reading data, executing transformations, and triggering actions.

      • Cluster Execution: The Spark Session instance executes the code on the cluster, coordinating tasks and managing resources.

      • One-to-One Relationship: There is a one-to-one relationship between a Spark application and a Spark Session. For every Spark application, there is exactly one Spark Session instance.
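
    A minimal sketch of creating (or reusing) the Spark Session in PySpark; getOrCreate() returns the already active session if one exists, which reflects the one-to-one relationship described above.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("MyApp").getOrCreate()

        # Calling getOrCreate() again reuses the active session
        spark_again = SparkSession.builder.getOrCreate()
        print(spark is spark_again)   # typically True: the active session is returned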

    Here are some basic PySpark (Python) commands to help you get to know the language. I will keep posting more notes in this section soon.

    The Basic Code ⬇️

    March 06, 2025

    Sample Charts based on last Import to Spreadsheet

    Table Structure

    The Table Structure:

    Based on the data from the last import, we are using only 4 tables here. The ER diagram is:

    The Sample Data

    • Blinkit Orders
    • Blinkit Order Items
    • Blinkit Customers
    • Blinkit Products

    The Dashboard

    CSVs (all in a folder) to Spreadsheet

    CSV to Spreadsheet

    Below is the Pandas/Python code that transfers data from all CSV files in a folder into a single spreadsheet, with each CSV file placed in a separate sheet.

    The reason for performing this task is that, in most cases, we need to import Excel files as datasets for reporting or dashboard tools. If we import each CSV file individually, we may need to establish multiple connections, and entity relationships (ER) often do not work well across these multiple datasets. This script streamlines the process: it first converts the data types and then exports the data into a single spreadsheet with multiple sheets, making it easier to work with the data in one unified format.


    The script
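
    As a rough, illustrative sketch of this kind of script (not the original; the folder path csv_folder and the output name output.xlsx are placeholders, and it assumes pandas plus an Excel writer engine such as openpyxl), it could look roughly like this:

        import glob
        import os

        import pandas as pd

        csv_folder = "csv_folder"       # placeholder: folder containing the CSV files
        output_path = "output.xlsx"     # placeholder: the single spreadsheet to create

        with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
            for csv_file in glob.glob(os.path.join(csv_folder, "*.csv")):
                df = pd.read_csv(csv_file)

                # Let pandas infer more specific dtypes before exporting
                df = df.convert_dtypes()

                # One sheet per CSV; Excel sheet names are limited to 31 characters
                sheet_name = os.path.splitext(os.path.basename(csv_file))[0][:31]
                df.to_excel(writer, sheet_name=sheet_name, index=False)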