October 05, 2023

Parallelism (PySpark)

In PySpark, .master() is a method on SparkSession.builder, called when creating a SparkSession, that sets the master URL: the cluster manager the Spark application connects to. It determines where the application runs, either locally or on a cluster. Some common values passed to .master() include:

  • "local": Runs Spark locally on a single thread. Useful for testing and debugging.
  • "local[n]": Runs Spark locally with n threads. This allows for some level of parallelism on a local machine.
  • "spark://host:port": Connects to a standalone Spark cluster at the specified host and port.
  • "yarn": Connects to a Hadoop YARN cluster.
  • "mesos://host:port": Connects to an Apache Mesos cluster. (Mesos support is deprecated as of Spark 3.2.)
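To make the shape of these values concrete, here is a minimal sketch in plain Python that maps a master string to the run mode it selects. The function `describe_master` is a hypothetical helper written for this post, not part of PySpark, and it only recognizes the values listed above; Spark itself accepts other forms as well.

```python
import re


def describe_master(master: str) -> str:
    """Describe where Spark would run for a given master string.

    Hypothetical helper for illustration; covers only the values listed above.
    """
    if master == "local":
        return "local, 1 thread"
    # "local[n]" or "local[*]"
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match:
        n = match.group(1)
        return "local, one thread per core" if n == "*" else f"local, {n} threads"
    if master.startswith("spark://"):
        return "standalone cluster at " + master[len("spark://"):]
    if master == "yarn":
        return "Hadoop YARN cluster"
    if master.startswith("mesos://"):
        return "Mesos cluster at " + master[len("mesos://"):]
    raise ValueError(f"unrecognized master: {master!r}")


for value in ["local", "local[4]", "local[*]", "spark://host:7077", "yarn"]:
    print(f"{value:20} -> {describe_master(value)}")
```

The host and port in "spark://host:7077" are placeholders; substitute the address of your own standalone master.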

.master() is called on SparkSession.builder, before getOrCreate(), when creating a SparkSession object. For instance:

Python

from pyspark.sql import SparkSession

# Build (or reuse) a session that runs locally, one worker thread per core
spark = SparkSession.builder \
    .appName("My PySpark App") \
    .master("local[*]") \
    .getOrCreate()

In this example, `"local[*]"` tells Spark to run locally with as many worker threads as there are logical cores on the machine.
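One way to confirm what a session is actually using is to read the master and default parallelism back from its SparkContext. The sketch below assumes a working PySpark installation; `expected_local_threads` and `check_session` are hypothetical helper names, and the thread count derived from `os.cpu_count()` is only an estimate that may differ from what the JVM reports (for example inside a container).

```python
import os


def expected_local_threads(master: str) -> int:
    """Estimate the worker threads a local master string asks for (hypothetical helper)."""
    if master == "local":
        return 1
    inner = master[len("local["):-1]  # text between the brackets
    # os.cpu_count() can return None, so fall back to 1
    return (os.cpu_count() or 1) if inner == "*" else int(inner)


def check_session(master: str = "local[*]") -> None:
    """Start a session with the given master and print what Spark reports."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("My PySpark App")
        .master(master)
        .getOrCreate()
    )
    sc = spark.sparkContext
    print("master:", sc.master)
    print("default parallelism:", sc.defaultParallelism)
    print("estimated threads:", expected_local_threads(master))
    spark.stop()
```

Calling `check_session("local[2]")` should report a master of local[2]. Note that getOrCreate() returns an already running session if there is one, in which case the requested master is not applied and `sc.master` shows the value the existing session was started with.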

 
