PySpark Cheatsheet: The Ultimate Quick Reference for Big Data & Machine Learning

If you are working with big data, distributed computing, or data pipelines, then Apache Spark is likely already on your radar. And when it comes to using Spark with Python, PySpark is the go-to library.

However, with so many functions and modules (SQL, DataFrames, MLlib, Streaming), remembering everything can be overwhelming. That’s why this Cheatsheet is designed to be a quick reference guide – packed with essential functions, examples, and explanations for beginners and professionals.

What is PySpark?

PySpark is the Python API for Apache Spark, a fast, general-purpose distributed computing framework. It allows you to:

  • Process massive datasets across multiple nodes
  • Perform SQL queries on big data
  • Train scalable machine learning models with MLlib
  • Handle structured, semi-structured, and streaming data

1. Setup & Initialization

from pyspark.sql import SparkSession

# Start Spark Session
spark = SparkSession.builder \
    .appName("PySpark Cheatsheet") \
    .getOrCreate()

# Check version
print(spark.version)

Every PySpark program starts with creating a SparkSession. This represents your connection to the Spark cluster.

2. Creating DataFrames

from pyspark.sql import Row

# From Python list
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

# From RDD
rdd = spark.sparkContext.parallelize([Row(Name="David", Age=40)])
df_rdd = spark.createDataFrame(rdd)

DataFrames are the primary abstraction in PySpark for handling structured data. They are similar to pandas DataFrames, but distributed across a cluster.

3. Basic DataFrame Operations

df.show()
df.printSchema()
df.select("Name").show()
df.filter(df.Age > 28).show()
df.groupBy("Age").count().show()

These operations let you view, filter, and group data. PySpark transformations are lazy, meaning they are only executed when an action (like .show()) is called.
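
For example, chaining transformations builds a plan but runs nothing; only the action triggers a job. A minimal sketch using the df created above:

adults = df.filter(df.Age > 28).select("Name")   # transformation only, no job runs yet
adults.count()                                   # action: Spark executes the plan now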

4. SQL with PySpark

df.createOrReplaceTempView("people")

result = spark.sql("SELECT Name, Age FROM people WHERE Age > 28")
result.show()

Spark SQL lets you run SQL queries directly on DataFrames, which is very convenient for people coming from an SQL background.

5. Reading & Writing Data

# CSV
df_csv = spark.read.csv("data.csv", header=True, inferSchema=True)

# Parquet
df_parquet = spark.read.parquet("data.parquet")

# Write
df.write.csv("output.csv", header=True)
df.write.parquet("output.parquet")

PySpark supports multiple file formats: CSV, JSON, Parquet, ORC, and even databases via JDBC. Parquet is preferred for big data because it is columnar and compressed.
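
Other sources follow the same pattern. Here is a sketch of reading JSON and a JDBC table; the file path, connection URL, table name, and credentials are placeholders, and the JDBC driver JAR must be available:

# JSON
df_json = spark.read.json("data.json")

# Database table over JDBC
df_jdbc = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/mydb") \
    .option("dbtable", "people") \
    .option("user", "username") \
    .option("password", "password") \
    .load()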

6. Common Transformations

from pyspark.sql.functions import col, when

df2 = df.withColumn("IsAdult", when(col("Age") >= 18, True).otherwise(False))
df2 = df2.withColumnRenamed("Age", "Years")
df2 = df2.drop("IsAdult")

Transformations create new DataFrames from existing ones. They are useful for data cleaning, feature engineering, and ETL processes.

7. Aggregations

from pyspark.sql.functions import avg, max, min

df.agg(avg("Age").alias("AverageAge"),
       max("Age").alias("MaxAge"),
       min("Age").alias("MinAge")).show()

Aggregations summarize data across the whole DataFrame or per group (see the grouped example below). They are useful for analytics, reports, and dashboards.
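
Grouped aggregations use the same functions after groupBy(). A quick sketch on the sample df:

from pyspark.sql.functions import count, avg

df.groupBy("Name").agg(count("*").alias("Rows"),
                       avg("Age").alias("AverageAge")).show()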

8. Handling Missing Data

df.na.drop().show()          # Drop missing values
df.na.fill({"Age": 0}).show() # Replace nulls with 0

Missing values are common in real-world datasets. PySpark provides built-in methods to drop or fill them.

9. Joins

data2 = [("Alice", "F"), ("Bob", "M")]
df2 = spark.createDataFrame(data2, ["Name", "Gender"])

df.join(df2, on="Name", how="inner").show()

Joins let you combine data from multiple sources, just like in SQL (inner, left, right, outer).
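
The other join types use the same API; a quick sketch with the df and df2 defined above:

df.join(df2, on="Name", how="left").show()    # keep every row from df
df.join(df2, on="Name", how="outer").show()   # keep rows from both sides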

10. MLlib – Machine Learning

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import col, when

# Feature engineering: assemble numeric columns into a single feature vector
assembler = VectorAssembler(inputCols=["Age"], outputCol="features")

# Logistic regression needs a categorical target, so derive a binary label from Age
df_labeled = df.withColumn("label", when(col("Age") >= 30, 1.0).otherwise(0.0))
df_ml = assembler.transform(df_labeled).select("features", "label")

# Train model
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(df_ml)

# Predictions
predictions = model.transform(df_ml)
predictions.show()

MLlib is Spark’s machine learning library. It supports classification, regression, clustering and recommendation at scale.
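
To gauge model quality you can use MLlib's evaluators. A minimal sketch, assuming the label and rawPrediction columns produced by the example above:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="label",
                                          rawPredictionCol="rawPrediction")
print("AUC:", evaluator.evaluate(predictions))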

11. Structured Streaming (Real-Time Data)

stream_df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

query = stream_df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()

Structured Streaming lets you process real-time data streams like logs, Kafka messages, or sensor data.
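
Reading from Kafka follows the same pattern. A minimal sketch; it assumes the Kafka connector package (spark-sql-kafka) is on the classpath, and the broker address and topic name are placeholders:

kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

# Kafka delivers keys/values as bytes, so cast the payload to a string
messages = kafka_df.selectExpr("CAST(value AS STRING) AS message")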

12. Saving & Loading Models

# Save model
model.save("logistic_model")

# Load model
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("logistic_model")

Like TensorFlow or PyTorch models, MLlib models can be saved and reloaded for production use.

13. Window Functions

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank

windowSpec = Window.partitionBy("Gender").orderBy("Age")

df.withColumn("row_num", row_number().over(windowSpec)) \
  .withColumn("rank", rank().over(windowSpec)) \
  .withColumn("dense_rank", dense_rank().over(windowSpec)) \
  .show()

Window functions let you compute values across a window of rows (e.g., running totals, rankings, moving averages). They’re crucial for analytics.
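
A running total is just a different frame over the same kind of window. A minimal sketch, reusing the Gender and Age columns from above:

from pyspark.sql.window import Window
from pyspark.sql.functions import sum as sum_

runningSpec = Window.partitionBy("Gender").orderBy("Age") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("RunningAgeTotal", sum_("Age").over(runningSpec)).show()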

14. User-Defined Functions (UDFs)

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def age_category(age):
    return "Adult" if age >= 18 else "Minor"

udf_age_category = udf(age_category, StringType())

df.withColumn("Category", udf_age_category(df.Age)).show()

UDFs allow you to apply custom Python functions to PySpark DataFrames. Useful when built-in functions don’t cover your use case.

15. Pandas UDFs (Vectorized UDFs)

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def squared(v: pd.Series) -> pd.Series:
    # Cast to float so the result matches the declared "double" return type
    return v.astype("float64") ** 2

df.withColumn("AgeSquared", squared(df.Age)).show()

Pandas UDFs are faster than normal UDFs because they operate on batches of data using Apache Arrow. They’re great for performance-sensitive applications.

16. Partitioning & Bucketing for Performance

df.write.partitionBy("Gender").parquet("partitioned_output")
df.write.bucketBy(4, "Age").saveAsTable("bucketed_table")

Partitioning and bucketing are data layout techniques that improve query performance, especially for large datasets in data lakes.
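
When you read a partitioned dataset back and filter on the partition column, Spark only scans the matching directories. A quick sketch using the output written above:

from pyspark.sql.functions import col

# Filtering on the partition column prunes the other directories
spark.read.parquet("partitioned_output").filter(col("Gender") == "F").show()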

17. Broadcasting (Efficient Joins)

from pyspark.sql.functions import broadcast

small_df = spark.createDataFrame([("M", "Male"), ("F", "Female")], ["Gender", "GenderFull"])

df.join(broadcast(small_df), "Gender").show()

When joining a small table with a huge table, broadcasting sends the small one to all worker nodes. This avoids expensive shuffles.

18. Caching & Persistence

df.cache()
df.count()   # Triggers caching
df.unpersist()

Caching keeps a DataFrame in memory for faster repeated access, especially useful in iterative algorithms like ML training.
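
If the data may not fit in memory, persist() lets you choose a storage level explicitly. A minimal sketch:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk when memory is tight
df.count()                                # action materializes the persisted data
df.unpersist()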

19. Checkpointing

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
df_checkpointed = df.checkpoint()  # returns a new DataFrame with a truncated lineage

Checkpointing saves intermediate results to disk. It’s used in long-running jobs to avoid recomputation and break lineage chains.

20. Accumulators & Broadcast Variables

acc = spark.sparkContext.accumulator(0)

def count_even(x):
    global acc
    if x % 2 == 0:
        acc += 1

rdd = spark.sparkContext.parallelize(range(10))
rdd.foreach(count_even)
print("Even numbers:", acc.value)


Accumulators help collect global counters across nodes (e.g., counting events). Broadcast variables send read-only shared data to all workers.
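
A broadcast variable works like this; a minimal sketch using a small, hypothetical lookup dictionary:

lookup = spark.sparkContext.broadcast({"M": "Male", "F": "Female"})

codes = spark.sparkContext.parallelize(["M", "F", "M"])
print(codes.map(lambda c: lookup.value[c]).collect())  # workers read lookup.value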

21. Graph Processing with GraphFrames

from graphframes import GraphFrame

vertices = spark.createDataFrame([("A", "Alice"), ("B", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("A", "B", "friend")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.vertices.show()
g.edges.show()
g.inDegrees.show()

GraphFrames (an add-on) lets you analyze networks and relationships — useful for social networks, fraud detection, and recommendation systems.

22. Advanced MLlib: Pipelines

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.sql.functions import col, when

# Derive a binary label so the classifier has a valid target
df_labeled = df.withColumn("label", when(col("Age") >= 30, 1.0).otherwise(0.0))

indexer = StringIndexer(inputCol="Name", outputCol="NameIndex")
assembler = VectorAssembler(inputCols=["Age", "NameIndex"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[indexer, assembler, rf])
model = pipeline.fit(df_labeled)
predictions = model.transform(df_labeled)

Pipelines chain multiple stages (feature engineering + modeling). They’re best practice for scalable machine learning workflows.
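
A fitted pipeline can be saved and reloaded as a single unit. A minimal sketch; the path is a placeholder:

from pyspark.ml import PipelineModel

model.write().overwrite().save("rf_pipeline_model")
loaded_pipeline = PipelineModel.load("rf_pipeline_model")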

23. Structured Streaming with Aggregations

from pyspark.sql.functions import window

stream_df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .option("includeTimestamp", True) \
    .load()

agg = stream_df.groupBy(window("timestamp", "10 minutes")).count()

query = agg.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

You can perform real-time analytics over time windows (here a 10-minute tumbling window; a sliding-window variant is shown below) in Structured Streaming, which is useful for log monitoring, fraud detection, and IoT data.
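
For sliding windows and late data, add a slide interval and a watermark. A minimal sketch, assuming the stream has an event-time column named timestamp as above:

from pyspark.sql.functions import window

windowed = stream_df \
    .withWatermark("timestamp", "20 minutes") \
    .groupBy(window("timestamp", "10 minutes", "5 minutes")) \
    .count()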

24. Delta Lake

df.write.format("delta").save("/delta/events")

delta_df = spark.read.format("delta").load("/delta/events")

Delta Lake (a separate library that plugs into Spark) brings ACID transactions, schema enforcement, and time travel to data lakes, which is essential for production-grade big data pipelines.
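
Time travel lets you read an earlier version of a Delta table. A minimal sketch, assuming the Delta Lake package is configured and reusing the path above:

old_df = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/delta/events")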

Conclusion

This PySpark Cheatsheet is your quick reference for working with big data and distributed ML. From DataFrame basics and SQL queries to MLlib and streaming, this guide has everything you need to accelerate your projects.

Keep it bookmarked and use it whenever you’re building data pipelines, analytics dashboards or large-scale ML models.