If you are working with big data, distributed computing, or data pipelines, then Apache Spark is likely already on your radar. And when it comes to using Spark with Python, PySpark is the go-to library.

However, with so many functions and modules (SQL, DataFrames, MLlib, Streaming), remembering everything can be overwhelming. That's why this cheatsheet is designed as a quick reference guide, packed with essential functions, examples, and explanations for beginners and professionals alike.
What is PySpark?
PySpark is the Python API for Apache Spark, a fast, general-purpose distributed computing framework. It allows you to:
- Process massive datasets across multiple nodes
- Perform SQL queries on big data
- Train scalable machine learning models with MLlib
- Handle structured, semi-structured, and streaming data
1. Setup & Initialization
from pyspark.sql import SparkSession
# Start Spark Session
spark = SparkSession.builder \
    .appName("PySpark Cheatsheet") \
    .getOrCreate()
# Check version
print(spark.version)
Every PySpark program starts with creating a SparkSession. This represents your connection to the Spark cluster.
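If you are running Spark locally or want to tune the session, the builder also accepts a master URL and config options. A minimal sketch, assuming a local run: the local[*] master and 4g memory value are purely illustrative, and options must be set before the session is first created.
from pyspark.sql import SparkSession
# Illustrative local configuration - set these before the first getOrCreate() call
spark = SparkSession.builder \
    .appName("PySpark Cheatsheet") \
    .master("local[*]") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()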
2. Creating DataFrames
from pyspark.sql import Row
# From Python list
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])
# From RDD
rdd = spark.sparkContext.parallelize([Row(Name="David", Age=40)])
df_rdd = spark.createDataFrame(rdd)
DataFrames are the primary abstraction in PySpark for handling structured data: similar to pandas DataFrames, but distributed across a cluster.
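If you need precise control over column types, you can also pass an explicit schema instead of letting Spark infer one; a minimal sketch below reuses the same sample data.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()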
3. Basic DataFrame Operations
df.show()
df.printSchema()
df.select("Name").show()
df.filter(df.Age > 28).show()
df.groupBy("Age").count().show()
These operations let you view, filter, and group data. PySpark is lazy: transformations only build up a query plan and are executed only when an action (like .show() or .count()) is called.
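A quick way to see lazy evaluation in action: the filter below returns immediately because no job runs until the count() action.
adults = df.filter(df.Age > 28)  # transformation: nothing is executed yet
print(adults.count())            # action: Spark now runs the plan and returns 2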
4. SQL with PySpark
df.createOrReplaceTempView("people")
result = spark.sql("SELECT Name, Age FROM people WHERE Age > 28")
result.show()
Spark SQL lets you run SQL queries directly on DataFrames once they are registered as temp views, which is very convenient for people coming from a SQL background.
5. Reading & Writing Data
# CSV
df_csv = spark.read.csv("data.csv", header=True, inferSchema=True)
# Parquet
df_parquet = spark.read.parquet("data.parquet")
# Write
df.write.csv("output.csv", header=True)
df.write.parquet("output.parquet")
PySpark supports multiple file formats: CSV, JSON, Parquet, ORC, and even databases over JDBC. Parquet is usually preferred for big data because it is columnar and compressed.
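The other readers mentioned above follow the same pattern; here is a hedged sketch where every path, URL, table name, and credential is a placeholder you would replace with your own.
# JSON and ORC (paths are placeholders)
df_json = spark.read.json("data.json")
df_orc = spark.read.orc("data.orc")
# JDBC read - URL, table, and credentials are illustrative assumptions
df_jdbc = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/mydb") \
    .option("dbtable", "public.people") \
    .option("user", "spark_user") \
    .option("password", "secret") \
    .load()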
6. Common Transformations
from pyspark.sql.functions import col, when
df2 = df.withColumn("IsAdult", when(col("Age") >= 18, True).otherwise(False))
df2 = df2.withColumnRenamed("Age", "Years")
df2 = df2.drop("IsAdult")
Transformations create new DataFrames from existing ones. They are useful for data cleaning, feature engineering, and ETL processes.
7. Aggregations
from pyspark.sql.functions import avg, max, min
df.agg(avg("Age").alias("AverageAge"),
       max("Age").alias("MaxAge"),
       min("Age").alias("MinAge")).show()
Aggregations summarize data across groups – useful for analytics, reports and dashboards.
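Aggregations are usually combined with groupBy(); a small sketch on the sample DataFrame, using a derived AgeGroup column so the grouping is meaningful.
from pyspark.sql.functions import col, when, count, avg
df.withColumn("AgeGroup", when(col("Age") >= 30, "30+").otherwise("under 30")) \
    .groupBy("AgeGroup") \
    .agg(count("*").alias("People"), avg("Age").alias("AvgAge")) \
    .show()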
8. Handling Missing Data
df.na.drop().show() # Drop missing values
df.na.fill({"Age": 0}).show() # Replace nulls with 0
Missing values are common in real-world datasets. PySpark provides built-in methods to drop or fill them.
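Both methods take finer-grained options, for example dropping rows only when specific columns are null, or filling different columns with different defaults:
df.na.drop(subset=["Age"]).show()                 # drop rows only where Age is null
df.na.fill({"Age": 0, "Name": "Unknown"}).show()  # per-column default values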
9. Joins
data2 = [("Alice", "F"), ("Bob", "M")]
df2 = spark.createDataFrame(data2, ["Name", "Gender"])
df_joined = df.join(df2, on="Name", how="inner")
df_joined.show()
Joins let you combine data from multiple sources, just like in SQL (inner, left, right, outer). The result is kept as df_joined because its Gender column is reused in later sections.
10. MLlib – Machine Learning
from pyspark.sql.functions import col, when
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
# Feature engineering: create a binary label (1.0 if Age >= 30, else 0.0) and assemble features
df_labeled = df.withColumn("label", when(col("Age") >= 30, 1.0).otherwise(0.0))
assembler = VectorAssembler(inputCols=["Age"], outputCol="features")
df_ml = assembler.transform(df_labeled).select("features", "label")
# Train model
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(df_ml)
# Predictions
predictions = model.transform(df_ml)
predictions.show()
MLlib is Spark’s machine learning library. It supports classification, regression, clustering and recommendation at scale.
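In practice you would hold out a test set and evaluate the fitted model; the sketch below continues from the example above (the 80/20 split and seed are arbitrary, and on a three-row sample the split is only illustrative).
from pyspark.ml.evaluation import BinaryClassificationEvaluator
train, test = df_ml.randomSplit([0.8, 0.2], seed=42)
model = LogisticRegression(labelCol="label", featuresCol="features").fit(train)
evaluator = BinaryClassificationEvaluator(labelCol="label")  # default metric: areaUnderROC
print(evaluator.evaluate(model.transform(test)))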
11. Spark Streaming (Real-Time Data)
stream_df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()
query = stream_df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
query.awaitTermination()
Structured Streaming (the readStream/writeStream API used here) lets you process real-time data streams like logs, Kafka messages, or sensor data.
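Since Kafka is the most common production source, here is a hedged sketch of a Kafka reader; the bootstrap servers and topic name are placeholders, and it assumes the spark-sql-kafka connector package is on the classpath.
# Requires the spark-sql-kafka-0-10 connector; servers and topic are placeholders
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()
messages = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")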
12. Saving & Loading Models
# Save model
model.save("logistic_model")
# Load model
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("logistic_model")
Like TensorFlow or PyTorch models, MLlib models can be saved and reloaded for production use.
13. Window Functions
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank
# df_joined (from the Joins section) provides the Gender column used for partitioning
windowSpec = Window.partitionBy("Gender").orderBy("Age")
df_joined.withColumn("row_num", row_number().over(windowSpec)) \
    .withColumn("rank", rank().over(windowSpec)) \
    .withColumn("dense_rank", dense_rank().over(windowSpec)) \
    .show()
Window functions let you compute values across a window of rows (e.g., running totals, rankings, moving averages). They’re crucial for analytics.
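For example, a running total needs a frame specification on the window; a small sketch over the same df_joined:
from pyspark.sql.functions import sum as spark_sum
running = Window.partitionBy("Gender").orderBy("Age") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_joined.withColumn("RunningAgeTotal", spark_sum("Age").over(running)).show()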
14. User-Defined Functions (UDFs)
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def age_category(age):
    return "Adult" if age >= 18 else "Minor"
udf_age_category = udf(age_category, StringType())
df.withColumn("Category", udf_age_category(df.Age)).show()
UDFs allow you to apply custom Python functions to PySpark DataFrames. Useful when built-in functions don’t cover your use case.
15. Pandas UDFs (Vectorized UDFs)
import pandas as pd
from pyspark.sql.functions import pandas_udf
@pandas_udf("double")
def squared(v: pd.Series) -> pd.Series:
    return v ** 2
df.withColumn("AgeSquared", squared(df.Age)).show()
Pandas UDFs are faster than normal UDFs because they operate on batches of data using Apache Arrow. They’re great for performance-sensitive applications.
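A related tool is applyInPandas, which runs a pandas function on each group (Spark 3+); the sketch below centres Age within each Gender group of df_joined, with the output schema spelled out explicitly.
import pandas as pd
def center_age(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["AgeCentered"] = pdf["Age"] - pdf["Age"].mean()
    return pdf
df_joined.groupBy("Gender").applyInPandas(
    center_age, schema="Name string, Age long, Gender string, AgeCentered double"
).show()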
16. Partitioning & Bucketing for Performance
df_joined.write.partitionBy("Gender").parquet("partitioned_output")
df.write.bucketBy(4, "Age").saveAsTable("bucketed_table")
Partitioning and bucketing are data layout techniques that improve query performance, especially for large datasets in data lakes.
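The payoff comes at read time: filtering on the partition column lets Spark skip entire directories (partition pruning).
# Only the Gender=F directory is scanned, thanks to partition pruning
spark.read.parquet("partitioned_output").filter("Gender = 'F'").show()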
17. Broadcasting (Efficient Joins)
from pyspark.sql.functions import broadcast
small_df = spark.createDataFrame([("M", "Male"), ("F", "Female")], ["Gender", "GenderFull"])
df_joined.join(broadcast(small_df), "Gender").show()
When joining a small table with a huge table, broadcasting sends the small one to all worker nodes. This avoids expensive shuffles.
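Spark also broadcasts automatically when a table is below a size threshold; you can inspect or adjust it via configuration (the 50 MB value below is just an example).
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))             # default is about 10 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)  # example: raise to 50 MB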
18. Caching & Persistence
df.cache()
df.count() # Triggers caching
df.unpersist()
Caching keeps a DataFrame in memory for faster repeated access, especially useful in iterative algorithms like ML training.
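cache() uses the default storage level; persist() lets you choose one explicitly, for example keeping the materialized data on disk only.
from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY)  # store the materialized data on disk instead of memory
df.count()                          # action that materializes the persisted data
df.unpersist()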
19. Checkpointing
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
df_checkpointed = df.checkpoint()  # checkpoint() returns a new DataFrame with a truncated lineage
Checkpointing saves intermediate results to disk. It’s used in long-running jobs to avoid recomputation and break lineage chains.
20. Accumulators & Broadcast Variables
acc = spark.sparkContext.accumulator(0)
def count_even(x):
    global acc
    if x % 2 == 0:
        acc += 1
rdd = spark.sparkContext.parallelize(range(10))
rdd.foreach(count_even)
print("Even numbers:", acc.value)
Accumulators help collect global counters across nodes (e.g., counting events). Broadcast variables send read-only shared data to all workers.
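The snippet above only shows an accumulator, so here is a minimal sketch of a broadcast variable as well: a small lookup dictionary shared read-only with every worker.
lookup = spark.sparkContext.broadcast({"F": "Female", "M": "Male"})
codes = spark.sparkContext.parallelize(["F", "M", "F"])
print(codes.map(lambda c: lookup.value[c]).collect())  # ['Female', 'Male', 'Female']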
21. Graph Processing with GraphFrames
from graphframes import GraphFrame
vertices = spark.createDataFrame([("A", "Alice"), ("B", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("A", "B", "friend")], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
g.vertices.show()
g.edges.show()
g.inDegrees.show()
GraphFrames (an add-on) lets you analyze networks and relationships — useful for social networks, fraud detection, and recommendation systems.
22. Advanced MLlib: Pipelines
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.sql.functions import col, when
# Binary label (1.0 if Age >= 30) so the classifier has a valid target
df_labeled = df.withColumn("label", when(col("Age") >= 30, 1.0).otherwise(0.0))
indexer = StringIndexer(inputCol="Name", outputCol="NameIndex")
assembler = VectorAssembler(inputCols=["Age", "NameIndex"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[indexer, assembler, rf])
model = pipeline.fit(df_labeled)
predictions = model.transform(df_labeled)
Pipelines chain multiple stages (feature engineering + modeling). They’re best practice for scalable machine learning workflows.
23. Structured Streaming with Aggregations
from pyspark.sql.functions import window
stream_df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .option("includeTimestamp", True) \
    .load()
agg = stream_df.groupBy(window("timestamp", "10 minutes")).count()
query = agg.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()
You can perform real-time analytics over time windows in Structured Streaming (the example uses a 10-minute tumbling window; pass a slide duration to window() for sliding windows), which is useful for log monitoring, fraud detection, and IoT data.
24. Delta Lake
df.write.format("delta").save("/delta/events")
delta_df = spark.read.format("delta").load("/delta/events")
Delta Lake (an add-on, available via the delta-spark package) brings ACID transactions, schema enforcement, and time travel to data lakes, which is essential for production-grade big data pipelines.
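Time travel, mentioned above, is exposed through read options; a hedged sketch where the version number is a placeholder and the session is assumed to be configured for Delta Lake.
# Read the table as it looked at an earlier version (version 0 is a placeholder;
# option("timestampAsOf", "<timestamp>") works the same way)
old_df = spark.read.format("delta").option("versionAsOf", 0).load("/delta/events")
old_df.show()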
Conclusion
This PySpark Cheatsheet is your quick reference for working with big data and distributed ML. From DataFrame basics and SQL queries to MLlib and streaming, this guide has everything you need to accelerate your projects.
Keep it bookmarked and use it whenever you’re building data pipelines, analytics dashboards or large-scale ML models.