Spark Performance Tuning on Databricks: 10 Fixes That Actually Matter

CelestInfo Software Solutions Pvt. Ltd. Aug 28, 2025

Quick answer: The 10 highest-impact Spark optimizations on Databricks are: right-size your cluster, use Delta cache, broadcast small tables in joins, replace row-at-a-time UDFs with Pandas UDFs, fix your partition strategy, avoid collect() on large datasets, push predicates down early, Z-order Delta tables, leverage Adaptive Query Execution (AQE), and monitor with Spark UI. Most teams get 3–10x improvements from these fixes without touching their business logic.

Last updated: September 2025

Introduction

Slow Spark jobs aren't usually a Spark problem. They're a configuration problem. We've seen teams double their cluster size three times before looking at the Spark UI, only to discover that 95% of the runtime was one skewed partition. The fixes below aren't theoretical - they're the 10 things that actually move the needle when we profile real Databricks workloads.


1. Choose the Right Cluster Size


Start small and scale up based on data. A common anti-pattern is launching a 16-node cluster for a job that processes 2 GB of data. That cluster spends more time on Spark overhead (task scheduling, shuffle coordination) than actual computation.


Before: 16-node i3.xlarge cluster, job runs in 8 minutes (mostly shuffle overhead).

After: 4-node i3.xlarge cluster, same job runs in 6 minutes. Cheaper and faster.


The rule: check the Spark UI after each run. If most tasks complete in under 100ms, your cluster is oversized. If tasks are taking 10+ minutes and you see spill-to-disk, you need more memory per executor or more nodes.
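As a rough sizing sketch, you can estimate how many cores a scan-heavy job can actually keep busy. The 128 MB-per-partition target below is a common rule of thumb, not a Databricks-published formula:

```python
def max_useful_cores(total_input_bytes: int,
                     target_partition_bytes: int = 128 * 1024**2) -> int:
    """Estimate how many cores a scan-heavy job can keep busy.

    Assumes roughly one task per 128 MB input split - a common rule
    of thumb, not a Databricks-published formula.
    """
    return max(1, total_input_bytes // target_partition_bytes)

# A 2 GB job yields ~16 useful tasks, so a 16-node cluster
# (dozens of cores) spends most of its time idle
cores_for_2gb = max_useful_cores(2 * 1024**3)  # -> 16
```

If the estimate is far below the core count of the cluster you launched, start smaller and let the Spark UI tell you when to grow.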


2. Use Delta Cache


Delta cache stores copies of remote Parquet files on the local SSDs of your cluster nodes. Subsequent reads of the same data skip the network round-trip to cloud storage entirely. On Delta-heavy workloads, this alone can cut query times by 50–70%.


SQL - Delta Cache
-- Explicitly cache a table
CACHE SELECT * FROM production.sales.orders;

-- Or enable auto-caching (on by default for Delta tables on i3/r5d instances)
SET spark.databricks.io.cache.enabled = true;

Before: Repeated query on a 50 GB Delta table - 45 seconds (cold read from S3).

After: Same query, second run - 8 seconds (read from local SSD cache).


3. Broadcast Joins for Small Tables


When you join a large table with a small lookup table, Spark's default behavior is a sort-merge join that shuffles both tables across the cluster. Broadcasting the small table to every executor eliminates the shuffle entirely.


PySpark - Broadcast Join
from pyspark.sql.functions import broadcast

# Before: sort-merge join (shuffles both tables)
result = orders_df.join(products_df, "product_id")

# After: broadcast join (sends products_df to all executors)
result = orders_df.join(broadcast(products_df), "product_id")

Config - Auto-broadcast threshold
# Default is 10MB. Increase if your lookup tables are larger.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100MB")

Before: Sort-merge join of 500M-row orders with 10K-row products - 4 minutes (shuffle-heavy).

After: Broadcast join - 35 seconds. The 10K-row products table fits easily in memory.


4. Avoid UDFs When Possible (Use Pandas UDFs)


Row-at-a-time Python UDFs are the single worst performance killer in PySpark. Each row gets serialized from the JVM to Python, processed, and serialized back. Pandas UDFs (vectorized UDFs) process entire columns at once using Apache Arrow, running 10–100x faster.


PySpark - Row UDF vs Pandas UDF
import pandas as pd
import numpy as np
from pyspark.sql.functions import udf, pandas_udf

# SLOW: Row-at-a-time UDF (serializes every row individually)
@udf("double")
def calc_discount_slow(price, category):
    return price * 0.1 if category == "clearance" else price * 0.05

# FAST: Pandas UDF (processes entire column vectors via Apache Arrow)
@pandas_udf("double")
def calc_discount_fast(price: pd.Series, category: pd.Series) -> pd.Series:
    # Same logic, vectorized: 10% off clearance, 5% off everything else
    return pd.Series(np.where(category == "clearance", price * 0.1, price * 0.05))

Before: Row UDF on 100M rows - 12 minutes.

After: Pandas UDF on the same 100M rows - 50 seconds.


Even better: check if your UDF can be replaced entirely with built-in Spark SQL functions. CASE WHEN and when().otherwise() run natively in the JVM with zero serialization overhead.


5. Fix Your Partition Strategy


Two problems: too many small files (thousands of tiny partitions) and too few large files (one partition doing all the work). Both kill performance.


PySpark - Partition Management
# Too many small files? Coalesce to reduce partition count.
df.coalesce(10).write.format("delta").save("/mnt/output/")

# Need to redistribute data before a wide transformation?
# repartition() triggers a full shuffle but ensures even distribution.
df.repartition(200, "customer_id").groupBy("customer_id").agg(...)

Before: Write produces 4,000 files averaging 2 MB each. Downstream reads spend 80% of time on file listing overhead.

After: coalesce(50) produces 50 files averaging 160 MB each. Downstream reads are 6x faster.


6. Avoid collect() on Large Datasets


collect() pulls the entire dataset into the driver's memory. On a 10 GB DataFrame, this crashes the driver with an OutOfMemoryError. Use display() in Databricks notebooks (auto-limits to 1,000 rows) or toPandas() with explicit limits.


PySpark - Safe Alternatives
# BAD: crashes driver on large data
all_rows = df.collect()

# GOOD: display first 1000 rows in notebook
display(df)

# GOOD: convert a small subset to Pandas
sample_df = df.limit(10000).toPandas()

7. Predicate Pushdown (Filter Early)


Spark's Catalyst optimizer pushes filter conditions down to the data source so entire Parquet files and row groups can be skipped. Catalyst reorders simple predicates on its own, but pushdown breaks when a filter depends on a Python UDF, a non-deterministic expression, or a column derived by an earlier transformation - so apply native-expression filters as early as possible.


PySpark - Filter Ordering
from pyspark.sql.functions import col

# SLOW: filter written after the join; Catalyst can often still push it,
# but not if the predicate involves a UDF or a derived column
result = df.join(other_df, "id").filter(col("date") > "2026-01-01")

# FAST: filter first, so Spark pushes the predicate into the Parquet file scan
result = df.filter(col("date") > "2026-01-01").join(other_df, "id")

Before: Full table scan of 2 TB, then filter - 9 minutes.

After: Predicate pushdown skips 80% of Parquet files - 2 minutes.


8. Z-Order Delta Tables for Query Patterns


Z-ordering co-locates related data in the same set of files based on the columns you specify. This enables file-level pruning: when Spark sees a filter on a Z-ordered column, it skips files that don't contain matching values.


SQL - Z-Ordering
-- Z-order by the columns you filter on most
OPTIMIZE production.sales.orders
  ZORDER BY (customer_id, order_date);

-- Check file statistics after optimization
DESCRIBE DETAIL production.sales.orders;

Before: Query filtering on customer_id scans 500 files (no Z-ordering).

After: Same query scans 12 files (Z-ordered on customer_id). 40x less I/O.


Gotcha: Z-ordering works best on columns with high cardinality that you frequently filter on. Don't Z-order on boolean columns (only 2 values) or columns you never filter by.


9. Adaptive Query Execution (AQE)


AQE is enabled by default in Databricks Runtime 7.3+ and re-optimizes the query plan at runtime based on actual data statistics. Three key features:

Coalescing shuffle partitions: merges undersized post-shuffle partitions so every task does meaningful work.

Skew join handling: splits oversized partitions in a skewed join so one straggler task doesn't dominate the stage.

Dynamic join strategy switching: converts a sort-merge join into a broadcast join when one side turns out to be small at runtime.

Config - AQE Settings
-- Verify AQE is enabled (should be by default)
SET spark.sql.adaptive.enabled = true;

-- Enable skew join optimization
SET spark.sql.adaptive.skewJoin.enabled = true;

-- Coalesce shuffle partitions automatically
SET spark.sql.adaptive.coalescePartitions.enabled = true;

Before: 200 shuffle partitions for a 500 MB intermediate result - 190 empty partitions, 10 doing actual work.

After: AQE coalesces to 12 partitions. 40% faster shuffle phase.


10. Monitor with Spark UI and Ganglia


The Spark UI is your primary diagnostic tool. For every slow job, check these three things:

Task duration variance within a stage: if the slowest task runs far longer than the median, you have data skew.

Shuffle read/write volume: multi-GB shuffles usually point to a missing broadcast join or a bad partition strategy.

Spill to memory and disk: any spilled bytes mean executors don't have enough memory for the workload.

The cluster Metrics tab (Ganglia on older Databricks runtimes) shows CPU utilization, memory pressure, and network I/O across all nodes. If CPU sits below 30% while the job is running, your cluster is I/O bound - look at caching and file pruning.


Common Anti-Pattern: Pandas on a Single-Node Cluster


We see this regularly: a data scientist writes pandas code, runs it on a single-node Databricks cluster, and wonders why it's slow on 50 GB of data. Pandas runs on a single core and loads everything into RAM. For large datasets, use the Pandas API on Spark (formerly Koalas), which provides the familiar pandas syntax but distributes computation across the cluster.


PySpark - Pandas API on Spark
import pyspark.pandas as ps

# Same pandas syntax, distributed across cluster
df = ps.read_parquet("/mnt/data/large_dataset/")
result = df.groupby("category").agg({"revenue": "sum", "orders": "count"})
result.sort_values("revenue", ascending=False).head(20)

Gotcha: Photon Engine Cost vs Performance


Databricks Photon engine rewrites Spark SQL execution in C++ and delivers 2–8x performance gains on scan-heavy, aggregation-heavy SQL workloads. But Photon costs more per DBU - roughly 2x the standard compute rate. Profile your workload before enabling it everywhere. If your job is 70% Python UDFs and 30% SQL scans, Photon only helps with the 30%. Run the same job with and without Photon, compare the total DBU cost, and decide per-job.
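The per-job comparison reduces to simple arithmetic. The 2x multiplier below is the rough figure quoted above, not an exact price - substitute your workspace's actual DBU rates:

```python
def photon_saves_money(minutes_standard: float, minutes_photon: float,
                       photon_dbu_multiplier: float = 2.0) -> bool:
    """Compare total DBU cost of one run with and without Photon.

    DBU consumption scales with runtime; the ~2x multiplier is a rough
    figure, not an exact price - plug in your real rates.
    """
    return minutes_photon * photon_dbu_multiplier < minutes_standard

# A 3x speedup (60 -> 20 min) beats the 2x rate: 40 cost-minutes vs 60
photon_saves_money(60, 20)  # -> True

# A 1.5x speedup (60 -> 40 min) does not: 80 cost-minutes vs 60
photon_saves_money(60, 40)  # -> False
```

In other words, Photon needs to deliver more than a 2x speedup on your specific job before it pays for its 2x rate.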


Key Takeaways

Profile before you scale: the Spark UI shows whether the bottleneck is skew, shuffle, or I/O - adding nodes rarely fixes any of those.

Cut serialization and shuffle first: broadcast small lookup tables, replace row-at-a-time UDFs with Pandas UDFs or built-ins, and filter early.

Let the platform do the work: Delta cache, Z-ordering, and AQE deliver large wins with little or no code change.

Mohan, Senior Data Engineer

Mohan is a Senior Data Engineer at CelestInfo who evaluates and compares data platforms, tools, and architectures to help clients choose the right technology stack.


Q: What is the biggest Spark performance mistake on Databricks?

Over-provisioning clusters while ignoring data skew. A 64-node cluster won't help if 90% of the work lands on one executor because of a skewed join key. Check the Spark UI for task duration variance before adding more nodes.

Q: Should I enable Photon engine on all Databricks clusters?

Not automatically. Photon improves SQL and DataFrame performance significantly but costs more per DBU. Profile your workload first: if it's heavy on SQL scans and aggregations, Photon pays for itself. If it's mostly Python UDFs or ML training, the extra cost may not be justified.

Q: When should I use broadcast joins in Spark?

Use broadcast joins when one side of the join is small enough to fit in executor memory (typically under 100 MB–1 GB). Spark avoids the expensive shuffle step by sending the small table to every executor. Set spark.sql.autoBroadcastJoinThreshold to control the threshold.

Q: What does Z-ordering do for Delta tables?

Z-ordering co-locates related data within the same set of files based on the columns you specify. When you query with a filter on a Z-ordered column, Spark can skip entire files that don't contain matching values, reducing I/O by 50–90% for selective queries.
