Spark Performance Tuning on Databricks: 10 Fixes That Actually Matter

CelestInfo Software Solutions Pvt. Ltd. Aug 28, 2025

Quick answer: The 10 highest-impact Spark optimizations on Databricks are: right-size your cluster, use Delta cache, broadcast small tables in joins, replace row-at-a-time UDFs with Pandas UDFs, fix your partition strategy, avoid collect() on large datasets, push predicates down early, Z-order Delta tables, leverage Adaptive Query Execution (AQE), and monitor with Spark UI. Most teams get 3–10x improvements from these fixes without touching their business logic.

Last updated: September 2025

Introduction

Slow Spark jobs aren't usually a Spark problem. They're a configuration problem. We've seen teams double their cluster size three times before looking at the Spark UI, only to discover that 95% of the runtime was one skewed partition. The fixes below aren't theoretical - they're the 10 things that actually move the needle when we profile real Databricks workloads.


1. Choose the Right Cluster Size


Start small and scale up based on data. A common anti-pattern is launching a 16-node cluster for a job that processes 2 GB of data. That cluster spends more time on Spark overhead (task scheduling, shuffle coordination) than actual computation.


Before: 16-node i3.xlarge cluster, job runs in 8 minutes (mostly shuffle overhead).

After: 4-node i3.xlarge cluster, same job runs in 6 minutes. Cheaper and faster.


The rule: check the Spark UI after each run. If most tasks complete in under 100ms, your cluster is oversized. If tasks are taking 10+ minutes and you see spill-to-disk, you need more memory per executor or more nodes.
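As a rough sizing sketch, you can estimate how many cores a scan-heavy job can actually keep busy. The 128 MB-per-partition target below is a common rule of thumb, not a Databricks-published formula:

```python
def max_useful_cores(total_input_bytes: int,
                     target_partition_bytes: int = 128 * 1024**2) -> int:
    """Estimate how many cores a scan-heavy job can keep busy.

    Assumes roughly one task per 128 MB input split - a common rule
    of thumb, not a Databricks-published formula.
    """
    return max(1, total_input_bytes // target_partition_bytes)

# A 2 GB job yields ~16 useful tasks, so a 16-node cluster
# (dozens of cores) spends most of its time idle
cores_for_2gb = max_useful_cores(2 * 1024**3)  # -> 16
```

If the estimate is far below the core count of the cluster you launched, start smaller and let the Spark UI tell you when to grow.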


2. Use Delta Cache


Delta cache stores copies of remote Parquet files on the local SSDs of your cluster nodes. Subsequent reads of the same data skip the network round-trip to cloud storage entirely. On Delta-heavy workloads, this alone can cut query times by 50–70%.


SQL - Delta Cache
-- Explicitly cache a table
CACHE SELECT * FROM production.sales.orders;

-- Or enable auto-caching (on by default for Delta tables on i3/r5d instances)
SET spark.databricks.io.cache.enabled = true;

Before: Repeated query on a 50 GB Delta table - 45 seconds (cold read from S3).

After: Same query, second run - 8 seconds (read from local SSD cache).


3. Broadcast Joins for Small Tables


When you join a large table with a small lookup table, Spark's default behavior is a sort-merge join that shuffles both tables across the cluster. Broadcasting the small table to every executor eliminates the shuffle entirely.


PySpark - Broadcast Join
from pyspark.sql.functions import broadcast

# Before: sort-merge join (shuffles both tables)
result = orders_df.join(products_df, "product_id")

# After: broadcast join (sends products_df to all executors)
result = orders_df.join(broadcast(products_df), "product_id")

Config - Auto-broadcast threshold
# Default is 10MB. Increase if your lookup tables are larger.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100MB")

Before: Sort-merge join of 500M-row orders with 10K-row products - 4 minutes (shuffle-heavy).

After: Broadcast join - 35 seconds. The 10K-row products table fits easily in memory.


4. Avoid UDFs When Possible (Use Pandas UDFs)


Row-at-a-time Python UDFs are the single worst performance killer in PySpark. Each row gets serialized from the JVM to Python, processed, and serialized back. Pandas UDFs (vectorized UDFs) process entire columns at once using Apache Arrow, running 10–100x faster.


PySpark - Row UDF vs Pandas UDF
import pandas as pd
import numpy as np
from pyspark.sql.functions import udf, pandas_udf

# SLOW: Row-at-a-time UDF (serializes every row individually)
@udf("double")
def calc_discount_slow(price, category):
    return price * 0.1 if category == "clearance" else price * 0.05

# FAST: Pandas UDF (processes entire column vectors via Apache Arrow)
@pandas_udf("double")
def calc_discount_fast(price: pd.Series, category: pd.Series) -> pd.Series:
    # Same logic, vectorized: 10% off clearance, 5% off everything else
    return pd.Series(np.where(category == "clearance", price * 0.1, price * 0.05))

Before: Row UDF on 100M rows - 12 minutes.

After: Pandas UDF on the same 100M rows - 50 seconds.


Even better: check if your UDF can be replaced entirely with built-in Spark SQL functions. CASE WHEN and when().otherwise() run natively in the JVM with zero serialization overhead.


5. Fix Your Partition Strategy


Two problems: too many small files (thousands of tiny partitions) and too few large files (one partition doing all the work). Both kill performance.


PySpark - Partition Management
# Too many small files? Coalesce to reduce partition count.
df.coalesce(10).write.format("delta").save("/mnt/output/")

# Need to redistribute data before a wide transformation?
# repartition() triggers a full shuffle but ensures even distribution.
df.repartition(200, "customer_id").groupBy("customer_id").agg(...)

Before: Write produces 4,000 files averaging 2 MB each. Downstream reads spend 80% of time on file listing overhead.

After: coalesce(50) produces 50 files averaging 160 MB each. Downstream reads are 6x faster.


6. Avoid collect() on Large Datasets


collect() pulls the entire dataset into the driver's memory. On a 10 GB DataFrame, this crashes the driver with an OutOfMemoryError. Use display() in Databricks notebooks (auto-limits to 1,000 rows) or toPandas() with explicit limits.


PySpark - Safe Alternatives
# BAD: crashes driver on large data
all_rows = df.collect()

# GOOD: display first 1000 rows in notebook
display(df)

# GOOD: convert a small subset to Pandas
sample_df = df.limit(10000).toPandas()

7. Predicate Pushdown (Filter Early)


Spark's Catalyst optimizer pushes filter conditions down to the data source so entire Parquet files and row groups can be skipped. Catalyst reorders simple predicates on its own, but pushdown breaks when a filter depends on a Python UDF, a non-deterministic expression, or a column derived by an earlier transformation - so apply native-expression filters as early as possible.


PySpark - Filter Ordering
from pyspark.sql.functions import col

# SLOW: filter written after the join; Catalyst can often still push it,
# but not if the predicate involves a UDF or a derived column
result = df.join(other_df, "id").filter(col("date") > "2026-01-01")

# FAST: filter first, so Spark pushes the predicate into the Parquet file scan
result = df.filter(col("date") > "2026-01-01").join(other_df, "id")

Before: Full table scan of 2 TB, then filter - 9 minutes.

After: Predicate pushdown skips 80% of Parquet files - 2 minutes.


8. Z-Order Delta Tables for Query Patterns


Z-ordering co-locates related data in the same set of files based on the columns you specify. This enables file-level pruning: when Spark sees a filter on a Z-ordered column, it skips files that don't contain matching values.


SQL - Z-Ordering
-- Z-order by the columns you filter on most
OPTIMIZE production.sales.orders
  ZORDER BY (customer_id, order_date);

-- Check file statistics after optimization
DESCRIBE DETAIL production.sales.orders;

Before: Query filtering on customer_id scans 500 files (no Z-ordering).

After: Same query scans 12 files (Z-ordered on customer_id). 40x less I/O.


Gotcha: Z-ordering works best on columns with high cardinality that you frequently filter on. Don't Z-order on boolean columns (only 2 values) or columns you never filter by.


9. Adaptive Query Execution (AQE)


AQE is enabled by default in Databricks Runtime 7.3+ and re-optimizes the query plan at runtime based on actual data statistics. Three key features:

Coalescing shuffle partitions: merges undersized post-shuffle partitions so every task does meaningful work.

Skew join handling: splits oversized partitions in a skewed join so one straggler task doesn't dominate the stage.

Dynamic join strategy switching: converts a sort-merge join into a broadcast join when one side turns out to be small at runtime.

Config - AQE Settings
-- Verify AQE is enabled (should be by default)
SET spark.sql.adaptive.enabled = true;

-- Enable skew join optimization
SET spark.sql.adaptive.skewJoin.enabled = true;

-- Coalesce shuffle partitions automatically
SET spark.sql.adaptive.coalescePartitions.enabled = true;

Before: 200 shuffle partitions for a 500 MB intermediate result - 190 empty partitions, 10 doing actual work.

After: AQE coalesces to 12 partitions. 40% faster shuffle phase.


10. Monitor with Spark UI and Ganglia


The Spark UI is your primary diagnostic tool. For every slow job, check these three things:

Task duration variance within a stage: if the slowest task runs far longer than the median, you have data skew.

Shuffle read/write volume: multi-GB shuffles usually point to a missing broadcast join or a bad partition strategy.

Spill to memory and disk: any spilled bytes mean executors don't have enough memory for the workload.

The cluster Metrics tab (Ganglia on older Databricks runtimes) shows CPU utilization, memory pressure, and network I/O across all nodes. If CPU sits below 30% while the job is running, your cluster is I/O bound - look at caching and file pruning.


Common Anti-Pattern: Pandas on a Single-Node Cluster


We see this regularly: a data scientist writes pandas code, runs it on a single-node Databricks cluster, and wonders why it's slow on 50 GB of data. Pandas runs on a single core and loads everything into RAM. For large datasets, use the Pandas API on Spark (formerly Koalas), which provides the familiar pandas syntax but distributes computation across the cluster.


PySpark - Pandas API on Spark
import pyspark.pandas as ps

# Same pandas syntax, distributed across cluster
df = ps.read_parquet("/mnt/data/large_dataset/")
result = df.groupby("category").agg({"revenue": "sum", "orders": "count"})
result.sort_values("revenue", ascending=False).head(20)

Gotcha: Photon Engine Cost vs Performance


Databricks Photon engine rewrites Spark SQL execution in C++ and delivers 2–8x performance gains on scan-heavy, aggregation-heavy SQL workloads. But Photon costs more per DBU - roughly 2x the standard compute rate. Profile your workload before enabling it everywhere. If your job is 70% Python UDFs and 30% SQL scans, Photon only helps with the 30%. Run the same job with and without Photon, compare the total DBU cost, and decide per-job.
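The per-job comparison reduces to simple arithmetic. The 2x multiplier below is the rough figure quoted above, not an exact price - substitute your workspace's actual DBU rates:

```python
def photon_saves_money(minutes_standard: float, minutes_photon: float,
                       photon_dbu_multiplier: float = 2.0) -> bool:
    """Compare total DBU cost of one run with and without Photon.

    DBU consumption scales with runtime; the ~2x multiplier is a rough
    figure, not an exact price - plug in your real rates.
    """
    return minutes_photon * photon_dbu_multiplier < minutes_standard

# A 3x speedup (60 -> 20 min) beats the 2x rate: 40 cost-minutes vs 60
photon_saves_money(60, 20)  # -> True

# A 1.5x speedup (60 -> 40 min) does not: 80 cost-minutes vs 60
photon_saves_money(60, 40)  # -> False
```

In other words, Photon needs to deliver more than a 2x speedup on your specific job before it pays for its 2x rate.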


Key Takeaways

Profile before you scale: the Spark UI shows whether the bottleneck is skew, shuffle, or I/O - adding nodes rarely fixes any of those.

Cut serialization and shuffle first: broadcast small lookup tables, replace row-at-a-time UDFs with Pandas UDFs or built-ins, and filter early.

Let the platform do the work: Delta cache, Z-ordering, and AQE deliver large wins with little or no code change.

Mohan, Senior Data Engineer

Mohan is a Senior Data Engineer at CelestInfo who evaluates and compares data platforms, tools, and architectures to help clients choose the right technology stack.


Q: What is the biggest Spark performance mistake on Databricks?

Over-provisioning clusters while ignoring data skew. A 64-node cluster won't help if 90% of the work lands on one executor because of a skewed join key. Check the Spark UI for task duration variance before adding more nodes.

Q: Should I enable Photon engine on all Databricks clusters?

Not automatically. Photon improves SQL and DataFrame performance significantly but costs more per DBU. Profile your workload first: if it's heavy on SQL scans and aggregations, Photon pays for itself. If it's mostly Python UDFs or ML training, the extra cost may not be justified.

Q: When should I use broadcast joins in Spark?

Use broadcast joins when one side of the join is small enough to fit in executor memory (typically under 100 MB–1 GB). Spark avoids the expensive shuffle step by sending the small table to every executor. Set spark.sql.autoBroadcastJoinThreshold to control the threshold.

Q: What does Z-ordering do for Delta tables?

Z-ordering co-locates related data within the same set of files based on the columns you specify. When you query with a filter on a Z-ordered column, Spark can skip entire files that don't contain matching values, reducing I/O by 50–90% for selective queries.
