AWS Glue vs Lambda for ETL: When to Use Which (And When Neither Is Right)
Last updated: November 2025
Quick answer: Use AWS Glue for batch ETL jobs over 1GB that need Spark, joins, or the Data Catalog. Use Lambda for event-driven transforms under 10GB that finish in under 15 minutes. Use neither for sustained streaming (pick Kinesis Data Analytics or Flink) or complex multi-step orchestration (pick Step Functions with either service).
The Real Question Nobody Asks
The question isn't "which is better" - it's "which fits this workload." AWS's own documentation doesn't help: it markets both services as great for ETL without clearly explaining when each one falls apart. We've built production pipelines with both, and the answer almost always comes down to three things: data volume, execution pattern, and whether you need the Glue Data Catalog.
AWS Glue: The Spark-Powered Workhorse
Glue is Apache Spark under the hood. When you run a Glue ETL job, you're spinning up a Spark cluster managed by AWS. That means you get distributed processing, built-in support for reading/writing Parquet, ORC, JSON, and CSV, and the ability to handle datasets that don't fit in memory on a single machine.
When Glue Makes Sense
- Batch processing over 1GB. Spark's distributed engine shines here. A 50GB CSV-to-Parquet conversion that'd take 45 minutes in Lambda (if it even fit) runs in 8 minutes on Glue with G.2X workers totaling 10 DPUs.
- Complex joins across multiple datasets. If you're joining a 200M-row fact table with dimension tables, Glue handles the shuffle and sort-merge join automatically.
- Data Catalog integration. Glue Crawlers automatically discover schemas from S3, and the Data Catalog becomes the metadata layer for Athena, Redshift Spectrum, and EMR. If you need a shared schema registry, Glue is the native choice.
- Spark SQL workloads. Your team writes SQL? Glue supports Spark SQL natively. Write your transforms in SQL, and Glue compiles them to Spark execution plans.
- PySpark or Scala custom transforms. When you need UDFs, window functions, or custom partition logic that goes beyond what Lambda can handle.
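As a sketch of how such a job gets defined, here's a minimal payload for boto3's Glue `create_job` call. The IAM role ARN and script location are placeholders, not values from a real account:

```python
def spark_etl_job_config(job_name, script_location, workers=5):
    """Payload for glue.create_job(): a Glue 4.0 Spark job on G.2X workers.

    Each G.2X worker is 2 DPUs, so 5 workers gives the 10 DPUs discussed above.
    The role ARN and script location are hypothetical placeholders.
    """
    return {
        "Name": job_name,
        "Role": "arn:aws:iam::123456789012:role/GlueEtlRole",  # placeholder
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.2X",
        "NumberOfWorkers": workers,
    }


def create_job(config):
    """Register the job with Glue (requires AWS credentials)."""
    import boto3  # lazy import so the config builder works without boto3
    return boto3.client("glue").create_job(**config)
```

Treat this as a starting point: tune `WorkerType` and `NumberOfWorkers` to your data volume rather than copying the defaults.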
Glue Gotchas You'll Hit
- Cold start is 30-90 seconds - even for tiny jobs. You're waiting for a Spark cluster to spin up. For a job that processes 50 rows, you'll spend more time starting Glue than running the actual transform. Glue Flex jobs help (cheaper, but longer cold starts of 2-5 minutes).
- Glue 4.0 auto-scaling doesn't always scale down fast enough. You set `--enable-auto-scaling` and max workers at 20, but if your job has an initial burst followed by light processing, you might pay for 20 workers for 3 minutes when you only needed them for 30 seconds.
- Debugging is painful. Spark error messages are notoriously cryptic. A `Py4JJavaError` wrapping an `AnalysisException` wrapping a `NullPointerException` tells you almost nothing. Enable `--enable-continuous-cloudwatch-log` and `--enable-spark-ui` from day one.
- The Glue Crawler isn't magic. It guesses schemas, and it guesses wrong more often than you'd expect. Nested JSON with mixed types? It'll create a schema with `struct<string>` where you wanted an array. Define schemas explicitly in the Data Catalog instead of relying on crawlers for production workloads.
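Those two debug flags can be baked into every run as job arguments. A minimal sketch (the S3 event-log path is a placeholder; `--spark-event-logs-path` tells Glue where to write the logs the Spark UI reads):

```python
def glue_debug_arguments(spark_logs_path):
    """Job arguments that enable continuous CloudWatch logs and the Spark UI.

    spark_logs_path is a placeholder S3 prefix where Spark event logs land.
    """
    return {
        "--enable-continuous-cloudwatch-log": "true",
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": spark_logs_path,
    }


def start_debuggable_run(job_name, spark_logs_path):
    """Start a job run with debugging on (requires AWS credentials)."""
    import boto3  # lazy import: the argument builder is usable without boto3
    return boto3.client("glue").start_job_run(
        JobName=job_name,
        Arguments=glue_debug_arguments(spark_logs_path),
    )
```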
AWS Lambda: The Event-Driven Scalpel
Lambda is a single-machine, single-invocation compute function. No clusters, no Spark, no distributed processing. It starts in milliseconds (after the first cold start), runs your code, and shuts down. You pay only for the milliseconds it runs.
When Lambda Makes Sense
- Small file processing. A new CSV lands in S3 at 50MB. Lambda picks it up, converts it to Parquet, writes it to a processed bucket. Done in 12 seconds. Cost: fractions of a cent.
- Real-time triggers from S3, SQS, or Kinesis. An S3 `PutObject` event fires Lambda within seconds. SQS messages trigger batch processing with up to 10 messages per invocation. Kinesis records stream through Lambda in near-real-time.
- API response transformation. An API Gateway endpoint receives a request, Lambda transforms the payload and writes to DynamoDB or S3. Sub-second latency, no infrastructure to manage.
- File format conversion. JSON to Parquet, CSV to JSON, XML to flattened CSV. For files under 500MB, Lambda with a `pyarrow` or `pandas` layer handles this faster and cheaper than Glue.
- Lightweight enrichment. Add a timestamp, geocode an address via API, validate a schema. Quick transforms that don't need distributed processing.
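To make the file-conversion case concrete, here's a minimal stdlib-only handler sketch for a CSV-to-JSON transform. The destination bucket name is a placeholder, and production code would add error handling and URL-decoding of the key:

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, one per row (header row required)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def handler(event, context):
    """S3-triggered Lambda: read the newly landed CSV, write it back as JSON."""
    import boto3  # available in the Lambda runtime; lazy so csv_to_records tests locally
    s3 = boto3.client("s3")
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    key = s3_info["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    records = csv_to_records(body)
    s3.put_object(
        Bucket="my-processed-bucket",  # placeholder destination bucket
        Key=key.rsplit(".", 1)[0] + ".json",
        Body=json.dumps(records).encode("utf-8"),
    )
    return {"rows": len(records)}
```

Keeping the transform (`csv_to_records`) separate from the I/O makes it unit-testable without mocking S3.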
Lambda Gotchas You'll Hit
- 15-minute hard timeout. If your ETL job can't finish in 15 minutes, Lambda won't work. Period. No workaround except splitting the job into smaller chunks.
- 10GB memory limit. Your function plus its data must fit in 10GB. A 3GB CSV loaded into a pandas DataFrame with joins and aggregations can easily blow past this. The `/tmp` directory is also capped at 10GB (up from 512MB since late 2022).
- Cold starts with large deployment packages. If your Lambda uses heavy libraries like `pandas`, `pyarrow`, and `numpy`, the first invocation (cold start) can take 3-8 seconds. Use Lambda Layers or container images to mitigate.
- Concurrency limits can bite. Default account concurrency is 1,000 concurrent executions. If you have 200 S3 files landing simultaneously, each triggering a Lambda, and each Lambda calls an API with rate limits, you'll hit throttling fast. Set reserved concurrency per function.
- No built-in schema registry. Lambda doesn't know about the Glue Data Catalog unless you explicitly call it. There's no automatic schema discovery or metadata management.
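For the concurrency gotcha, a rough way to size reserved concurrency is Little's law (concurrency ~= arrival rate x service time). The heuristic below is our own sketch, and applying the limit via `put_function_concurrency` requires AWS credentials:

```python
import math


def safe_reserved_concurrency(downstream_rps, seconds_per_call):
    """Heuristic cap via Little's law: concurrency ~= rate x per-call latency.

    E.g. a downstream API allowing 100 req/s with 0.5s calls supports roughly
    50 concurrent Lambdas before throttling kicks in.
    """
    return max(1, math.floor(downstream_rps * seconds_per_call))


def apply_reserved_concurrency(function_name, limit):
    """Pin the function's concurrency (requires AWS credentials)."""
    import boto3  # lazy import so the sizing helper runs anywhere
    boto3.client("lambda").put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
```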
When Neither Is Right
This is the part most comparison articles skip. There are workloads where both Glue and Lambda are the wrong answer:
- Sustained streaming (sub-second latency). Neither Glue nor Lambda is designed for continuous stream processing. Use Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) or Amazon Managed Streaming for Apache Kafka (MSK) with Flink consumers. Glue Streaming exists, but it's micro-batching in disguise, not true streaming.
- Complex multi-step orchestration. If your pipeline has 15 steps with conditional branching, retries, and human approval gates, don't try to chain Lambdas or sequence Glue jobs manually. Use AWS Step Functions to orchestrate either (or both) services. See our guide on building pipelines in ADF for a comparison with Azure's orchestration approach.
- Heavy ML feature engineering. If you're building feature pipelines for ML models, you'll outgrow both. Look at Amazon SageMaker Processing or Databricks for feature engineering at scale.
Cost Comparison: DPU-Hours vs GB-Seconds
This is where the decision often gets made. Here's how the math works:
Glue: ~$0.44 per DPU-hour, with a 2-DPU minimum. A 10-minute job with 2 DPUs costs about $0.15. A 2-hour job with 10 DPUs costs $8.80. Glue bills per second with a 1-minute minimum (down from a 10-minute minimum before Glue 2.0).
Lambda: $0.0000166667 per GB-second. A function with 1GB memory running for 60 seconds costs $0.001. Running that 1,000 times per day costs $1/day. But bump to 10GB memory and 900 seconds (15 min), and you're at $0.15 per invocation - suddenly comparable to Glue.
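The arithmetic above condenses to a few lines. Prices are the ones quoted here; check the AWS pricing pages for your region before relying on them:

```python
GLUE_DPU_HOUR = 0.44             # USD per DPU-hour, as quoted above
LAMBDA_GB_SECOND = 0.0000166667  # USD per GB-second, as quoted above


def glue_cost(dpus, minutes):
    """Per-second billing with a 1-minute minimum per run."""
    billed_minutes = max(minutes, 1)
    return dpus * (billed_minutes / 60) * GLUE_DPU_HOUR


def lambda_cost(memory_gb, seconds, invocations=1):
    """Compute-only cost; ignores the (tiny) per-request charge."""
    return memory_gb * seconds * LAMBDA_GB_SECOND * invocations


# glue_cost(2, 10)      -> ~$0.15   (10-minute job, 2 DPUs)
# glue_cost(10, 120)    -> $8.80    (2-hour job, 10 DPUs)
# lambda_cost(1, 60)    -> ~$0.001  (1GB, 60 seconds)
# lambda_cost(10, 900)  -> ~$0.15   (10GB, full 15 minutes)
```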
Rule of thumb: For jobs that run less than 5 minutes and need less than 3GB memory, Lambda is almost always cheaper. For jobs that process more than 5GB of data or run longer than 10 minutes, Glue is typically more cost-effective per GB processed because Spark's distributed processing finishes faster.
Decision Flowchart (Text Version)
- Is your data over 10GB per job? Yes → Glue. Lambda physically can't handle it.
- Does the job need to finish in under 2 seconds? Yes → Lambda (warm) or neither (consider DynamoDB Streams + Lambda).
- Do you need continuous streaming? Yes → Neither. Use Kinesis Data Analytics or Flink on MSK.
- Is it triggered by an event (S3, SQS, API Gateway)? Yes + data under 1GB → Lambda. Yes + data over 1GB → Lambda triggers Glue.
- Do you need the Glue Data Catalog? Yes → Glue (or Lambda + Glue Catalog API calls, but that's extra work).
- Is it a scheduled batch job over 1GB with joins? Yes → Glue.
- Is it a simple file transform under 500MB? Yes → Lambda.
- Still not sure? Start with Lambda. It's easier to prototype. Migrate to Glue when you hit Lambda's limits.
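The flowchart above condenses to a small function. The thresholds mirror this article's rules of thumb, not AWS hard limits (except the 10GB Lambda memory cap):

```python
def pick_etl_service(data_gb, needs_streaming=False, event_driven=False,
                     needs_catalog=False, scheduled_batch_with_joins=False):
    """Encode the decision flowchart as code. Thresholds are rules of thumb."""
    if needs_streaming:
        return "neither: use Flink (managed Flink or MSK)"
    if data_gb > 10:
        return "glue"  # won't fit under Lambda's 10GB memory cap
    if event_driven:
        return "lambda" if data_gb <= 1 else "lambda triggers glue"
    if needs_catalog or scheduled_batch_with_joins:
        return "glue"
    if data_gb <= 0.5:
        return "lambda"  # simple file transform under 500MB
    return "lambda (prototype), migrate to glue at its limits"
```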
Key Takeaways
- Glue = Spark cluster for heavy batch ETL. Lambda = single-machine function for lightweight, event-driven transforms.
- Glue's cold start (30-90s) makes it wrong for low-latency workloads. Lambda's 15-min timeout makes it wrong for large batch jobs.
- For sustained streaming, skip both and use Kinesis Data Analytics or Flink.
- Cost crossover point is roughly 5 minutes / 3GB - below that, Lambda wins; above that, Glue wins.
- The best architecture often uses both: Lambda for event handling and triggering, Glue for the heavy processing.
Frequently Asked Questions
Q: Is AWS Glue better than Lambda for ETL?
It depends on data volume and workload pattern. Glue is better for batch processing over 1GB with complex joins and Spark SQL. Lambda is better for event-driven, lightweight transforms under 10GB with sub-minute latency requirements. Neither is universally better.
Q: What is the maximum timeout for AWS Lambda?
AWS Lambda has a hard maximum timeout of 15 minutes and a memory limit of 10GB. If your ETL job regularly exceeds either limit, Glue or Step Functions with Glue is a better fit.
Q: How much does AWS Glue cost compared to Lambda?
Glue charges per DPU-hour (roughly $0.44/DPU-hour). Lambda charges per GB-second of compute ($0.0000166667/GB-second). For small, frequent jobs Lambda is cheaper. For large batch jobs running 30+ minutes, Glue often costs less per GB processed.
Q: Can I use both Glue and Lambda together for ETL?
Yes, and many teams do. A common pattern is Lambda for lightweight event-driven triggers (file arrival, API calls) that orchestrate or kick off Glue jobs for heavy processing. Step Functions can coordinate both.