Orchestrating Data Pipelines with AWS Step Functions: Patterns and Examples
Quick answer: AWS Step Functions orchestrates data pipelines as serverless state machines with visual workflow design, built-in retry/error handling, and native integrations with Glue, Lambda, ECS, and S3. Use Standard workflows for long-running ETL and Express for high-volume real-time processing. No Airflow server to manage, no scheduler to monitor -- but you lose Airflow's scheduling flexibility and open-source ecosystem.
Last updated: November 2025
Why Step Functions for Data Pipelines
Most data teams reach for Airflow or Luigi when they need pipeline orchestration. Step Functions offers a different trade-off: zero infrastructure management in exchange for some flexibility. There's no EC2 instance running a scheduler, no metadata database to back up, no Celery workers to scale. You define your pipeline as a JSON state machine, AWS runs it, and you pay per state transition ($0.025 per 1,000 transitions for Standard workflows).
The visual workflow designer in the AWS console is genuinely useful. You can drag-and-drop states, wire up transitions, and see execution history with per-state timing and input/output payloads. For pipelines with 5-15 steps, this beats staring at a Python DAG file. For a comparison of different ETL approaches on AWS, see our Glue vs Lambda ETL guide.
Standard vs Express Workflows
| Feature | Standard | Express |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Pricing | $0.025 per 1,000 transitions | Per execution (duration + memory) |
| `.sync` run-a-job pattern | Supported | Not supported |
| Execution history | Stored by default (90 days) | Must send to CloudWatch Logs |
| Exactly-once execution | Yes | At-least-once |
| Best for | ETL jobs, batch processing, multi-step workflows | High-volume event processing, stream transforms |
For data pipelines, you'll almost always use Standard workflows. Your Glue jobs run 5-60 minutes, your ECS tasks might take hours, and you'll want Wait states to pause for external dependencies. Express workflows are for scenarios like processing 100,000 S3 events per second -- not typical ETL.
Common Data Pipeline Patterns
Pattern 1: Sequential ETL
The simplest pattern: extract, transform, load in sequence. A Glue job extracts raw data to S3, a Lambda cleans and validates it, and a second Glue job loads it into the target warehouse.
Each state uses the .sync integration pattern, which means Step Functions waits for the job to complete before moving to the next state. Without .sync, the state would fire-and-forget, moving to the next state immediately.
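As a sketch, the difference is just the `.sync` suffix on the resource ARN (the job name `extract-raw-to-s3` and the state names here are illustrative):

```json
{
  "ExtractRawData": {
    "Comment": "With .sync, Step Functions polls the Glue job and only transitions after it finishes",
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": { "JobName": "extract-raw-to-s3" },
    "Next": "TransformData"
  },
  "ExtractFireAndForget": {
    "Comment": "Without .sync, the state succeeds as soon as the job is submitted",
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun",
    "Parameters": { "JobName": "extract-raw-to-s3" },
    "End": true
  }
}
```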
Pattern 2: Parallel Fan-Out with Map State
Need to process 20 tables simultaneously? The Map state iterates over an array and runs each item through the same sub-workflow in parallel. Set MaxConcurrency to control how many items run at once (default: 40).
This is perfect for ETL pipelines that process multiple source tables: pass an array of table names as input, and the Map state kicks off a Glue job for each table in parallel. Much cleaner than creating 20 parallel branches manually.
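A minimal sketch of this pattern, assuming the input is `{"tables": ["orders", "customers", ...]}` and a parameterized Glue job named `per-table-etl` exists (newer console exports use `ItemProcessor` instead of the classic `Iterator` field shown here):

```json
{
  "ProcessAllTables": {
    "Type": "Map",
    "ItemsPath": "$.tables",
    "MaxConcurrency": 10,
    "Iterator": {
      "StartAt": "RunTableJob",
      "States": {
        "RunTableJob": {
          "Type": "Task",
          "Resource": "arn:aws:states:::glue:startJobRun.sync",
          "Parameters": {
            "JobName": "per-table-etl",
            "Arguments": { "--table_name.$": "$" }
          },
          "End": true
        }
      }
    },
    "End": true
  }
}
```

Each iteration receives one array element as its input (here, a single table name), and `MaxConcurrency: 10` caps how many Glue jobs run at once.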
Pattern 3: Choice State for Conditional Logic
Different processing based on data characteristics. For example: if the incoming file is CSV, route to a CSV-specific Glue job; if it's Parquet, route to a different job; if it's an unknown format, send an SNS alert and stop.
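A sketch of that routing as a Choice state -- the input field `$.file_format` and the target state names are assumptions for illustration:

```json
{
  "RouteByFormat": {
    "Type": "Choice",
    "Choices": [
      { "Variable": "$.file_format", "StringEquals": "csv", "Next": "ProcessCsv" },
      { "Variable": "$.file_format", "StringEquals": "parquet", "Next": "ProcessParquet" }
    ],
    "Default": "NotifyUnknownFormat"
  }
}
```

The `Default` branch catches anything the explicit rules don't match, which is where the SNS alert for unknown formats would go.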
Pattern 4: Error Handling with Retry and Catch
Every production pipeline needs error handling. Step Functions provides two mechanisms:
- Retry: Automatically retry a failed state with configurable backoff. Set `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` for exponential backoff.
- Catch: If retries are exhausted, route to a fallback state (e.g., send an SNS notification, log to DynamoDB, or trigger a cleanup Lambda).
Complete Pipeline: 3-Stage ETL in ASL
Here's a complete Amazon States Language (ASL) definition for a 3-stage pipeline that extracts data with Glue, transforms with Lambda, and loads with another Glue job. It includes retry logic and error notification.
```json
{
  "Comment": "3-Stage ETL Pipeline: Extract -> Transform -> Load",
  "StartAt": "ExtractRawData",
  "States": {
    "ExtractRawData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "extract-raw-to-s3",
        "Arguments": {
          "--source_database": "production_db",
          "--target_s3_path": "s3://data-lake/raw/",
          "--run_date.$": "$.run_date"
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["Glue.AWSGlueException"],
          "IntervalSeconds": 60,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "NotifyFailure"
        }
      ],
      "Next": "TransformData"
    },
    "TransformData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-validate",
      "Parameters": {
        "source_path": "s3://data-lake/raw/",
        "target_path": "s3://data-lake/cleaned/",
        "run_date.$": "$.run_date"
      },
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException"],
          "IntervalSeconds": 30,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "NotifyFailure"
        }
      ],
      "Next": "LoadToWarehouse"
    },
    "LoadToWarehouse": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "load-cleaned-to-redshift",
        "Arguments": {
          "--source_s3_path": "s3://data-lake/cleaned/",
          "--target_schema": "analytics",
          "--run_date.$": "$.run_date"
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["Glue.AWSGlueException"],
          "IntervalSeconds": 60,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "NotifyFailure"
        }
      ],
      "Next": "PipelineComplete"
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
        "Message.$": "States.Format('ETL pipeline failed. Error: {}', $.error.Cause)",
        "Subject": "ETL Pipeline Failure Alert"
      },
      "Next": "PipelineFailed"
    },
    "PipelineFailed": {
      "Type": "Fail",
      "Error": "PipelineExecutionFailed",
      "Cause": "One or more pipeline stages failed after retries."
    },
    "PipelineComplete": {
      "Type": "Succeed"
    }
  }
}
```
Integrating with AWS Services
Step Functions has native SDK integrations (not just Lambda-based wrappers) for the services data teams use most:
- AWS Glue: `glue:startJobRun.sync` -- starts a job and waits for completion. Pass arguments dynamically from the state input.
- Lambda: Direct invocation. Great for lightweight transforms, validation, and notification logic. Keep functions under the 15-minute Lambda timeout.
- ECS/Fargate: `ecs:runTask.sync` -- runs a container task and waits. Use for heavy processing that needs more than Lambda's 10GB memory limit.
- S3: `s3:putObject`, `s3:getObject` -- read and write objects directly from Step Functions without a Lambda intermediary.
- DynamoDB: `dynamodb:putItem`, `dynamodb:getItem` -- log pipeline metadata, check processing status, or store configuration.
- SNS/SQS: Publish notifications or queue messages for downstream consumers.
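As a sketch of logging pipeline metadata without a Lambda intermediary, this Task state writes a run record to a hypothetical `pipeline-runs` DynamoDB table, pulling the execution name and start time from the Step Functions context object (`$$`):

```json
{
  "LogRunMetadata": {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
      "TableName": "pipeline-runs",
      "Item": {
        "run_id": { "S.$": "$$.Execution.Name" },
        "status": { "S": "STARTED" },
        "started_at": { "S.$": "$$.Execution.StartTime" }
      }
    },
    "ResultPath": null,
    "Next": "ExtractRawData"
  }
}
```

Setting `ResultPath` to `null` discards the DynamoDB response so the original pipeline input passes through to the next state unchanged.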
Step Functions vs Airflow
| Aspect | Step Functions | Apache Airflow (MWAA) |
|---|---|---|
| Infrastructure | Fully serverless, zero-ops | Managed (MWAA) but still has environment config |
| Scheduling | EventBridge rules (basic cron) | Built-in scheduler with data-aware triggers |
| Pipeline definition | JSON/YAML (ASL) | Python (full programming language) |
| Error handling | Built-in Retry/Catch per state | Retry at task level, email alerts |
| Monitoring | Visual execution history, CloudWatch | Web UI with Gantt charts, logs |
| Cost (light usage) | $0.025/1K transitions (~free) | ~$300+/month minimum for MWAA |
| Ecosystem | AWS-native only | 1000+ community operators |
| Portability | AWS lock-in | Runs anywhere (GCP Composer, self-hosted) |
The honest answer: for AWS-native pipelines with 5-15 steps, Step Functions is simpler and cheaper. For complex DAGs with 50+ tasks, cross-service dependencies, custom operators, and teams that value Python-first tooling, Airflow is more capable. For S3 bucket permission setup that your Step Functions IAM role needs, check our IAM role creation guide.
Gotchas and Limitations
- 1-year execution timeout (Standard). If your pipeline somehow runs for a year, it terminates. Rare for ETL, but be aware.
- 25,000 execution history event limit. Each state transition records several history events, so a Map state processing 10,000 items with 3 states each blows well past the limit. Use Distributed Map, which runs iterations as separate child executions, for large-scale iterations instead.
- Express workflows can't wait for jobs. The `.sync` and `.waitForTaskToken` integration patterns aren't supported, so an Express workflow can't pause for a Glue job or an external callback. Use Standard if you need to wait on anything.
- Nested parallel states hit concurrency limits. A Parallel state inside a Map state can quickly exhaust the default 40 concurrent branches. Set explicit `MaxConcurrency` values.
- No built-in scheduling. You need Amazon EventBridge to trigger workflows on a schedule. It works, but it's one more service to configure compared to Airflow's built-in scheduler.
- Payload size limit: 256KB. Step Functions passes data between states via JSON payloads. If your Lambda returns more than 256KB, store it in S3 and pass the S3 key instead.
Key Takeaways
- Step Functions is the simplest way to orchestrate AWS data pipelines if you don't need Airflow's scheduling power or open-source ecosystem.
- Use Standard workflows for ETL. Express is for sub-5-minute, high-volume event processing.
- Map state replaces manual fan-out. Process N tables in parallel without N parallel branches.
- Retry + Catch = production-grade error handling. Every task state should have both.
- The .sync integration pattern is essential. Without it, Step Functions fires-and-forgets tasks instead of waiting for completion.
- Watch the 25,000 transition limit on large Map state iterations. Use Distributed Map for 10,000+ items.
Frequently Asked Questions
Q: Should I use Standard or Express Step Functions for data pipelines?
Use Standard workflows for long-running ETL jobs (Glue, EMR, ECS) that run minutes to hours. Standard supports the `.sync` run-a-job pattern and has a one-year maximum duration. Express workflows cap at 5 minutes, can't wait for jobs to finish, and are designed for high-volume event processing -- not typical batch ETL.
Q: How does Step Functions compare to Apache Airflow?
Step Functions wins on zero infrastructure management and low cost for small pipelines. Airflow wins on scheduling flexibility (cron, data-aware triggers), Python-first DAG authoring, 1000+ community operators, and cross-cloud portability. For AWS-native pipelines with straightforward dependencies, Step Functions is simpler. For complex DAGs with dozens of tasks and custom logic, Airflow is more powerful.
Q: Can Step Functions run Glue jobs directly?
Yes. Use the glue:startJobRun.sync resource ARN to start a Glue job and wait for completion. Step Functions polls the job status automatically. You can pass dynamic arguments from the state machine input using JSONPath expressions.
Q: What are the main execution limits?
Standard workflows: 1-year max duration, 25,000 execution history events per execution. Express: 5-minute max duration. Map state concurrency caps at 40 parallel iterations. Payload size between states is limited to 256KB. For large-scale iterations (10,000+ items), use Distributed Map mode instead of inline Map.
