Orchestrating Data Pipelines with AWS Step Functions: Patterns and Examples
Quick answer: AWS Step Functions orchestrates data pipelines as serverless state machines with visual workflow design, built-in retry/error handling, and native integrations with Glue, Lambda, ECS, and S3. Use Standard workflows for long-running ETL and Express for high-volume real-time processing. No Airflow server to manage, no scheduler to monitor -- but you lose Airflow's scheduling flexibility and open-source ecosystem.
Last updated: November 2025
Why Step Functions for Data Pipelines
Most data teams reach for Airflow or Luigi when they need pipeline orchestration. Step Functions offers a different trade-off: zero infrastructure management in exchange for some flexibility. There's no EC2 instance running a scheduler, no metadata database to back up, no Celery workers to scale. You define your pipeline as a JSON state machine, AWS runs it, and you pay per state transition ($0.025 per 1,000 transitions for Standard workflows).
The visual workflow designer in the AWS console is genuinely useful. You can drag-and-drop states, wire up transitions, and see execution history with per-state timing and input/output payloads. For pipelines with 5-15 steps, this beats staring at a Python DAG file. For a comparison of different ETL approaches on AWS, see our Glue vs Lambda ETL guide.
Standard vs Express Workflows
| Feature | Standard | Express |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Pricing | $0.025 per 1,000 transitions | Per execution (duration + memory) |
| `.sync` run-a-job pattern | Supported | Not supported |
| Execution history | Stored by default (90 days) | Must send to CloudWatch Logs |
| Exactly-once execution | Yes | At-least-once |
| Best for | ETL jobs, batch processing, multi-step workflows | High-volume event processing, stream transforms |
For data pipelines, you'll almost always use Standard workflows. Your Glue jobs run 5-60 minutes, your ECS tasks might take hours, and you'll want Wait states to pause for external dependencies. Express workflows are for scenarios like processing 100,000 S3 events per second -- not typical ETL.
Common Data Pipeline Patterns
Pattern 1: Sequential ETL
The simplest pattern: extract, transform, load in sequence. A Glue job extracts raw data to S3, a Lambda cleans and validates it, and a second Glue job loads it into the target warehouse.
Each state uses the .sync integration pattern, which means Step Functions waits for the job to complete before moving to the next state. Without .sync, the state would fire-and-forget, moving to the next state immediately.
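As a sketch, the difference is just the `.sync` suffix on the resource ARN (the job name `extract-raw-to-s3` and the state names here are illustrative):

```json
{
  "ExtractRawData": {
    "Comment": "With .sync, Step Functions polls the Glue job and only transitions after it finishes",
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": { "JobName": "extract-raw-to-s3" },
    "Next": "TransformData"
  },
  "ExtractFireAndForget": {
    "Comment": "Without .sync, the state succeeds as soon as the job is submitted",
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun",
    "Parameters": { "JobName": "extract-raw-to-s3" },
    "End": true
  }
}
```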
Pattern 2: Parallel Fan-Out with Map State
Need to process 20 tables simultaneously? The Map state iterates over an array and runs each item through the same sub-workflow in parallel. Set MaxConcurrency to control how many items run at once (default: 40).
This is perfect for ETL pipelines that process multiple source tables: pass an array of table names as input, and the Map state kicks off a Glue job for each table in parallel. Much cleaner than creating 20 parallel branches manually.
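A minimal sketch of this pattern, assuming the input is `{"tables": ["orders", "customers", ...]}` and a parameterized Glue job named `per-table-etl` exists (newer console exports use `ItemProcessor` instead of the classic `Iterator` field shown here):

```json
{
  "ProcessAllTables": {
    "Type": "Map",
    "ItemsPath": "$.tables",
    "MaxConcurrency": 10,
    "Iterator": {
      "StartAt": "RunTableJob",
      "States": {
        "RunTableJob": {
          "Type": "Task",
          "Resource": "arn:aws:states:::glue:startJobRun.sync",
          "Parameters": {
            "JobName": "per-table-etl",
            "Arguments": { "--table_name.$": "$" }
          },
          "End": true
        }
      }
    },
    "End": true
  }
}
```

Each iteration receives one array element as its input (here, a single table name), and `MaxConcurrency: 10` caps how many Glue jobs run at once.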
Pattern 3: Choice State for Conditional Logic
Different processing based on data characteristics. For example: if the incoming file is CSV, route to a CSV-specific Glue job; if it's Parquet, route to a different job; if it's an unknown format, send an SNS alert and stop.
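A sketch of that routing as a Choice state -- the input field `$.file_format` and the target state names are assumptions for illustration:

```json
{
  "RouteByFormat": {
    "Type": "Choice",
    "Choices": [
      { "Variable": "$.file_format", "StringEquals": "csv", "Next": "ProcessCsv" },
      { "Variable": "$.file_format", "StringEquals": "parquet", "Next": "ProcessParquet" }
    ],
    "Default": "NotifyUnknownFormat"
  }
}
```

The `Default` branch catches anything the explicit rules don't match, which is where the SNS alert for unknown formats would go.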
Pattern 4: Error Handling with Retry and Catch
Every production pipeline needs error handling. Step Functions provides two mechanisms:
- Retry: Automatically retry a failed state with configurable backoff. Set `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` for exponential backoff.
- Catch: If retries are exhausted, route to a fallback state (e.g., send an SNS notification, log to DynamoDB, or trigger a cleanup Lambda).
Complete Pipeline: 3-Stage ETL in ASL
Here's a complete Amazon States Language (ASL) definition for a 3-stage pipeline that extracts data with Glue, transforms with Lambda, and loads with another Glue job. It includes retry logic and error notification.
```json
{
  "Comment": "3-Stage ETL Pipeline: Extract -> Transform -> Load",
  "StartAt": "ExtractRawData",
  "States": {
    "ExtractRawData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "extract-raw-to-s3",
        "Arguments": {
          "--source_database": "production_db",
          "--target_s3_path": "s3://data-lake/raw/",
          "--run_date.$": "$.run_date"
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["Glue.AWSGlueException"],
          "IntervalSeconds": 60,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "NotifyFailure"
        }
      ],
      "Next": "TransformData"
    },
    "TransformData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-validate",
      "Parameters": {
        "source_path": "s3://data-lake/raw/",
        "target_path": "s3://data-lake/cleaned/",
        "run_date.$": "$.run_date"
      },
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException"],
          "IntervalSeconds": 30,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "NotifyFailure"
        }
      ],
      "Next": "LoadToWarehouse"
    },
    "LoadToWarehouse": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "load-cleaned-to-redshift",
        "Arguments": {
          "--source_s3_path": "s3://data-lake/cleaned/",
          "--target_schema": "analytics",
          "--run_date.$": "$.run_date"
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["Glue.AWSGlueException"],
          "IntervalSeconds": 60,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "NotifyFailure"
        }
      ],
      "Next": "PipelineComplete"
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
        "Message.$": "States.Format('ETL pipeline failed. Error: {}', $.error.Cause)",
        "Subject": "ETL Pipeline Failure Alert"
      },
      "Next": "PipelineFailed"
    },
    "PipelineFailed": {
      "Type": "Fail",
      "Error": "PipelineExecutionFailed",
      "Cause": "One or more pipeline stages failed after retries."
    },
    "PipelineComplete": {
      "Type": "Succeed"
    }
  }
}
```
Integrating with AWS Services
Step Functions has native SDK integrations (not just Lambda-based wrappers) for the services data teams use most:
- AWS Glue: `glue:startJobRun.sync` -- starts a job and waits for completion. Pass arguments dynamically from the state input.
- Lambda: Direct invocation. Great for lightweight transforms, validation, and notification logic. Keep functions under the 15-minute Lambda timeout.
- ECS/Fargate: `ecs:runTask.sync` -- runs a container task and waits. Use for heavy processing that needs more than Lambda's 10GB memory limit.
- S3: `s3:putObject`, `s3:getObject` -- read and write objects directly from Step Functions without a Lambda intermediary.
- DynamoDB: `dynamodb:putItem`, `dynamodb:getItem` -- log pipeline metadata, check processing status, or store configuration.
- SNS/SQS: Publish notifications or queue messages for downstream consumers.
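As a sketch of logging pipeline metadata without a Lambda intermediary, this Task state writes a run record to a hypothetical `pipeline-runs` DynamoDB table, pulling the execution name and start time from the Step Functions context object (`$$`):

```json
{
  "LogRunMetadata": {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
      "TableName": "pipeline-runs",
      "Item": {
        "run_id": { "S.$": "$$.Execution.Name" },
        "status": { "S": "STARTED" },
        "started_at": { "S.$": "$$.Execution.StartTime" }
      }
    },
    "ResultPath": null,
    "Next": "ExtractRawData"
  }
}
```

Setting `ResultPath` to `null` discards the DynamoDB response so the original pipeline input passes through to the next state unchanged.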
Step Functions vs Airflow
| Aspect | Step Functions | Apache Airflow (MWAA) |
|---|---|---|
| Infrastructure | Fully serverless, zero-ops | Managed (MWAA) but still has environment config |
| Scheduling | EventBridge rules (basic cron) | Built-in scheduler with data-aware triggers |
| Pipeline definition | JSON/YAML (ASL) | Python (full programming language) |
| Error handling | Built-in Retry/Catch per state | Retry at task level, email alerts |
| Monitoring | Visual execution history, CloudWatch | Web UI with Gantt charts, logs |
| Cost (light usage) | $0.025/1K transitions (~free) | ~$300+/month minimum for MWAA |
| Ecosystem | AWS-native only | 1000+ community operators |
| Portability | AWS lock-in | Runs anywhere (GCP Composer, self-hosted) |
The honest answer: for AWS-native pipelines with 5-15 steps, Step Functions is simpler and cheaper. For complex DAGs with 50+ tasks, cross-service dependencies, custom operators, and teams that value Python-first tooling, Airflow is more capable. For S3 bucket permission setup that your Step Functions IAM role needs, check our IAM role creation guide.
Gotchas and Limitations
- 1-year execution timeout (Standard). If your pipeline somehow runs for a year, it terminates. Rare for ETL, but be aware.
- 25,000 execution history event limit. Each state transition records several history events, so a Map state processing 10,000 items with 3 states each blows well past the limit. Use Distributed Map, which runs iterations as separate child executions, for large-scale iterations instead.
- Express workflows can't wait for jobs. The `.sync` and `.waitForTaskToken` integration patterns aren't supported, so an Express workflow can't pause for a Glue job or an external callback. Use Standard if you need to wait on anything.
- Nested parallel states hit concurrency limits. A Parallel state inside a Map state can quickly exhaust the default 40 concurrent branches. Set explicit `MaxConcurrency` values.
- No built-in scheduling. You need Amazon EventBridge to trigger workflows on a schedule. It works, but it's one more service to configure compared to Airflow's built-in scheduler.
- Payload size limit: 256KB. Step Functions passes data between states via JSON payloads. If your Lambda returns more than 256KB, store it in S3 and pass the S3 key instead.
Key Takeaways
- Step Functions is the simplest way to orchestrate AWS data pipelines if you don't need Airflow's scheduling power or open-source ecosystem.
- Use Standard workflows for ETL. Express is for sub-5-minute, high-volume event processing.
- Map state replaces manual fan-out. Process N tables in parallel without N parallel branches.
- Retry + Catch = production-grade error handling. Every task state should have both.
- The .sync integration pattern is essential. Without it, Step Functions fires-and-forgets tasks instead of waiting for completion.
- Watch the 25,000 transition limit on large Map state iterations. Use Distributed Map for 10,000+ items.
Frequently Asked Questions
Q: Should I use Standard or Express Step Functions for data pipelines?
Use Standard workflows for long-running ETL jobs (Glue, EMR, ECS) that run minutes to hours. Standard supports the `.sync` run-a-job pattern and has a one-year maximum duration. Express workflows cap at 5 minutes, can't wait for jobs to finish, and are designed for high-volume event processing -- not typical batch ETL.
Q: How does Step Functions compare to Apache Airflow?
Step Functions wins on zero infrastructure management and low cost for small pipelines. Airflow wins on scheduling flexibility (cron, data-aware triggers), Python-first DAG authoring, 1000+ community operators, and cross-cloud portability. For AWS-native pipelines with straightforward dependencies, Step Functions is simpler. For complex DAGs with dozens of tasks and custom logic, Airflow is more powerful.
Q: Can Step Functions run Glue jobs directly?
Yes. Use the glue:startJobRun.sync resource ARN to start a Glue job and wait for completion. Step Functions polls the job status automatically. You can pass dynamic arguments from the state machine input using JSONPath expressions.
Q: What are the main execution limits?
Standard workflows: 1-year max duration, 25,000 execution history events per execution. Express: 5-minute max duration. Map state concurrency caps at 40 parallel iterations. Payload size between states is limited to 256KB. For large-scale iterations (10,000+ items), use Distributed Map mode instead of inline Map.
