AWS Step Functions Data Orchestration

Orchestrating Data Pipelines with AWS Step Functions: Patterns and Examples

Celestinfo Software Solutions Pvt. Ltd. Nov 06, 2025

Quick answer: AWS Step Functions orchestrates data pipelines as serverless state machines with visual workflow design, built-in retry/error handling, and native integrations with Glue, Lambda, ECS, and S3. Use Standard workflows for long-running ETL and Express for high-volume real-time processing. No Airflow server to manage, no scheduler to monitor -- but you lose Airflow's scheduling flexibility and open-source ecosystem.

Last updated: November 2025

Why Step Functions for Data Pipelines

Most data teams reach for Airflow or Luigi when they need pipeline orchestration. Step Functions offers a different trade-off: zero infrastructure management in exchange for some flexibility. There's no EC2 instance running a scheduler, no metadata database to back up, no Celery workers to scale. You define your pipeline as a JSON state machine, AWS runs it, and you pay per state transition ($0.025 per 1,000 transitions for Standard workflows).


The visual workflow designer in the AWS console is genuinely useful. You can drag-and-drop states, wire up transitions, and see execution history with per-state timing and input/output payloads. For pipelines with 5-15 steps, this beats staring at a Python DAG file. For a comparison of different ETL approaches on AWS, see our Glue vs Lambda ETL guide.


Standard vs Express Workflows


Feature | Standard | Express
--- | --- | ---
Max duration | 1 year | 5 minutes
Pricing | $0.025 per 1,000 state transitions | Per execution (duration + memory)
.sync and callback patterns | Supported | Not supported
Execution history | Stored by Step Functions (90-day retention) | Must send to CloudWatch Logs
Execution semantics | Exactly-once | At-least-once
Best for | ETL jobs, batch processing, multi-step workflows | High-volume event processing, stream transforms

For data pipelines, you'll almost always use Standard workflows. Your Glue jobs run 5-60 minutes, your ECS tasks might take hours, and you'll want Wait states to pause for external dependencies. Express workflows are for scenarios like processing 100,000 S3 events per second -- not typical ETL.


Common Data Pipeline Patterns


Pattern 1: Sequential ETL

The simplest pattern: extract, transform, load in sequence. A Glue job extracts raw data to S3, a Lambda cleans and validates it, and a second Glue job loads it into the target warehouse.


Each state uses the .sync integration pattern, which means Step Functions waits for the job to complete before moving to the next state. Without .sync, the state would fire-and-forget, moving to the next state immediately.
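As a minimal sketch, the difference is just the Resource ARN suffix (state and job names here are illustrative fragments, not a complete state machine):

```json
{
  "FireAndForget": {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun",
    "Parameters": { "JobName": "extract-raw-to-s3" },
    "Next": "NextStep"
  },
  "WaitForCompletion": {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": { "JobName": "extract-raw-to-s3" },
    "Next": "NextStep"
  }
}
```

With the plain `startJobRun` ARN, the state succeeds as soon as the job is submitted; with `.sync`, Step Functions polls the job and only transitions when it finishes.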


Pattern 2: Parallel Fan-Out with Map State

Need to process 20 tables simultaneously? The Map state iterates over an array and runs each item through the same sub-workflow in parallel. Set MaxConcurrency to throttle how many items run at once; an inline Map state supports up to 40 concurrent iterations.


This is perfect for ETL pipelines that process multiple source tables: pass an array of table names as input, and the Map state kicks off a Glue job for each table in parallel. Much cleaner than creating 20 parallel branches manually.
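A minimal inline Map sketch (the job name and `tables` field are illustrative): given input like `{"tables": ["orders", "customers"]}`, each array element becomes the `$` input of one iteration.

```json
"ProcessAllTables": {
  "Type": "Map",
  "ItemsPath": "$.tables",
  "MaxConcurrency": 5,
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "INLINE" },
    "StartAt": "RunTableJob",
    "States": {
      "RunTableJob": {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {
          "JobName": "per-table-etl",
          "Arguments": { "--table_name.$": "$" }
        },
        "End": true
      }
    }
  },
  "End": true
}
```

MaxConcurrency of 5 here is a deliberate throttle: kicking off all 20 Glue jobs at once can exhaust Glue DPU capacity or hit concurrent-run limits on the job.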


Pattern 3: Choice State for Conditional Logic

Different processing based on data characteristics. For example: if the incoming file is CSV, route to a CSV-specific Glue job; if it's Parquet, route to a different job; if it's an unknown format, send an SNS alert and stop.
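A sketch of that routing (the `file_format` field and the target state names are assumed for illustration):

```json
"RouteByFormat": {
  "Type": "Choice",
  "Choices": [
    { "Variable": "$.file_format", "StringEquals": "csv", "Next": "ProcessCsv" },
    { "Variable": "$.file_format", "StringEquals": "parquet", "Next": "ProcessParquet" }
  ],
  "Default": "AlertUnknownFormat"
}
```

The Default branch is the catch-all: anything that matches no rule routes to the SNS alert state instead of failing silently.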


Pattern 4: Error Handling with Retry and Catch

Every production pipeline needs error handling. Step Functions provides two mechanisms:

Retry re-runs the failing state automatically. You specify which errors to match (ErrorEquals), the delay before the first retry (IntervalSeconds), the maximum number of attempts (MaxAttempts), and an exponential backoff multiplier (BackoffRate).

Catch routes to a fallback state once retries are exhausted. ResultPath controls where the error details land in the state input, so the fallback state (an SNS alert, a cleanup task) can read them.

Both appear on every Task state in the complete example below.



Complete Pipeline: 3-Stage ETL in ASL


Here's a complete Amazon States Language (ASL) definition for a 3-stage pipeline that extracts data with Glue, transforms with Lambda, and loads with another Glue job. It includes retry logic and error notification.


JSON -- Step Functions ASL Definition
{
  "Comment": "3-Stage ETL Pipeline: Extract -> Transform -> Load",
  "StartAt": "ExtractRawData",
  "States": {
    "ExtractRawData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "extract-raw-to-s3",
        "Arguments": {
          "--source_database": "production_db",
          "--target_s3_path": "s3://data-lake/raw/",
          "--run_date.$": "$.run_date"
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["Glue.AWSGlueException"],
          "IntervalSeconds": 60,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "NotifyFailure"
        }
      ],
      "Next": "TransformData"
    },
    "TransformData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-validate",
      "Parameters": {
        "source_path": "s3://data-lake/raw/",
        "target_path": "s3://data-lake/cleaned/",
        "run_date.$": "$.run_date"
      },
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException"],
          "IntervalSeconds": 30,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "NotifyFailure"
        }
      ],
      "Next": "LoadToWarehouse"
    },
    "LoadToWarehouse": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "load-cleaned-to-redshift",
        "Arguments": {
          "--source_s3_path": "s3://data-lake/cleaned/",
          "--target_schema": "analytics",
          "--run_date.$": "$.run_date"
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["Glue.AWSGlueException"],
          "IntervalSeconds": 60,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "NotifyFailure"
        }
      ],
      "Next": "PipelineComplete"
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
        "Message.$": "States.Format('ETL pipeline failed at step. Error: {}', $.error.Cause)",
        "Subject": "ETL Pipeline Failure Alert"
      },
      "Next": "PipelineFailed"
    },
    "PipelineFailed": {
      "Type": "Fail",
      "Error": "PipelineExecutionFailed",
      "Cause": "One or more pipeline stages failed after retries."
    },
    "PipelineComplete": {
      "Type": "Succeed"
    }
  }
}

Integrating with AWS Services


Step Functions has native SDK integrations (not just Lambda-based wrappers) for the services data teams use most:

- Glue: start job runs and crawlers, with .sync support for job completion
- Lambda: invoke functions synchronously or with the callback pattern (waitForTaskToken)
- ECS / Fargate: run containerized tasks and wait for the container to exit (.sync)
- EMR: create clusters, add steps, terminate clusters
- Athena: start queries and wait for results (startQueryExecution.sync)
- SNS / SQS: publish alerts or enqueue messages for downstream consumers
- DynamoDB: get/put items for pipeline state or watermarks


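For example, a containerized transform too heavy for Lambda can run on Fargate, with the state blocking until the container exits. The cluster ARN, task definition, and subnet below are placeholders:

```json
"RunHeavyTransform": {
  "Type": "Task",
  "Resource": "arn:aws:states:::ecs:runTask.sync",
  "Parameters": {
    "LaunchType": "FARGATE",
    "Cluster": "arn:aws:ecs:us-east-1:123456789012:cluster/etl-cluster",
    "TaskDefinition": "heavy-transform:1",
    "NetworkConfiguration": {
      "AwsvpcConfiguration": {
        "Subnets": ["subnet-0abc1234"],
        "AssignPublicIp": "ENABLED"
      }
    }
  },
  "Next": "PipelineComplete"
}
```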

Step Functions vs Airflow


Aspect | Step Functions | Apache Airflow (MWAA)
--- | --- | ---
Infrastructure | Fully serverless, zero-ops | Managed (MWAA) but still has environment config
Scheduling | EventBridge rules (basic cron) | Built-in scheduler with data-aware triggers
Pipeline definition | JSON/YAML (ASL) | Python (full programming language)
Error handling | Built-in Retry/Catch per state | Retry at task level, email alerts
Monitoring | Visual execution history, CloudWatch | Web UI with Gantt charts, logs
Cost (light usage) | $0.025/1K transitions (~free) | ~$300+/month minimum for MWAA
Ecosystem | AWS-native only | 1000+ community operators
Portability | AWS lock-in | Runs anywhere (GCP Composer, self-hosted)

The honest answer: for AWS-native pipelines with 5-15 steps, Step Functions is simpler and cheaper. For complex DAGs with 50+ tasks, cross-service dependencies, custom operators, and teams that value Python-first tooling, Airflow is more capable. For S3 bucket permission setup that your Step Functions IAM role needs, check our IAM role creation guide.


Gotchas and Limitations

- Payload size: state input/output is capped at 256 KB. Pass S3 pointers between states, not the data itself.
- Execution history: Standard executions are limited to 25,000 history events, so long loops hit a wall. Distributed Map or nested state machines avoid it.
- No built-in scheduler: cron-style triggers come from EventBridge rules, which lack Airflow's data-aware scheduling and backfill tooling.
- ASL verbosity: a 15-state JSON definition is hard to review by hand; consider generating it with CDK or Terraform.
- Waiting on external events requires the callback pattern (waitForTaskToken) and manual token plumbing.

Key Takeaways

- Use Standard workflows for ETL; reserve Express for high-volume, short-lived event processing.
- The .sync integration pattern is what turns fire-and-forget service calls into a real pipeline.
- Map states replace hand-built parallel branches; Distributed Map handles 10,000+ items.
- Put Retry and Catch on every Task state and route failures to an SNS notification.
- Choose Step Functions for AWS-native pipelines under ~15 steps; choose Airflow when you need rich scheduling, Python DAGs, or portability.

Chakri, Cloud Solutions Architect

Chakri is a Cloud Solutions Architect at CelestInfo with hands-on experience across AWS, Azure, GCP, and Snowflake cloud infrastructure.


Frequently Asked Questions

Q: Should I use Standard or Express Step Functions for data pipelines?

Use Standard workflows for long-running ETL jobs (Glue, EMR, ECS) that run minutes to hours. Standard supports Wait states, .sync integrations, and executions up to one year. Express workflows cap at 5 minutes, can't use .sync or callback patterns, and are designed for high-volume event processing -- not typical batch ETL.

Q: How does Step Functions compare to Apache Airflow?

Step Functions wins on zero infrastructure management and low cost for small pipelines. Airflow wins on scheduling flexibility (cron, data-aware triggers), Python-first DAG authoring, 1000+ community operators, and cross-cloud portability. For AWS-native pipelines with straightforward dependencies, Step Functions is simpler. For complex DAGs with dozens of tasks and custom logic, Airflow is more powerful.

Q: Can Step Functions run Glue jobs directly?

Yes. Use the glue:startJobRun.sync resource ARN to start a Glue job and wait for completion. Step Functions polls the job status automatically. You can pass dynamic arguments from the state machine input using JSONPath expressions.

Q: What are the main execution limits?

Standard workflows: one-year max duration, 25,000 execution-history events per execution. Express: 5-minute max duration. Inline Map state supports up to 40 concurrent iterations. Payload size between states is limited to 256 KB. For large-scale iterations (10,000+ items), use Distributed Map mode instead of inline Map.
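A sketch of the Distributed Map shape (the bucket, prefix, and Lambda function name are illustrative): an ItemReader lists S3 objects and each one is processed in its own child workflow execution, sidestepping the 25,000-event history limit.

```json
"ProcessAllObjects": {
  "Type": "Map",
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:listObjectsV2",
    "Parameters": { "Bucket": "data-lake", "Prefix": "raw/" }
  },
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
    "StartAt": "HandleObject",
    "States": {
      "HandleObject": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-object",
        "End": true
      }
    }
  },
  "MaxConcurrency": 500,
  "End": true
}
```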
