Incremental Data Loading in Azure Data Factory: Watermark, CDC, and Tumbling Window Patterns

Celestinfo Software Solutions Pvt. Ltd. Apr 24, 2025

Last updated: May 2025

Quick answer: ADF supports 4 incremental load patterns: watermark (store last-loaded timestamp, filter new records), change tracking (Azure SQL's built-in CT), full CDC with Debezium via Event Hubs, and tumbling window triggers for time-partitioned loads. The watermark pattern covers 80% of use cases. Key gotcha: always update the watermark AFTER a successful copy, not before - and the Copy Activity's upsert mode uses MERGE, which can deadlock on concurrent runs.

Introduction

Azure Data Factory supports four incremental load patterns: watermark columns (best for append-only tables with datetime/ID columns), Change Tracking on Azure SQL (best for tables with updates and deletes), full CDC with Debezium via Event Hubs (best when you need sub-minute latency), and Tumbling Window triggers (best for time-partitioned data with fixed processing windows). Each pattern can reduce data transfer by 90-99% compared to full loads. Below is the implementation for each.

Incremental loading - pulling only new and changed records since the last run - is how you scale data ingestion without killing your source systems. ADF has several built-in patterns for this, but the documentation doesn't always cover the gotchas you'll hit in production. This guide does. For the basics of creating pipelines in ADF, check our getting-started guide.

Pattern 1: Watermark-Based Incremental Load

The watermark pattern is the simplest and most common. You store a "last loaded" timestamp in a control table, and each pipeline run filters source records where modified_date > @lastWatermark. After a successful copy, you update the watermark to the upper bound of the window you just copied - typically the pipeline's start time, since the copy also filters on modified_date <= @currentTime.

Setting Up the Control Table

-- Control table in your target database
CREATE TABLE etl.watermark_control (
    table_name VARCHAR(255) PRIMARY KEY,
    last_watermark DATETIME2,
    last_run_status VARCHAR(20),
    last_run_time DATETIME2
);

-- Initialize for each table
INSERT INTO etl.watermark_control VALUES
('orders', '1900-01-01', 'SUCCESS', GETDATE()),
('customers', '1900-01-01', 'SUCCESS', GETDATE());

The ADF Pipeline Flow

The pipeline has 3 activities chained with On Success dependencies:

  1. Lookup Activity - reads the current watermark from etl.watermark_control
  2. Copy Activity - copies records where modified_date > @lastWatermark AND modified_date <= @currentTime
  3. Stored Procedure Activity - updates the watermark to @currentTime only on success

-- Stored procedure to update watermark
CREATE PROCEDURE etl.usp_update_watermark
    @table_name VARCHAR(255),
    @new_watermark DATETIME2
AS
BEGIN
    UPDATE etl.watermark_control
    SET last_watermark = @new_watermark,
        last_run_status = 'SUCCESS',
        last_run_time = GETDATE()
    WHERE table_name = @table_name;
END;
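Wired together, the three activities look like this in abridged pipeline JSON. This is a sketch: dataset references, linked services, and the Copy Activity's input/output dataset bindings are omitted for brevity, and names like pl_incremental_orders and ds_control_table are illustrative.

```json
{
  "name": "pl_incremental_orders",
  "properties": {
    "activities": [
      {
        "name": "LookupWatermark",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT last_watermark FROM etl.watermark_control WHERE table_name = 'orders'"
          },
          "dataset": { "referenceName": "ds_control_table", "type": "DatasetReference" }
        }
      },
      {
        "name": "CopyNewRows",
        "type": "Copy",
        "dependsOn": [ { "activity": "LookupWatermark", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": {
              "value": "SELECT * FROM dbo.orders WHERE modified_date > '@{activity('LookupWatermark').output.firstRow.last_watermark}' AND modified_date <= '@{pipeline().TriggerTime}'",
              "type": "Expression"
            }
          },
          "sink": { "type": "AzureSqlSink" }
        }
      },
      {
        "name": "UpdateWatermark",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [ { "activity": "CopyNewRows", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "storedProcedureName": "etl.usp_update_watermark",
          "storedProcedureParameters": {
            "table_name": { "value": "orders", "type": "String" },
            "new_watermark": {
              "value": { "value": "@pipeline().TriggerTime", "type": "Expression" },
              "type": "DateTime"
            }
          }
        }
      }
    ]
  }
}
```

Using pipeline().TriggerTime as both the copy's upper bound and the new watermark means any record committed while the copy was running falls into the next window instead of being silently skipped.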

Critical gotcha: The stored procedure MUST connect to the Copy Activity's On Success dependency, not run in parallel. If you update the watermark before the copy completes and the copy fails, you'll permanently skip those records. This is the single most common mistake in ADF incremental pipelines.

Handling Late-Arriving Data

Source systems don't always write records in timestamp order. A record created at 2:00 PM might not get committed until 2:15 PM due to long transactions. Add a buffer to your filter:

-- Instead of: WHERE modified_date > @lastWatermark
-- Use: WHERE modified_date > DATEADD(HOUR, -2, @lastWatermark)
--       AND modified_date <= @currentTime

This re-processes the last 2 hours of data on each run. To make that safe, use MERGE (upsert) at the target so re-processed records are updated in place rather than inserted twice.
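In ADF terms, that means switching the Copy Activity's sink to upsert mode. A sketch for an Azure SQL sink - the key column order_id is illustrative and must match your table's business key:

```json
{
  "sink": {
    "type": "AzureSqlSink",
    "writeBehavior": "upsert",
    "upsertSettings": {
      "useTempDB": true,
      "keys": [ "order_id" ]
    }
  }
}
```

As the quick answer warns, upsert mode runs MERGE under the hood, so avoid overlapping pipeline runs against the same target table.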

Pattern 2: Azure SQL Change Tracking

If your source is Azure SQL Database, you can use its built-in Change Tracking feature. CT records which rows changed (INSERT, UPDATE, DELETE) since a specific version number, without requiring a modified_date column on every table.

-- Enable change tracking on the database
ALTER DATABASE MyDB SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);

-- Enable on specific tables
ALTER TABLE dbo.orders ENABLE CHANGE_TRACKING
    WITH (TRACK_COLUMNS_UPDATED = ON);

In your ADF pipeline, the Lookup Activity reads the last-synced version from your control table, and the Copy Activity uses CHANGETABLE(CHANGES ...) to get only the changed rows. The advantage over watermark: CT captures deletes, which watermark-based approaches miss entirely.
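A sketch of what that Copy Activity source can look like, assuming an orders table keyed on order_id and a Lookup activity named LookupLastVersion that read the last-synced version from the control table (both names are illustrative):

```json
{
  "source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": {
      "value": "SELECT o.*, ct.SYS_CHANGE_OPERATION FROM CHANGETABLE(CHANGES dbo.orders, @{activity('LookupLastVersion').output.firstRow.last_sync_version}) AS ct LEFT OUTER JOIN dbo.orders AS o ON o.order_id = ct.order_id",
      "type": "Expression"
    }
  }
}
```

Rows with SYS_CHANGE_OPERATION = 'D' come back with NULL data columns - the row no longer exists in the source - which is how you propagate deletes to the target. Capture the new high-water mark with CHANGE_TRACKING_CURRENT_VERSION() before the copy and store it only after success, mirroring the watermark pattern.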

Tradeoff: Change Tracking has a retention period (3 days in the example above). If the gap between pipeline runs exceeds it, auto-cleanup purges change data you haven't synced yet, and your stored version falls below CHANGE_TRACKING_MIN_VALID_VERSION() - at that point the only safe recovery is a full reload. Check the minimum valid version at the start of each run and fall back to a full load whenever your stored version is older.

Pattern 3: Full CDC with Debezium via Event Hubs

For real-time or near-real-time incremental loads, Debezium captures every change from the database's transaction log and pushes it to a Kafka-compatible endpoint. In Azure, that endpoint is Event Hubs with the Kafka protocol enabled.

The architecture: Debezium runs as a connector in a Kafka Connect cluster (or Azure Container Instances), reads the SQL Server/PostgreSQL/MySQL transaction log, and writes change events to Event Hubs. ADF consumes from Event Hubs using an event-based trigger, processing each batch of changes as they arrive. This is the same CDC pattern we used in our metadata-driven ADF pipeline for API-to-Snowflake loads.
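A sketch of the Debezium SQL Server connector registration for this setup, using Debezium 2.x property names. The hostnames, credentials, and Event Hubs namespace are placeholders, and the Kafka Connect worker's SASL settings for Event Hubs are omitted:

```json
{
  "name": "sqlserver-orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "myserver.database.windows.net",
    "database.port": "1433",
    "database.user": "debezium_user",
    "database.password": "REPLACE_WITH_SECRET",
    "database.names": "MyDB",
    "topic.prefix": "sqlserver",
    "table.include.list": "dbo.orders,dbo.customers",
    "schema.history.internal.kafka.bootstrap.servers": "mynamespace.servicebus.windows.net:9093",
    "schema.history.internal.kafka.topic": "schema-changes.mydb"
  }
}
```

Note that Debezium's SQL Server connector requires native CDC to be enabled on the source database and on each captured table before the connector can read the transaction log.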

When to use this over watermark: you need sub-minute latency, you need to capture deletes, or your source tables don't have reliable modified_date columns.

Pattern 4: Tumbling Window Triggers

Tumbling window triggers are ADF's way of handling time-partitioned incremental loads. Each trigger execution gets a WindowStart and WindowEnd parameter representing a non-overlapping time slice. They're ideal for loading data that's naturally partitioned by date - daily log files, hourly event data, etc.

-- In your Copy Activity source query:
SELECT * FROM events
WHERE event_date >= '@{trigger().outputs.windowStartTime}'
  AND event_date < '@{trigger().outputs.windowEndTime}'

Tumbling windows handle backfills automatically - if the pipeline was down for 3 days, the trigger creates separate executions for each missed window and processes them in order. You can also set concurrency to control how many windows run in parallel.
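A sketch of an hourly tumbling window trigger definition (the trigger and pipeline names are illustrative):

```json
{
  "name": "tr_hourly_events",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Hour",
      "interval": 1,
      "startTime": "2025-01-01T00:00:00Z",
      "delay": "00:10:00",
      "maxConcurrency": 4,
      "retryPolicy": { "count": 2, "intervalInSeconds": 300 }
    },
    "pipeline": {
      "pipelineReference": { "referenceName": "pl_load_events", "type": "PipelineReference" },
      "parameters": {
        "windowStart": "@{trigger().outputs.windowStartTime}",
        "windowEnd": "@{trigger().outputs.windowEndTime}"
      }
    }
  }
}
```

Here delay holds each window open an extra 10 minutes before firing - a cheap defense against late-arriving data - and maxConcurrency caps how many backfill windows run in parallel.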

Gotcha: Tumbling window triggers have a 5-minute minimum interval. If you need more frequent runs, use a schedule trigger (supports down to 1-minute intervals) or an event-based trigger for near-real-time ingestion.

Advanced: Metadata-Driven Multi-Table Incremental Loads

Building a separate pipeline for each table doesn't scale. The metadata-driven pattern uses a single pipeline with a ForEach loop that iterates over a configuration table listing all source tables, their watermark columns, target destinations, and load patterns.

-- Metadata configuration table
CREATE TABLE etl.pipeline_config (
    source_schema VARCHAR(128),
    source_table VARCHAR(128),
    watermark_column VARCHAR(128),
    target_schema VARCHAR(128),
    target_table VARCHAR(128),
    load_type VARCHAR(20), -- 'incremental' or 'full'
    is_active BIT DEFAULT 1
);

INSERT INTO etl.pipeline_config VALUES
('dbo', 'orders', 'modified_date', 'staging', 'orders', 'incremental', 1),
('dbo', 'customers', 'updated_at', 'staging', 'customers', 'incremental', 1),
('dbo', 'products', NULL, 'staging', 'products', 'full', 1);

The ADF pipeline uses a Lookup to read etl.pipeline_config WHERE is_active = 1, then a ForEach activity iterates over the results. Inside the ForEach, parameterized Copy Activities dynamically set the source table, watermark filter, and target table based on each config row. One pipeline handles 50+ tables. For a detailed walkthrough of this pattern, see our guide on Azure SQL to Snowflake pipelines.
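The skeleton of that pipeline, abridged - dataset bindings and the incremental-versus-full branching are omitted, and all names are illustrative:

```json
{
  "name": "pl_metadata_driven_load",
  "properties": {
    "activities": [
      {
        "name": "GetTableList",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT * FROM etl.pipeline_config WHERE is_active = 1"
          },
          "dataset": { "referenceName": "ds_control_db", "type": "DatasetReference" },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachTable",
        "type": "ForEach",
        "dependsOn": [ { "activity": "GetTableList", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "items": { "value": "@activity('GetTableList').output.value", "type": "Expression" },
          "batchCount": 8,
          "activities": [
            {
              "name": "CopyOneTable",
              "type": "Copy",
              "typeProperties": {
                "source": {
                  "type": "AzureSqlSource",
                  "sqlReaderQuery": {
                    "value": "SELECT * FROM @{item().source_schema}.@{item().source_table}",
                    "type": "Expression"
                  }
                },
                "sink": { "type": "AzureSqlSink" }
              }
            }
          ]
        }
      }
    ]
  }
}
```

Set firstRowOnly to false or the Lookup returns only one config row. Two limits to watch: the Lookup activity returns at most 5,000 rows, and batchCount controls how many tables copy in parallel - keep it low if the targets share one SQL pool.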

Monitoring and Alerting

ADF pipelines fail silently if you don't set up monitoring. Configure ADF diagnostic settings to send pipeline run logs to Log Analytics, then create alerts for:

  1. Failed pipeline and activity runs - including activity failures inside ForEach loops, which are easy to miss at the pipeline level
  2. Pipeline durations well above their historical baseline - often the first sign a watermark filter stopped working and a load went full
  3. Copy Activities that succeed but move zero rows for several consecutive runs - a common symptom of a watermark that advanced past unloaded data

Key Takeaways

  1. The watermark pattern covers most use cases: store the last-loaded timestamp, filter on it, and update it only after a successful copy
  2. Use Change Tracking when you need deletes captured - and run the pipeline well within the retention period
  3. Use Debezium with Event Hubs when you need sub-minute latency or your tables lack reliable modified_date columns
  4. Tumbling window triggers handle backfills automatically for time-partitioned data
  5. A metadata-driven ForEach pipeline scales one pipeline definition to 50+ tables

CelestInfo Engineering Team

We build data pipelines on Azure, Snowflake, and Databricks. If your incremental loads are breaking at 3am, we've probably seen the same problem.

Frequently Asked Questions

What is the minimum interval for tumbling window triggers in Azure Data Factory?

Tumbling window triggers have a minimum interval of 5 minutes. If you need more frequent incremental loads, use a schedule trigger with a shorter interval or consider event-based triggers with Event Grid for near-real-time ingestion.

Should I update the watermark before or after the Copy Activity in ADF?

Always update the watermark AFTER a successful Copy Activity, not before. If you update the watermark first and the copy fails, you'll skip those records on the next run because the watermark already advanced past them. Use ADF's activity dependency (On Success) to ensure the stored procedure that updates the watermark only runs after a successful copy.

Can the ADF Copy Activity upsert mode cause deadlocks?

Yes. The Copy Activity's upsert mode for SQL targets uses MERGE statements under the hood. If multiple pipeline runs execute concurrent MERGE operations on the same target table, SQL Server can deadlock. Avoid overlapping pipeline runs on the same table, or use staging tables with a separate MERGE step that locks explicitly.

How do I handle late-arriving data in ADF incremental loads?

Add a buffer to your watermark filter. Instead of filtering WHERE modified_date > @lastWatermark, use WHERE modified_date > DATEADD(hour, -2, @lastWatermark). This re-processes the last 2 hours of data on each run. To make this idempotent, use MERGE (upsert) at the target so duplicate records are updated rather than inserted twice.
