From REST API to Snowflake: Building a Metadata-Driven Azure Data Factory Pipeline
Last updated: January 2026
Quick answer: Build a metadata-driven ADF pipeline by storing API configurations (URL, endpoint, target table) in a Snowflake metadata table. Use a Lookup activity to read configurations, a ForEach activity to iterate over each API, a Copy activity to fetch JSON responses into ADLS Gen2, and Script activities to dynamically create Snowflake VARIANT tables before a second Copy activity loads the JSON data. Add new APIs by inserting rows into the metadata table -- no pipeline changes needed.
Introduction
A metadata-driven Azure Data Factory pipeline automates REST API ingestion into Snowflake by decoupling pipeline logic from source configuration. If you are new to ADF, start with our beginner guide to creating a pipeline in Azure Data Factory. Instead of building separate ADF pipelines for each API, this approach reads API configurations from a Snowflake metadata table, dynamically fetches data, and loads it into Snowflake VARIANT columns.
This metadata-driven ADF pipeline tutorial demonstrates how to build a reusable pipeline that automatically fetches data from multiple REST APIs, stores JSON responses in ADLS Gen2, creates Snowflake tables dynamically, and supports incremental loads with comprehensive execution logging.
By the end of this tutorial, you will have a reusable, production-ready pipeline architecture that supports incremental loads through time-window parameters, comprehensive execution logging, and full error handling with success and failure tracking. For another Fabric-based approach, explore our Azure SQL to Snowflake via Fabric and OneLake pipeline.
Architecture Overview
The end-to-end data flow follows this path:
REST API → ADLS Gen2 (JSON Files) → Snowflake (VARIANT Table)
The pipeline executes the following steps in sequence:
- Read Metadata: API details (URL, endpoint, target table name) are stored in a Snowflake metadata table called API_METADATA.
- ADF Lookup Activity: Reads all active API configurations from the metadata table.
- ForEach Activity: Iterates over each API configuration returned by the Lookup.
- REST Copy Activity: Fetches the API response using the Azure Data Factory REST connector.
- Save to ADLS: Stores the raw JSON response as a file in Azure Data Lake Storage Gen2.
- Dynamic Table Creation: Uses a Script Activity to create the Snowflake target table if it does not already exist.
- Load into Snowflake: A Copy Activity loads the JSON file from ADLS into the Snowflake VARIANT column.
- Log Execution Status: Script Activities log success or failure into an execution log table.
- Time Window Control: A pipeline parameter called END_TS controls the time window for incremental loads.
Linked Services
The pipeline requires three linked services to connect ADF to external systems:
| Linked Service | Type | Purpose |
|---|---|---|
| REST Linked Service | REST | Connects to REST APIs using base URL and authentication settings. The base URL is parameterized so it can be dynamically set per API from the metadata table. |
| Snowflake Linked Service | Snowflake | Connects to your Snowflake account using account name, database, warehouse, and authentication credentials. Used for both reading metadata and writing data. |
| Azure Blob Storage Linked Service | Azure Blob Storage | Connects to ADLS Gen2 via SAS URL or service principal. Used as the intermediate staging area for JSON files before loading into Snowflake. |
Datasets
Four datasets are required to support the pipeline activities:
| Dataset | Format | Parameter | Purpose |
|---|---|---|---|
| JSON Dataset | JSON | FILENAME | Points to the ADLS location where JSON responses are saved. The filename is dynamically set based on the API name and timestamp. |
| REST Dataset | REST | BaseURL | Represents the REST API source. The BaseURL parameter is populated from the metadata table for each API. |
| Snowflake Source Dataset | Snowflake | None | Used by the Lookup activity to read from the API_METADATA table. |
| Snowflake Sink Dataset | Snowflake | TABLENAME | Used by the Copy Activity to load JSON data into the dynamically specified Snowflake table. |
Pipeline Design
Pipeline Parameter
The pipeline accepts one parameter:
- END_TS (String): Defines the end timestamp for the data extraction window. This enables incremental loading by allowing the pipeline to filter API requests based on a time range.
Activity Flow
The pipeline is organized as follows:
1. Lookup Activity – Read API Metadata
The Lookup activity queries the API_METADATA table in Snowflake to retrieve all active API configurations. It uses a query like:
SELECT API_ID, API_NAME, BASE_URL, ENDPOINT, TABLE_NAME
FROM API_METADATA
WHERE IS_ACTIVE = TRUE;
The output is a list of API records that the ForEach activity will iterate over.
2. ForEach Activity – Iterate Over Each API
The ForEach activity loops through each record returned by the Lookup. Inside the ForEach, the following activities are chained together:
2a. Script Activity (LOGSTART) – Log Start of Execution
Before fetching any data, a Script Activity inserts a log record into the EXECUTION_LOG table with a status of RUNNING:
INSERT INTO EXECUTION_LOG (API_NAME, TABLE_NAME, STATUS, START_TIME)
VALUES (
'@{item().API_NAME}',
'@{item().TABLE_NAME}',
'RUNNING',
CURRENT_TIMESTAMP()
);
2b. Copy Activity (REST to JSON) – Fetch API and Save to ADLS
This Copy Activity uses the REST dataset as the source and the JSON dataset as the sink. The source URL is constructed dynamically by combining the BASE_URL and ENDPOINT fields from the metadata. The JSON file is saved to ADLS Gen2 with a filename based on the API name and current timestamp, ensuring each execution creates a unique file.
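The dynamic values can be wired up with ADF dynamic content expressions. A sketch of what those expressions might look like -- the `end_ts` query-parameter name is an assumption here and depends entirely on the API you are calling:

```text
BaseURL (REST dataset):   @{item().BASE_URL}
Relative URL (source):    @{concat(item().ENDPOINT, '?end_ts=', pipeline().parameters.END_TS)}
FILENAME (JSON dataset):  @{concat(item().API_NAME, '_', utcNow('yyyyMMddHHmmss'), '.json')}
```

The `utcNow()` format string guarantees a unique filename per run, which keeps earlier raw JSON files in ADLS intact for reprocessing or auditing.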
2c. Script Activity (TABLE CREATION) – Create Snowflake Table Dynamically
A Script Activity executes a CREATE TABLE IF NOT EXISTS statement in Snowflake. The table uses a single VARIANT column to store the entire JSON response:
CREATE TABLE IF NOT EXISTS @{item().TABLE_NAME} (
RAW_DATA VARIANT,
LOAD_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
SOURCE_FILE STRING
);
This eliminates the need to manually create tables before running the pipeline. Each new API added to the metadata table will automatically have its target table created on first execution.
2d. Copy Activity (JSON to Snowflake) – Load JSON into VARIANT Column
This Copy Activity reads the JSON file from ADLS and loads it into the Snowflake table created in the previous step. The Snowflake sink dataset uses the TABLENAME parameter, which is dynamically set to @{item().TABLE_NAME}. The JSON data is mapped to the RAW_DATA VARIANT column, preserving the full JSON structure.
2e. Script Activity (UPDATE LOG – SUCCESS)
On the success path of the Copy Activity, this Script Activity updates the execution log to mark the run as completed:
UPDATE EXECUTION_LOG
SET STATUS = 'SUCCESS',
END_TIME = CURRENT_TIMESTAMP()
WHERE API_NAME = '@{item().API_NAME}'
AND STATUS = 'RUNNING';
2f. Script Activity (UPDATE LOG – FAIL)
On the failure path (connected via the On Failure dependency), this Script Activity captures the error and marks the run as failed:
UPDATE EXECUTION_LOG
SET STATUS = 'FAILED',
END_TIME = CURRENT_TIMESTAMP(),
ERROR_MESSAGE = '@{activity('Copy_REST_to_JSON').error.message}'
WHERE API_NAME = '@{item().API_NAME}'
AND STATUS = 'RUNNING';
Metadata Table Structure
The API_METADATA table stores all configuration needed for each API source. Adding a new API requires only inserting a new row into this table:
CREATE TABLE API_METADATA (
API_ID INT,
API_NAME STRING,
BASE_URL STRING,
ENDPOINT STRING,
TABLE_NAME STRING,
IS_ACTIVE BOOLEAN,
LAST_RUN_TS TIMESTAMP
);
| Column | Description |
|---|---|
| API_ID | Unique identifier for the API source |
| API_NAME | Descriptive name of the API (e.g., "WeatherAPI", "ExchangeRates") |
| BASE_URL | The root URL of the API (e.g., "https://api.example.com") |
| ENDPOINT | The specific API endpoint path (e.g., "/v1/data") |
| TABLE_NAME | Snowflake target table name where data will be loaded |
| IS_ACTIVE | Boolean flag to enable or disable an API without deleting the row |
| LAST_RUN_TS | Timestamp of the last successful execution for incremental tracking |
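Onboarding a new source then reduces to a single insert. A sketch reusing the placeholder values from the column descriptions above (WeatherAPI, https://api.example.com, /v1/data, and a hypothetical WEATHER_RAW target table):

```sql
-- Hypothetical example: register a new API source.
-- All values below are placeholders, not part of the pipeline itself.
INSERT INTO API_METADATA
    (API_ID, API_NAME, BASE_URL, ENDPOINT, TABLE_NAME, IS_ACTIVE)
VALUES
    (1, 'WeatherAPI', 'https://api.example.com', '/v1/data', 'WEATHER_RAW', TRUE);
```

On the next pipeline run, the Lookup activity picks up this row, the Script Activity creates WEATHER_RAW if it does not exist, and the API is ingested with no ADF changes.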
Execution Log Table
The EXECUTION_LOG table captures the runtime status of every pipeline execution, providing complete observability:
CREATE TABLE EXECUTION_LOG (
LOG_ID INT AUTOINCREMENT,
API_NAME STRING,
TABLE_NAME STRING,
STATUS STRING,
START_TIME TIMESTAMP,
END_TIME TIMESTAMP,
ERROR_MESSAGE STRING
);
| Column | Description |
|---|---|
| LOG_ID | Auto-incrementing primary key for each log entry |
| API_NAME | Name of the API that was processed |
| TABLE_NAME | Target Snowflake table for this API |
| STATUS | Execution status: RUNNING, SUCCESS, or FAILED |
| START_TIME | Timestamp when the pipeline started processing this API |
| END_TIME | Timestamp when the processing completed or failed |
| ERROR_MESSAGE | Captured error message if the execution failed (NULL on success) |
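With this structure in place, a simple query gives operations teams a view of recent failures. One possible monitoring query:

```sql
-- Most recent failed runs with their captured error messages.
SELECT API_NAME, TABLE_NAME, START_TIME, END_TIME, ERROR_MESSAGE
FROM EXECUTION_LOG
WHERE STATUS = 'FAILED'
ORDER BY START_TIME DESC
LIMIT 20;
```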
Key Benefits
- Metadata-Driven Configuration: Add new REST APIs by simply inserting a row into the API_METADATA table. No pipeline modifications, no redeployment, and no code changes are required.
- Full Error Handling with Execution Logging: Every pipeline run is tracked with start time, end time, status, and error messages. The On Failure path ensures that failures are captured and logged rather than silently ignored.
- Dynamic Table Creation: The pipeline automatically creates Snowflake tables on first execution, eliminating the need for manual DDL scripts and reducing operational overhead.
- VARIANT Column Preserves Full JSON Structure: By storing API responses in Snowflake VARIANT columns, you preserve the complete JSON hierarchy. This allows flexible downstream querying using Snowflake dot notation and LATERAL FLATTEN without defining a rigid schema upfront.
- Time-Window Parameter for Incremental Loads: The END_TS pipeline parameter supports incremental data extraction, allowing the pipeline to fetch only new data since the last run.
- Scalable and Maintainable: The ForEach pattern scales horizontally as you add more APIs. The separation of configuration from logic makes the pipeline easy to maintain and extend.
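To illustrate the VARIANT benefit, here is a sketch of a downstream query. It assumes a hypothetical WEATHER_RAW table whose RAW_DATA column holds a payload shaped like `{"records": [{"city": "Oslo", "temp_c": 3}, ...]}`; adjust the paths to your actual API response:

```sql
-- Flatten a nested JSON array stored in a VARIANT column.
SELECT
    r.value:city::STRING  AS city,
    r.value:temp_c::FLOAT AS temp_c,
    t.LOAD_TIMESTAMP
FROM WEATHER_RAW t,
     LATERAL FLATTEN(input => t.RAW_DATA:records) r;
```

No schema changes are needed if the API later adds fields; new keys simply become available through the same dot-notation paths.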
Conclusion
Building a metadata-driven Azure Data Factory pipeline for REST API ingestion into Snowflake provides a scalable, maintainable, and production-ready architecture for enterprise data integration. By storing API configurations in a metadata table, the pipeline eliminates hardcoded values and makes it trivial to onboard new data sources.
The combination of ADF's ForEach activity for parallel processing, ADLS Gen2 for intermediate JSON storage, and Snowflake's VARIANT data type for flexible semi-structured data handling creates a robust end-to-end data ingestion framework. The comprehensive execution logging with success and failure tracking ensures full observability and simplifies troubleshooting.
This pattern is particularly valuable for organizations that need to ingest data from dozens or hundreds of REST APIs, where maintaining individual pipelines for each source would be impractical. With this metadata-driven approach, your data engineering team can focus on data quality and analytics rather than pipeline maintenance.
Frequently Asked Questions
Q: What is a metadata-driven pipeline in Azure Data Factory?
A metadata-driven pipeline reads its configuration (API URLs, table names, parameters) from a metadata table rather than hardcoding values. This allows you to add new data sources by inserting rows into the metadata table without modifying the pipeline. In this architecture, the API_METADATA table in Snowflake stores all API configurations, and the ADF Lookup activity reads these at runtime to drive the ForEach processing loop.
Q: Why use VARIANT columns in Snowflake for JSON data?
Snowflake's VARIANT data type natively stores semi-structured data like JSON. It preserves the full JSON structure and allows you to query nested fields using dot notation and LATERAL FLATTEN, eliminating the need to define a rigid schema upfront. This is especially useful for REST API responses where the JSON structure may vary between APIs or change over time without requiring table schema modifications.
Q: How does the pipeline handle API failures?
The pipeline uses ADF's built-in error handling with On Failure dependency paths. If the REST Copy Activity or the Snowflake load Copy Activity fails, execution flows to a failure Script Activity that logs the error message and marks the status as FAILED in the EXECUTION_LOG table. This allows operations teams to monitor failures, investigate root causes, and retry specific APIs without rerunning the entire pipeline.