Monitoring Your Data Pipelines: What to Track and How to Set Up Alerts That Actually Help
Last updated: July 2025
Quick answer: Every data pipeline should track 4 things: freshness (is data arriving on time), volume (are row counts within expected ranges), schema (have columns changed), and quality (null rates, uniqueness, range validation). Start with dbt tests and Snowflake's ACCOUNT_USAGE views for zero-cost monitoring. Avoid alert fatigue by only alerting on the critical path first, using tiered severity, and reviewing thresholds monthly. If an alert fires 3+ times without requiring action, it's noise.
Introduction
Most teams discover pipeline problems the same way: someone on the business team opens a dashboard, sees yesterday's numbers are missing, and sends a Slack message asking what happened. By then, the data has been stale for hours and the root cause is buried somewhere in a chain of 12 transformations. Pipeline monitoring flips this: you find out before the business team does, ideally before the data even reaches a dashboard. This guide covers what to monitor, how to implement it with tools you're probably already using, and how to avoid the alert fatigue that makes teams ignore monitoring entirely. For the broader governance picture, see our governance, security, and cost optimization guide.
The 4 Pillars of Pipeline Monitoring
1. Freshness: Is the Data Arriving on Time?
Freshness monitoring checks whether data has been updated within an expected window. Your orders table should have new rows from the last hour. Your daily aggregation should have today's date as the most recent partition. If it doesn't, something's broken upstream.
-- Snowflake: Check when each table was last loaded
SELECT
    table_schema,
    table_name,
    DATEDIFF('hour', last_altered, CURRENT_TIMESTAMP()) AS hours_since_update
FROM information_schema.tables
WHERE table_schema = 'MARTS'
  AND DATEDIFF('hour', last_altered, CURRENT_TIMESTAMP()) > 6
ORDER BY hours_since_update DESC;
In dbt, use dbt source freshness with your sources.yml configuration:
# sources.yml
sources:
  - name: raw
    database: production
    schema: raw_data
    tables:
      - name: orders
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}
Running dbt source freshness checks the max _loaded_at value. If it's older than 6 hours, you get a warning. Older than 12 hours, it's an error. Schedule this on a cron job and pipe failures to Slack.
2. Volume: Are Row Counts Within Expected Ranges?
A pipeline that succeeds but loads 0 rows is worse than one that fails - at least failures are obvious. Volume monitoring catches silent data loss: the API returned an empty response, a WHERE clause filtered everything out, or a source table was truncated.
-- Snowflake: Compare today's row count to the 7-day average
WITH daily_counts AS (
    SELECT
        DATE(loaded_at) AS load_date,
        COUNT(*)        AS row_count
    FROM staging.orders
    WHERE loaded_at > DATEADD('day', -8, CURRENT_DATE())
    GROUP BY 1
),
avg_count AS (
    SELECT AVG(row_count) AS avg_7d
    FROM daily_counts
    WHERE load_date < CURRENT_DATE()
)
SELECT
    dc.load_date,
    dc.row_count,
    ac.avg_7d,
    ROUND((dc.row_count - ac.avg_7d) / ac.avg_7d * 100, 1) AS pct_deviation
FROM daily_counts dc
CROSS JOIN avg_count ac
WHERE dc.load_date = CURRENT_DATE()
  AND ABS((dc.row_count - ac.avg_7d) / ac.avg_7d) > 0.5; -- Alert if >50% deviation
-- Note: a day with zero loaded rows produces no row in daily_counts at all,
-- so also treat "query returned nothing for today" as an alert condition.
The 50% threshold is a starting point. Tune it based on your data's natural variance. E-commerce orders might swing 30% between weekdays and weekends - don't alert on that. But a 50% drop on a Tuesday is suspicious.
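If weekday/weekend swings cause false positives, compare today's count against the average for the same day of week rather than a rolling average. A minimal sketch, reusing the staging.orders table and loaded_at column from above and looking back four weeks (both choices are illustrative, not a prescription):
-- Sketch: compare today's count to the average of recent same weekdays
WITH daily_counts AS (
    SELECT
        DATE(loaded_at)      AS load_date,
        DAYOFWEEK(loaded_at) AS day_of_week,
        COUNT(*)             AS row_count
    FROM staging.orders
    WHERE loaded_at > DATEADD('day', -29, CURRENT_DATE())
    GROUP BY 1, 2
)
SELECT
    today.load_date,
    today.row_count,
    ROUND(AVG(hist.row_count), 0) AS same_weekday_avg
FROM daily_counts today
JOIN daily_counts hist
  ON hist.day_of_week = today.day_of_week
 AND hist.load_date < today.load_date
WHERE today.load_date = CURRENT_DATE()
GROUP BY 1, 2
HAVING ABS(today.row_count - AVG(hist.row_count)) / AVG(hist.row_count) > 0.5;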
3. Schema: Have Columns Changed?
Schema changes in source systems are the silent killer of data pipelines. A developer adds a column to the source API, renames a field, or changes a data type. Your pipeline doesn't fail immediately - it might succeed and quietly drop the new column, or coerce the type incorrectly. You find out weeks later when an analyst reports wrong numbers.
# schema.yml - dbt tests asserting expected columns exist
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: order_date
        tests:
          - not_null
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
For proactive schema change detection, query Snowflake's INFORMATION_SCHEMA.COLUMNS periodically and compare against a snapshot from the previous run. Any new columns, removed columns, or type changes get flagged before they cause downstream issues.
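A minimal sketch of that comparison, assuming a snapshot table (monitoring.columns_snapshot, a name chosen purely for illustration) that you refresh from INFORMATION_SCHEMA.COLUMNS at the end of each run:
-- Sketch: flag added, removed, or retyped columns since the last snapshot
WITH current_columns AS (
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'RAW_DATA'  -- the source schema you want to watch
)
SELECT
    COALESCE(cur.table_name, snap.table_name)   AS table_name,
    COALESCE(cur.column_name, snap.column_name) AS column_name,
    snap.data_type AS previous_type,
    cur.data_type  AS current_type,
    CASE
        WHEN snap.column_name IS NULL THEN 'added'
        WHEN cur.column_name  IS NULL THEN 'removed'
        ELSE 'type_changed'
    END AS change_type
FROM current_columns cur
FULL OUTER JOIN monitoring.columns_snapshot snap
  ON cur.table_name  = snap.table_name
 AND cur.column_name = snap.column_name
WHERE snap.column_name IS NULL
   OR cur.column_name IS NULL
   OR cur.data_type <> snap.data_type;
After reviewing any flagged changes, replace the snapshot (for example, CREATE OR REPLACE TABLE monitoring.columns_snapshot from the same INFORMATION_SCHEMA query) so the next run compares against the new baseline.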
4. Data Quality: Are Values Valid?
Quality checks validate that data values make sense: null rates are within tolerance, foreign key relationships hold, numeric values fall within expected ranges, and categorical columns contain expected values.
# schema.yml - dbt tests for quality
models:
  - name: mart_orders
    columns:
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
      - name: order_total
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
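dbt's not_null test fails on a single null. If some nulls are expected and you only care when the rate drifts, a tolerance-based check is easy to write in SQL - a minimal sketch against mart_orders, with a 2% tolerance chosen purely for illustration:
-- Sketch: alert when the null rate of customer_id exceeds 2%
SELECT
    COUNT(*)                      AS total_rows,
    COUNT_IF(customer_id IS NULL) AS null_rows,
    COUNT_IF(customer_id IS NULL) / NULLIF(COUNT(*), 0) AS null_rate
FROM mart_orders
HAVING COUNT_IF(customer_id IS NULL) / NULLIF(COUNT(*), 0) > 0.02;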
Tool-Specific Monitoring
Snowflake: ACCOUNT_USAGE Views
Snowflake's ACCOUNT_USAGE schema has everything you need for warehouse-level monitoring. Key views:
- QUERY_HISTORY - every query run against the account, including duration, rows produced, bytes scanned, and warehouse used
- COPY_HISTORY - every COPY INTO operation, including row counts and error counts
- TASK_HISTORY - execution history for Snowflake Tasks (scheduled SQL), including success/failure status
- WAREHOUSE_METERING_HISTORY - credit consumption per warehouse, useful for cost monitoring
-- Find failed tasks in the last 24 hours
SELECT
    name,
    scheduled_time,
    error_code,
    error_message
FROM snowflake.account_usage.task_history
WHERE state = 'FAILED'
  AND scheduled_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
ORDER BY scheduled_time DESC;
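COPY_HISTORY supports a similar check for loads that completed with errors. One caution: ACCOUNT_USAGE views lag real time (roughly 45 minutes up to a few hours depending on the view), so they suit periodic detection rather than instant alerting. A sketch - verify the column names against your account before relying on it:
-- Find COPY INTO loads with errors in the last 24 hours
SELECT
    table_name,
    file_name,
    last_load_time,
    row_count,
    error_count,
    first_error_message
FROM snowflake.account_usage.copy_history
WHERE error_count > 0
  AND last_load_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
ORDER BY last_load_time DESC;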
dbt: Tests + Source Freshness
dbt's built-in testing framework is the cheapest monitoring you can set up. Run dbt test after every dbt run and pipe failures to your alerting system. For source freshness, schedule dbt source freshness independently of your transformation runs - you want to know if sources are stale even when the pipeline isn't running.
Great Expectations: Python Pipeline Quality
If your pipeline is Python-based (Airflow, Prefect, custom scripts), Great Expectations provides a framework for defining and running data quality checks. You define "expectations" (e.g., column X should never be null, column Y should have between 1K and 100K distinct values) and run them as a validation step in your pipeline.
Monte Carlo / Elementary: Automated Observability
For teams with 50+ tables and multiple data consumers, dedicated observability tools automate what you'd otherwise build manually. Monte Carlo (SaaS) uses ML to learn your data's normal patterns and alerts on anomalies without manual threshold configuration. Elementary (open-source, dbt-native) adds anomaly detection and monitoring dashboards directly to your dbt project. The tradeoff is cost (Monte Carlo) or setup complexity (Elementary) versus the manual effort of maintaining custom monitoring queries.
Alert Strategy: Avoiding Alert Fatigue
Alert fatigue is when your team starts ignoring alerts because too many of them are false positives or low priority. It happens fast - one noisy alert channel and people mute it within a week. To avoid this:
- Start with the critical path only. Monitor only the tables that directly feed business dashboards or customer-facing features. Everything else can wait.
- Use tiered severity. Warnings go to a Slack channel. Critical alerts page the on-call engineer via PagerDuty or Opsgenie. Don't page for a warning.
- Escalate gradually. First alert is a Slack message. If unresolved for 30 minutes, escalate to a DM. Still unresolved after an hour, page.
- Review monthly. Any alert that fired 3+ times in a month without requiring action needs its threshold adjusted or should be removed entirely.
- Write runbooks. Every alert should link to a runbook that describes: what triggered it, likely root causes, and step-by-step resolution. If the on-call engineer has to investigate from scratch every time, the alert isn't complete.
Practical Example: Monitoring a 5-Table Pipeline
Here's what monitoring looks like for a real pipeline: source (raw.orders, raw.customers) through staging (stg_orders, stg_customers) to a mart (mart_order_summary). For dashboards that scale with growth, you need this kind of monitoring underneath.
- Source freshness: dbt source freshness checks raw.orders (warn: 2 hours, error: 6 hours) and raw.customers (warn: 24 hours, error: 48 hours)
- Volume checks: Custom Snowflake query comparing today's row counts in stg_orders against the 7-day average. Alert if deviation exceeds 40%
- Schema tests: dbt tests on stg_orders and stg_customers asserting expected columns exist and have the correct types
- Quality tests: dbt tests on mart_order_summary: order_total > 0, customer_id not null and exists in dim_customers, status is one of the expected values
- Alerting: dbt test failures post to #data-alerts in Slack. Source freshness errors page the on-call via PagerDuty after 30 minutes unresolved
Key Takeaways
- Monitor 4 things: freshness, volume, schema, and quality - together they catch the vast majority of pipeline issues
- dbt tests + dbt source freshness give you zero-cost monitoring for transformation pipelines
- Snowflake's ACCOUNT_USAGE views (QUERY_HISTORY, TASK_HISTORY, COPY_HISTORY) cover warehouse-level monitoring
- Volume anomaly detection (compare today vs. 7-day average) catches silent data loss that success/failure status misses
- Only alert on the critical path; use tiered severity; review thresholds monthly
- Every alert needs a runbook - if the on-call engineer has to investigate from scratch, the monitoring isn't finished
- Dedicated tools (Monte Carlo, Elementary) make sense at scale (50+ tables); below that, dbt tests and custom queries work fine
CelestInfo Engineering Team
We build and monitor data pipelines so you find out about problems before your stakeholders do.
Frequently Asked Questions
What are the 4 key metrics to monitor in any data pipeline?
Every data pipeline should track freshness (is data arriving on time), volume (are row counts within expected ranges), schema (have columns been added, removed, or changed type), and quality (null rates, uniqueness, range validation). Freshness and volume catch most pipeline failures. Schema monitoring prevents silent downstream breakage. Quality checks catch data corruption.
How do I avoid alert fatigue with data pipeline monitoring?
Start by only alerting on the critical path - the tables that directly feed business dashboards or customer-facing applications. Use tiered severity (warning vs critical) and escalation rules. Send warnings to Slack, page only for critical issues. Review and tune alert thresholds monthly. If an alert fires more than 3 times without requiring action, it's noise and should be removed or adjusted.
What is dbt source freshness and how does it work?
dbt source freshness checks whether source tables have been updated recently. You define an expected freshness window (e.g., loaded_at_field should be no older than 6 hours) in your sources.yml file, and dbt compares the max value of that field against the current timestamp. If the data is staler than the threshold, dbt raises a warning or error.
Should I use a dedicated data observability tool like Monte Carlo or Elementary?
It depends on scale. If you have fewer than 20 critical tables, dbt tests plus custom Snowflake queries cover most needs at zero additional cost. Above 50 tables with multiple teams depending on the data, a dedicated tool like Monte Carlo (SaaS, ML-based anomaly detection) or Elementary (open-source, dbt-native) reduces the maintenance burden significantly and catches issues that manual thresholds miss.
