Building a Data Quality Framework: From Random Checks to Systematic Trust

Celestinfo Software Solutions Pvt. Ltd. Jul 10, 2025

Quick answer: Stop writing one-off SQL checks. Build systematic data quality using four layers: schema tests (dbt generic tests for nulls, uniqueness, referential integrity), custom business logic tests (row count ranges, metric thresholds), automated profiling (detect schema drift and distribution changes), and freshness monitoring (is the data actually up to date?). Focus tests on the 20–30 columns that drive business decisions, not every field in every table. The biggest threat to your quality program isn't bad data - it's false positive alerts that make your team stop paying attention.

Last updated: August 2025

Why Ad-Hoc Quality Checks Always Fail

Here's the pattern we see in every company without a quality framework: someone finds a data issue, writes a SQL query to check for it, runs it once, pastes the result in Slack, and moves on. That query never runs again. Two months later, the same issue resurfaces - or a worse variant of it - and the cycle repeats.

Ad-hoc checks fail because they're reactive, undocumented, and unmaintained. They exist in someone's query history or a random Slack thread. They don't run automatically, they don't alert anyone, and they don't block bad data from reaching dashboards. You need a system, not a habit.

The Six Dimensions of Data Quality

Before building checks, you need a shared vocabulary for what "quality" means. There are six standard dimensions:

- Completeness: are all expected records present?
- Accuracy: do values reflect reality?
- Consistency: do related fields agree across tables?
- Timeliness: is data fresh enough for its use case?
- Uniqueness: are there duplicate records?
- Validity: do values conform to expected formats and ranges?

Not all dimensions matter equally for every table. Revenue tables need accuracy and completeness above all. User tables need uniqueness. Event tables need timeliness. Prioritize based on business impact.

Building Systematic Checks

Layer 1: Schema-Level Tests (dbt Generic Tests)

If you're using dbt for transformations, you already have a testing framework built in. The four generic tests cover the basics:

# schema.yml
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled']

These run during every dbt build and block downstream models when they fail. That's the key difference from ad-hoc checks - failures stop the pipeline, not just generate a warning.

Layer 2: Custom Business Logic Tests

Generic tests catch structural issues. Custom tests catch business logic violations. In dbt, these are SQL files in your tests/ directory that return failing rows:

-- tests/assert_order_total_matches_line_items.sql
SELECT order_id, order_total, line_item_sum
FROM (
  SELECT
    o.order_id,
    o.order_total,
    SUM(li.amount) AS line_item_sum
  FROM {{ ref('fct_orders') }} o
  JOIN {{ ref('fct_line_items') }} li ON o.order_id = li.order_id
  GROUP BY 1, 2
) AS totals
WHERE ABS(order_total - line_item_sum) > 0.01

Other custom tests worth implementing: daily row count within expected range (catches both missing and duplicated loads), metric values within N standard deviations of the 30-day average, and cross-table consistency checks.
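The "within N standard deviations" check can be sketched in Python. `is_anomalous` and the sample history below are hypothetical; in practice this logic would live in a dbt test or a monitoring job querying daily row counts:

```python
from statistics import mean, stdev

def is_anomalous(history, today, n_sigma=3.0):
    """Flag a daily metric that lands more than n_sigma standard
    deviations from its trailing average (hypothetical helper)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > n_sigma * sigma

# 30 days of stable daily row counts, then a load that doubled them
history = [1000, 1020, 980, 1010, 995] * 6
print(is_anomalous(history, 2100))  # prints True: a duplicated load stands out
```

The 3-sigma default is a starting point, not a law; loosen it on noisy metrics before it generates the false positives discussed below.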

Layer 3: Automated Profiling and Anomaly Detection

Tests catch known failure modes. Profiling catches unknown ones. Automated profiling monitors:

- Schema drift: columns added, dropped, or changing type between loads
- Null rates and distinct counts shifting beyond their historical range
- Value distributions (min/max, averages, category frequencies) drifting from baseline
- Row counts per load deviating from trend
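A minimal profiling sketch in Python, assuming rows arrive as dicts; `profile` and `drift_report` are illustrative names, not part of any tool:

```python
def profile(rows, columns):
    """Compute simple per-column stats: null rate and distinct count."""
    n = len(rows)
    stats = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        stats[col] = {
            "null_rate": 1 - len(non_null) / n if n else 0.0,
            "distinct": len(set(non_null)),
        }
    return stats

def drift_report(baseline, current, null_rate_jump=0.10):
    """Compare two profiles: new/dropped columns and null-rate jumps."""
    issues = []
    for col in current.keys() - baseline.keys():
        issues.append(f"new column: {col}")
    for col in baseline.keys() - current.keys():
        issues.append(f"dropped column: {col}")
    for col in baseline.keys() & current.keys():
        jump = current[col]["null_rate"] - baseline[col]["null_rate"]
        if jump > null_rate_jump:
            issues.append(f"null rate jump on {col}: +{jump:.0%}")
    return issues
```

Store yesterday's profile, compute today's, and alert on the diff; commercial observability tools do essentially this at scale.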

Layer 4: Freshness Monitoring

The most overlooked dimension. Your dashboard shows "data updated today" but the underlying table was actually last loaded 3 days ago because the pipeline silently failed. dbt has built-in source freshness checks:

# sources.yml
version: 2
sources:
  - name: raw_stripe
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: charges  # example table; source-level freshness settings are inherited

Run these with `dbt source freshness`; the command exits non-zero when any source crosses its error_after threshold, so it can fail a scheduled job like any other test.
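Outside dbt, the same warn/error logic is easy to replicate in an orchestration task. A sketch, with `freshness_status` as a hypothetical helper fed by a `MAX(_loaded_at)` query:

```python
from datetime import datetime, timedelta, timezone

def freshness_status(loaded_at, now=None,
                     warn_after=timedelta(hours=12),
                     error_after=timedelta(hours=24)):
    """Mirror dbt's warn/error thresholds for one source table."""
    now = now or datetime.now(timezone.utc)
    age = now - loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"

now = datetime(2025, 7, 10, 12, 0, tzinfo=timezone.utc)
print(freshness_status(now - timedelta(hours=3), now))   # prints "pass"
print(freshness_status(now - timedelta(hours=30), now))  # prints "error"
```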

Alerting Strategy: Don't Page at 2am for a Warning

Not all quality issues deserve the same response. Define severity levels and route them differently:

- Critical: failures on primary keys or revenue-critical tables. Block the pipeline and page on-call.
- Warning: threshold breaches and freshness warnings. Post to a Slack channel reviewed during business hours.
- Info: minor anomalies worth logging but not alerting on. Review in a weekly sweep.
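One way to make such a policy explicit is a severity-to-routing table; the channel names and `route_alert` function below are hypothetical placeholders:

```python
# Hypothetical severity policy: channels and actions are placeholders.
SEVERITY_ROUTING = {
    "critical": {"block_pipeline": True, "page_oncall": True, "channel": "#data-incidents"},
    "warning": {"block_pipeline": False, "page_oncall": False, "channel": "#data-quality"},
    "info": {"block_pipeline": False, "page_oncall": False, "channel": None},
}

def route_alert(check_name, severity):
    """Return the actions to take for a failed check at this severity."""
    policy = SEVERITY_ROUTING[severity]
    return {"check": check_name, **policy}

print(route_alert("fct_orders.order_id.unique", "critical")["page_oncall"])  # prints True
```

Keeping the policy in one data structure (or one config file) means a noisy check can be demoted in a one-line change instead of a debate.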

The Data Quality Scorecard

Track quality metrics over time and report to stakeholders monthly. A simple scorecard includes: tests passing rate (target: 99%+), average data freshness by source, incidents by severity over time, and mean time to detection/resolution. This transforms quality from an invisible background task into something leadership can see, fund, and prioritize.
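The scorecard math is simple enough to sketch; the record shapes below are assumptions for illustration, not a standard format:

```python
def scorecard(test_results, incidents):
    """Aggregate monthly quality metrics.

    Assumed record shapes (illustrative only):
      test_results: list of {"passed": bool}
      incidents: list of {"detected_hours": float, "resolved_hours": float}
    """
    pass_rate = sum(r["passed"] for r in test_results) / len(test_results)
    mttd = sum(i["detected_hours"] for i in incidents) / len(incidents)
    mttr = sum(i["resolved_hours"] for i in incidents) / len(incidents)
    return {"pass_rate": round(pass_rate, 4),
            "mttd_hours": mttd,
            "mttr_hours": mttr}
```

Trend these month over month; a pass rate slipping from 99.5% to 97% is a budget conversation, not just an engineering one.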

The Organizational Component

Tools alone don't create quality. You need:

- A named owner for every business-critical table
- A triage process so every failed test gets a response, not a shrug
- Time explicitly budgeted for maintaining and tuning the test suite
- Agreement that false positives are bugs to be fixed, not noise to be ignored

Tool Comparison

The False Positive Gotcha

False positives kill trust in quality alerts faster than actual quality issues. If your Slack channel fires 15 alerts a day and 13 of them are noise, people stop checking the channel within 2 weeks. Then the 2 real issues also get ignored. Treat every false positive as a bug: either tighten the threshold, add an exception, or remove the check. A small suite of reliable alerts is worth more than a comprehensive suite nobody trusts.
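Measuring per-check alert precision makes the "treat false positives as bugs" rule concrete. A hypothetical sketch, assuming each triaged alert is logged with whether it was a real issue:

```python
from collections import defaultdict

def noisy_checks(alert_log, min_precision=0.5):
    """Flag checks whose alerts are mostly noise.

    alert_log: list of {"check": str, "true_positive": bool} (assumed shape).
    """
    counts = defaultdict(lambda: [0, 0])  # check -> [true positives, total alerts]
    for alert in alert_log:
        counts[alert["check"]][1] += 1
        counts[alert["check"]][0] += alert["true_positive"]
    return sorted(
        check for check, (tp, total) in counts.items() if tp / total < min_precision
    )
```

Anything this function flags should be tightened, given an exception, or deleted before the next review.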


Key Takeaways

- Replace one-off SQL checks with a system: schema tests, custom business logic tests, automated profiling, and freshness monitoring.
- Test the 20-30 columns that drive business decisions, not every field in every table.
- Quality failures should block the pipeline, not just post a warning.
- Treat every false positive as a bug; alert fatigue is the biggest threat to the program.
- Publish a monthly scorecard so leadership can see, fund, and prioritize quality work.


Ameer, Data Governance Specialist

Ameer is a Data Governance Specialist at CelestInfo focused on RBAC, compliance frameworks, data security, and access control across cloud data platforms.

Frequently Asked Questions

What are the six dimensions of data quality?

The six dimensions are: completeness (are all expected records present?), accuracy (do values reflect reality?), consistency (do related fields agree across tables?), timeliness (is data fresh enough for its use case?), uniqueness (are there duplicate records?), and validity (do values conform to expected formats and ranges?). Focus on the dimensions that matter most for your specific business context.

How do dbt tests work for data quality?

dbt has four built-in generic tests: unique, not_null, accepted_values, and relationships. You declare them in your YAML schema files and they run as SQL queries against your warehouse during every dbt build. You can also write custom tests as SQL queries that return rows when they fail. Tests block downstream models from building if they fail.

Should I test every column in my data warehouse?

No. Testing 100% of columns creates alert fatigue and maintenance overhead that kills adoption. Focus on the 20-30 columns that drive business decisions: primary keys, foreign keys, revenue figures, dates, and status fields. A focused test suite that gets maintained is worth more than a comprehensive one that gets ignored.

What is the biggest risk to a data quality program?

False positives. If your quality alerts fire for non-issues, people stop trusting them within 2-3 weeks. Once the team starts ignoring alerts, real issues slip through unnoticed. Tune your thresholds carefully, start conservative, and treat every false positive as a bug that needs fixing.
