Data Vault 2.0: When Star Schema Isn't Enough for Your Data Warehouse

Celestinfo Software Solutions Pvt. Ltd. Jul 03, 2025

Quick answer: Data Vault 2.0 uses a Hub-Link-Satellite architecture to build agile, auditable data warehouses. Choose it over star schema when you have 5+ source systems with conflicting business keys, strict audit requirements, or need to load data before business rules are defined. Use the dbtvault package with Snowflake for implementation. Be aware: you always need a business vault or information mart on top for reporting -- the raw vault is not query-friendly.

Last updated: July 2025

When Star Schema Breaks Down

Data Vault 2.0 solves three problems star schema cannot: integrating 10+ source systems without redesigning dimensions, tracking full audit history of every change, and loading data in parallel without merge conflicts. It uses three table types — Hubs (business keys), Links (relationships), and Satellites (descriptive attributes with timeline). Here’s when to use it, when to stick with star schema, and the exact DDL patterns.. The other 80% involve merging data from 5, 10, or 30 source systems where "customer ID" means something different in each one, business rules change quarterly, and auditors want to trace every data point back to its source.

Data Vault was designed for exactly this mess. It separates the concerns of business key identification, relationship tracking, and descriptive attribute storage into three distinct entity types -- and it never deletes or updates anything. If you're building a star schema in Snowflake and running into problems with source system volatility or audit trails, Data Vault might be what you need.

The Three Core Entity Types

Hubs: Business Key Registry

A Hub stores the unique business keys for a core business concept -- customers, products, orders, employees. Each Hub record contains: a hash key (surrogate), the business key value, the load timestamp, and the record source. That's it. No descriptive attributes. The Hub answers one question: "What business entities have we ever seen?"


SQL -- Customer Hub Table
CREATE TABLE raw_vault.hub_customer (
    customer_hk     BINARY(16)    NOT NULL,  -- MD5 hash of customer_id
    customer_id     VARCHAR(50)   NOT NULL,  -- The actual business key
    load_ts         TIMESTAMP_NTZ NOT NULL,
    record_source   VARCHAR(100)  NOT NULL,
    CONSTRAINT pk_hub_customer PRIMARY KEY (customer_hk)
);

Links: Relationship Tracking

A Link captures the relationship between two or more Hubs. An order Link connects a Customer Hub to a Product Hub. A Link table contains: a hash key for the relationship, the hash keys of each participating Hub, load timestamp, and record source. Links are many-to-many by default, which makes them flexible enough to model any relationship without restructuring.


SQL -- Order Link Table
CREATE TABLE raw_vault.link_order (
    order_hk        BINARY(16)    NOT NULL,  -- MD5(customer_hk + product_hk + order_id)
    customer_hk     BINARY(16)    NOT NULL,
    product_hk      BINARY(16)    NOT NULL,
    order_id        VARCHAR(50)   NOT NULL,
    load_ts         TIMESTAMP_NTZ NOT NULL,
    record_source   VARCHAR(100)  NOT NULL,
    CONSTRAINT pk_link_order PRIMARY KEY (order_hk)
);

Satellites: Descriptive Attributes with History

Satellites store the descriptive attributes and their full change history. A Customer Hub might have two Satellites: one from the CRM system (name, email, phone) and one from the billing system (address, credit limit, payment terms). Each Satellite record has: the parent Hub or Link hash key, the load timestamp, a hash diff (hash of all attribute values for change detection), and the actual attribute columns.


SQL -- Customer CRM Satellite
CREATE TABLE raw_vault.sat_customer_crm (
    customer_hk     BINARY(16)    NOT NULL,
    load_ts         TIMESTAMP_NTZ NOT NULL,
    hash_diff       BINARY(16)    NOT NULL,  -- MD5 of all attributes
    customer_name   VARCHAR(200),
    email           VARCHAR(200),
    phone           VARCHAR(50),
    record_source   VARCHAR(100)  NOT NULL,
    CONSTRAINT pk_sat_customer_crm PRIMARY KEY (customer_hk, load_ts)
);

The key insight: Satellites are insert-only. When a customer's email changes, you don't update the existing row. You insert a new row with the new email and a fresh load_ts. The old row stays. This gives you complete history without any additional SCD logic -- history tracking is baked into the architecture.

Hash Keys: Why They Matter

Data Vault 2.0 uses MD5 or SHA-1 hashes of business keys as surrogate keys. This seems odd until you consider parallel loading. With identity columns, you need a centralized sequence generator -- a bottleneck when you're loading 20 tables simultaneously. Hash keys are deterministic: any ETL process running on any server calculates the same hash for the same business key. No coordination needed.

Hash keys also provide consistent-length join columns regardless of whether the business key is a 3-character code or a 50-character composite. In Snowflake, BINARY(16) joins are fast and index-friendly.

Loading Patterns

Data Vault loading follows two rules: insert-only and parallel. You never update or delete records in the raw vault. Every load is an INSERT of new records that don't already exist (for Hubs and Links) or records where the hash_diff has changed (for Satellites).

Because each Hub, Link, and Satellite is independent, you can load them all in parallel. A typical load pattern processes staging tables in this order: all Hubs first (in parallel), then all Links (in parallel), then all Satellites (in parallel). The Hubs must complete before Links because Links reference Hub hash keys. With dbt and Snowflake, you define these dependencies in your DAG and dbt handles the execution order.

Concrete Example: Sales System

Consider a typical sales system with customers, products, and orders. Here's how the Data Vault model breaks down:



Notice how the CRM and ERP data about the same customer goes into separate Satellites. If the CRM says the customer's name is "Acme Corp" and the ERP says "ACME Corporation," both versions are preserved. Business rules for reconciliation get applied later, in the business vault layer -- not during loading.

Building a Business Vault Layer

The raw vault stores data as-is from source systems. The business vault applies business rules on top. Three common business vault structures:



Critical gotcha: The raw vault is not query-friendly. Reporting against it requires joining 8-15 tables with temporal logic. You always need either a business vault or an information mart (star schema) on top. Teams that skip this layer and try to report directly from the raw vault end up with SQL queries that are 200+ lines long and take 10 minutes to run. Budget for the business vault -- it typically adds 30-40% to development time.

Implementing in Snowflake with dbt

The dbtvault package (now called automate-dv) provides dbt macros for all Data Vault loading patterns. Install it in your packages.yml:


YAML -- packages.yml
packages:
 - package: Datavault-UK/automate_dv
    version: [">=0.10.0", "<0.11.0"]

Define your Hub loading model:

SQL -- models/raw_vault/hub_customer.sql
{%- set source_model = "stg_crm_customers" -%}
{%- set src_pk = "customer_hk" -%}
{%- set src_nk = "customer_id" -%}
{%- set src_ldts = "load_ts" -%}
{%- set src_source = "record_source" -%}

{{ automate_dv.hub(src_pk=src_pk,
                   src_nk=src_nk,
                   src_ldts=src_ldts,
                   src_source=src_source,
                   source_model=source_model) }}

The macro generates the correct MERGE statement that inserts only new business keys. Similar macros exist for Links (automate_dv.link) and Satellites (automate_dv.sat). For access control strategies, apply Snowflake's RBAC on the raw vault schema to restrict who can see source-level data versus the governed business vault.

When NOT to Use Data Vault

Key Takeaways

Frequently Asked Questions

Q: What is the difference between Data Vault and star schema?

Star schema is optimized for query performance and reporting, organizing data into fact and dimension tables. Data Vault is optimized for loading and auditing, using Hubs (business keys), Links (relationships), and Satellites (descriptive attributes with full history). Star schema is simpler to query; Data Vault handles source system changes and audit requirements better.

Q: Can I query a Data Vault raw layer directly for reporting?

Technically yes, but practically no. The raw vault is normalized into many small tables, making reporting queries extremely complex with dozens of joins. You always need a business vault or information mart layer on top that denormalizes the data into query-friendly structures like star schemas or wide tables.

Q: What are hash keys in Data Vault and why use them?

Hash keys are MD5 or SHA-1 hashes of business key columns used as surrogate keys in Data Vault tables. They enable sequence-independent parallel loading because any ETL process can calculate the same hash without needing a centralized sequence generator. They also provide consistent-length join keys regardless of business key complexity.

Q: How do I implement Data Vault in Snowflake with dbt?

Use the dbtvault (now called automate-dv) package for dbt. It provides macros for generating Hub, Link, and Satellite loading patterns. You define your source mappings in YAML, specify business keys and hash columns, and the package generates the correct INSERT-only loading SQL for Snowflake. This eliminates most of the boilerplate code.

Ameer, Data Governance Specialist

Ameer is a Data Governance Specialist at CelestInfo focused on RBAC, compliance frameworks, data security, and access control across cloud data platforms.

Related Articles

Burning Questions
About CelestInfo

Simple answers to make things clear.

Our AI insights are continuously trained on large datasets and validated by experts to ensure high accuracy.

Absolutely. CelestInfo supports integration with a wide range of industry-standard software and tools.

We implement enterprise-grade encryption, access controls, and regular audits to ensure your data is safe.

Insights are updated in real-time as new data becomes available.

We offer 24/7 support via chat, email, and dedicated account managers.

Still have questions?

Ready? Let's Talk!

Get expert insights and answers tailored to your business requirements and transformation.