Data Vault 2.0: When Star Schema Isn't Enough for Your Data Warehouse
Quick answer: Data Vault 2.0 uses a Hub-Link-Satellite architecture to build agile, auditable data warehouses. Choose it over star schema when you have 5+ source systems with conflicting business keys, strict audit requirements, or a need to load data before business rules are defined. Use the automate-dv package (formerly dbtvault) with Snowflake for implementation. Be aware: you always need a business vault or information mart on top for reporting -- the raw vault is not query-friendly.
Last updated: July 2025
When Star Schema Breaks Down
Data Vault 2.0 solves three problems star schema cannot: integrating 10+ source systems without redesigning dimensions, tracking a full audit history of every change, and loading data in parallel without merge conflicts. It uses three table types -- Hubs (business keys), Links (relationships), and Satellites (descriptive attributes with a timeline). Here's when to use it, when to stick with star schema, and the exact DDL patterns. Star schema tutorials cover the easy 20% of projects: one or two clean sources and stable requirements. The other 80% involve merging data from 5, 10, or 30 source systems where "customer ID" means something different in each one, business rules change quarterly, and auditors want to trace every data point back to its source.
Data Vault was designed for exactly this mess. It separates the concerns of business key identification, relationship tracking, and descriptive attribute storage into three distinct entity types -- and it never deletes or updates anything. If you're building a star schema in Snowflake and running into problems with source system volatility or audit trails, Data Vault might be what you need.
The Three Core Entity Types
Hubs: Business Key Registry
A Hub stores the unique business keys for a core business concept -- customers, products, orders, employees. Each Hub record contains: a hash key (surrogate), the business key value, the load timestamp, and the record source. That's it. No descriptive attributes. The Hub answers one question: "What business entities have we ever seen?"
CREATE TABLE raw_vault.hub_customer (
customer_hk BINARY(16) NOT NULL, -- MD5 hash of customer_id
customer_id VARCHAR(50) NOT NULL, -- The actual business key
load_ts TIMESTAMP_NTZ NOT NULL,
record_source VARCHAR(100) NOT NULL,
CONSTRAINT pk_hub_customer PRIMARY KEY (customer_hk)
);
Links: Relationship Tracking
A Link captures the relationship between two or more Hubs. An order Link connects a Customer Hub to a Product Hub. A Link table contains: a hash key for the relationship, the hash keys of each participating Hub, load timestamp, and record source. Links are many-to-many by default, which makes them flexible enough to model any relationship without restructuring.
CREATE TABLE raw_vault.link_order (
order_hk BINARY(16) NOT NULL, -- MD5 of customer_id + product_sku + order_id (the natural keys, not the hub hash keys)
customer_hk BINARY(16) NOT NULL,
product_hk BINARY(16) NOT NULL,
order_id VARCHAR(50) NOT NULL,
load_ts TIMESTAMP_NTZ NOT NULL,
record_source VARCHAR(100) NOT NULL,
CONSTRAINT pk_link_order PRIMARY KEY (order_hk)
);
Satellites: Descriptive Attributes with History
Satellites store the descriptive attributes and their full change history. A Customer Hub might have two Satellites: one from the CRM system (name, email, phone) and one from the billing system (address, credit limit, payment terms). Each Satellite record has: the parent Hub or Link hash key, the load timestamp, a hash diff (hash of all attribute values for change detection), and the actual attribute columns.
CREATE TABLE raw_vault.sat_customer_crm (
customer_hk BINARY(16) NOT NULL,
load_ts TIMESTAMP_NTZ NOT NULL,
hash_diff BINARY(16) NOT NULL, -- MD5 of all attributes
customer_name VARCHAR(200),
email VARCHAR(200),
phone VARCHAR(50),
record_source VARCHAR(100) NOT NULL,
CONSTRAINT pk_sat_customer_crm PRIMARY KEY (customer_hk, load_ts)
);
The key insight: Satellites are insert-only. When a customer's email changes, you don't update the existing row. You insert a new row with the new email and a fresh load_ts. The old row stays. This gives you complete history without any additional SCD logic -- history tracking is baked into the architecture.
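The insert-only pattern and hash_diff change detection can be sketched in a few lines of plain Python (illustrative only -- the row layout mirrors sat_customer_crm above, but the function names and the in-memory list standing in for the table are assumptions, not warehouse code):

```python
import hashlib
from datetime import datetime, timezone

def hash_diff(attrs: dict) -> bytes:
    """Hash all descriptive attributes (in a fixed column order) for change detection."""
    payload = "||".join(str(attrs[k]) for k in sorted(attrs))
    return hashlib.md5(payload.encode("utf-8")).digest()

def load_satellite(sat_rows: list, customer_hk: bytes, attrs: dict, record_source: str) -> None:
    """Insert-only: append a new row only when the attributes actually changed."""
    current = [r for r in sat_rows if r["customer_hk"] == customer_hk]
    latest = max(current, key=lambda r: r["load_ts"], default=None)
    if latest is None or latest["hash_diff"] != hash_diff(attrs):
        sat_rows.append({
            "customer_hk": customer_hk,
            "load_ts": datetime.now(timezone.utc),
            "hash_diff": hash_diff(attrs),
            **attrs,
            "record_source": record_source,
        })
    # No UPDATE, no DELETE: the old row stays, preserving full history.

sat = []
hk = b"\x00" * 16  # stand-in hash key
load_satellite(sat, hk, {"customer_name": "Acme Corp", "email": "a@acme.com", "phone": "555"}, "CRM")
load_satellite(sat, hk, {"customer_name": "Acme Corp", "email": "a@acme.com", "phone": "555"}, "CRM")  # unchanged -> no row
load_satellite(sat, hk, {"customer_name": "Acme Corp", "email": "new@acme.com", "phone": "555"}, "CRM")  # changed -> new row
print(len(sat))  # 2 rows: the original plus the changed-email version
```

The second load is a no-op because its hash_diff matches the latest row -- one hash comparison replaces column-by-column SCD change checks.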
Hash Keys: Why They Matter
Data Vault 2.0 uses MD5 or SHA-1 hashes of business keys as surrogate keys. This seems odd until you consider parallel loading. With identity columns, you need a centralized sequence generator -- a bottleneck when you're loading 20 tables simultaneously. Hash keys are deterministic: any ETL process running on any server calculates the same hash for the same business key. No coordination needed.
Hash keys also provide consistent-length join columns regardless of whether the business key is a 3-character code or a 50-character composite. Snowflake has no traditional indexes, but fixed-width BINARY(16) keys keep join columns compact and comparisons cheap.
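The determinism argument can be shown in a few lines of Python (illustrative only -- in Snowflake you would compute this with a function like MD5_BINARY or via automate-dv's hashing macros; the `||` delimiter and trim/uppercase normalization here are one common convention, not a fixed standard):

```python
import hashlib

def hash_key(*business_keys: str) -> bytes:
    """Concatenate normalized business keys and MD5 them.
    The normalization (trim + uppercase) must be identical across all
    loaders, or the same business key yields different hash keys."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).digest()

# Any ETL process, on any server, on any day: same input -> same 16-byte key.
hk = hash_key("CUST-0042")
print(hk.hex(), len(hk))  # always 16 bytes, matching BINARY(16)

# Composite keys (e.g. for a Link) hash the same way:
link_hk = hash_key("CUST-0042", "SKU-9", "ORD-1001")
```

No sequence generator, no lookup table, no coordination -- two loaders hitting two tables simultaneously derive identical surrogate keys independently.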
Loading Patterns
Data Vault loading follows two rules: insert-only and parallel. You never update or delete records in the raw vault. Every load is an INSERT of new records that don't already exist (for Hubs and Links) or records where the hash_diff has changed (for Satellites).
Because each Hub, Link, and Satellite is independent, you can load them all in parallel. A typical load pattern processes staging tables in this order: all Hubs first (in parallel), then all Links (in parallel), then all Satellites (in parallel). The Hubs must complete before Links because Links reference Hub hash keys. With dbt and Snowflake, you define these dependencies in your DAG and dbt handles the execution order.
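The wave ordering can be sketched with Python's concurrent.futures (illustrative only -- in a real project dbt's DAG enforces these dependencies through ref() calls, and the loader names below are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def run_wave(loaders):
    """Run one wave of independent loads in parallel; block until all finish."""
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda load: load(), loaders))

log = []
def loader(name):
    # Stand-in for an actual table load; records completion order.
    return lambda: log.append(name)

# Wave 1: all Hubs in parallel. Wave 2: Links (need Hub hash keys).
# Wave 3: all Satellites in parallel.
run_wave([loader("hub_customer"), loader("hub_product")])
run_wave([loader("link_order")])
run_wave([loader("sat_customer_crm"), loader("sat_customer_billing"), loader("sat_order_details")])
```

Within a wave the completion order is nondeterministic -- and that's the point: nothing inside a wave depends on anything else in that wave.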
Concrete Example: Sales System
Consider a typical sales system with customers, products, and orders. Here's how the Data Vault model breaks down:
- Hub_Customer -- business key: customer_id
- Hub_Product -- business key: product_sku
- Link_Order -- connects Hub_Customer + Hub_Product, keyed by order_id
- Sat_Customer_CRM -- name, email, phone from CRM system
- Sat_Customer_Billing -- address, credit_limit, payment_terms from ERP
- Sat_Product_Catalog -- product_name, category, price from product catalog
- Sat_Order_Details -- quantity, unit_price, discount, order_date from order system
Notice how the CRM and ERP data about the same customer goes into separate Satellites. If the CRM says the customer's name is "Acme Corp" and the ERP says "ACME Corporation," both versions are preserved. Business rules for reconciliation get applied later, in the business vault layer -- not during loading.
Building a Business Vault Layer
The raw vault stores data as-is from source systems. The business vault applies business rules on top. Three common business vault structures:
- Calculated Satellites: Derived attributes computed from raw Satellites (e.g., customer lifetime value calculated from order history)
- Point-in-Time (PIT) tables: Pre-joined snapshots that combine a Hub with its latest Satellite records at specific points in time, eliminating expensive temporal joins at query time
- Bridge tables: Pre-resolved many-to-many relationships between Hubs, flattening Link traversals for easier querying
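To make the PIT idea concrete, here is a small Python sketch (illustrative, not warehouse code) of what a PIT table precomputes: for each Hub key and snapshot date, the load date of the latest eligible row in each Satellite, so downstream queries can equi-join on (hub_key, load_ts) instead of running temporal lookups:

```python
from datetime import date

def build_pit(hub_keys, satellites, as_of):
    """satellites maps a satellite name to a list of (hub_key, load_date) rows.
    For each hub key, find the latest load date per satellite at the snapshot."""
    pit = []
    for hk in hub_keys:
        row = {"hub_key": hk, "as_of": as_of}
        for sat_name, rows in satellites.items():
            eligible = [ld for (k, ld) in rows if k == hk and ld <= as_of]
            row[sat_name + "_ldts"] = max(eligible, default=None)
        pit.append(row)
    return pit

sats = {
    "sat_crm": [("C1", date(2025, 1, 5)), ("C1", date(2025, 3, 1))],
    "sat_billing": [("C1", date(2025, 2, 10))],
}
pit = build_pit(["C1"], sats, as_of=date(2025, 2, 28))
print(pit)  # sat_crm resolves to the Jan 5 row; the Mar 1 row is after the snapshot
```

Materializing this once per snapshot trades storage for query simplicity: reports join the PIT to the Satellites on exact keys instead of repeating the "latest row as of date X" logic everywhere.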
Critical gotcha: The raw vault is not query-friendly. Reporting against it requires joining 8-15 tables with temporal logic. You always need either a business vault or an information mart (star schema) on top. Teams that skip this layer and try to report directly from the raw vault end up with SQL queries that are 200+ lines long and take 10 minutes to run. Budget for the business vault -- it typically adds 30-40% to development time.
Implementing in Snowflake with dbt
The dbtvault package (now called automate-dv) provides dbt macros for all Data Vault loading patterns. Install it in your packages.yml:
packages:
  - package: Datavault-UK/automate_dv
    version: [">=0.10.0", "<0.11.0"]
Define your Hub loading model:
{%- set source_model = "stg_crm_customers" -%}
{%- set src_pk = "customer_hk" -%}
{%- set src_nk = "customer_id" -%}
{%- set src_ldts = "load_ts" -%}
{%- set src_source = "record_source" -%}
{{ automate_dv.hub(src_pk=src_pk,
                   src_nk=src_nk,
                   src_ldts=src_ldts,
                   src_source=src_source,
                   source_model=source_model) }}
The macro generates insert-only loading SQL that adds only business keys not already present in the Hub. Similar macros exist for Links (automate_dv.link) and Satellites (automate_dv.sat). For access control, apply Snowflake RBAC on the raw vault schema to restrict who can query source-level data while exposing the governed business vault more broadly.
When NOT to Use Data Vault
- Simple analytics needs: If you have 1-3 source systems and well-defined requirements, star schema is simpler and faster to deliver
- Small team without DV experience: Data Vault has a steep learning curve. A 2-person team with no DV experience will spend months learning before they're productive
- Fast dashboard delivery matters most: If your stakeholders need a dashboard in 2 weeks, build a star schema. You can always refactor to Data Vault later if complexity grows
- Read-heavy, write-light workloads: If you're loading data once a day and running 500 queries, optimize for reads (star schema), not writes (Data Vault)
Key Takeaways
- Data Vault 2.0 separates concerns into Hubs (business keys), Links (relationships), and Satellites (attributes with history)
- Choose Data Vault when you have 5+ volatile source systems, strict audit requirements, or need to load before defining business rules
- Hash keys enable parallel loading without centralized sequence generators
- Raw vault is insert-only -- history tracking is architectural, not procedural
- You always need a business vault or information mart on top for reporting -- budget 30-40% extra development time
- The dbtvault/automate-dv package for dbt eliminates most boilerplate code for Snowflake implementation
- Don't use Data Vault for simple projects -- star schema is faster to deliver when requirements are clear
Frequently Asked Questions
Q: What is the difference between Data Vault and star schema?
Star schema is optimized for query performance and reporting, organizing data into fact and dimension tables. Data Vault is optimized for loading and auditing, using Hubs (business keys), Links (relationships), and Satellites (descriptive attributes with full history). Star schema is simpler to query; Data Vault handles source system changes and audit requirements better.
Q: Can I query a Data Vault raw layer directly for reporting?
Technically yes, but practically no. The raw vault is normalized into many small tables, making reporting queries extremely complex with dozens of joins. You always need a business vault or information mart layer on top that denormalizes the data into query-friendly structures like star schemas or wide tables.
Q: What are hash keys in Data Vault and why use them?
Hash keys are MD5 or SHA-1 hashes of business key columns used as surrogate keys in Data Vault tables. They enable sequence-independent parallel loading because any ETL process can calculate the same hash without needing a centralized sequence generator. They also provide consistent-length join keys regardless of business key complexity.
Q: How do I implement Data Vault in Snowflake with dbt?
Use the dbtvault (now called automate-dv) package for dbt. It provides macros for generating Hub, Link, and Satellite loading patterns. You define your source mappings in YAML, specify business keys and hash columns, and the package generates the correct INSERT-only loading SQL for Snowflake. This eliminates most of the boilerplate code.
