Star Schema in Snowflake: Why It Still Works and How to Build One Right
Last updated: February 2025
Quick answer: Star schemas work exceptionally well in Snowflake. The columnar storage and micro-partition pruning make denormalized dimension joins fast without indexes. Define your fact table grain first, build dimension tables with surrogate keys, use MERGE for SCD Type 2 updates, and organize it all in dbt with staging/intermediate/mart layers.
Star Schema Isn't Dead. It's the Default.
Every few years, someone declares dimensional modeling dead. First it was "schema-on-read will replace everything." Then it was "just dump it in the data lake." Neither worked out. Analysts still need to understand the data, and nothing beats a well-designed star schema for that. Snowflake's architecture actually makes star schemas perform better than they did in traditional row-based databases, because columnar storage and automatic micro-partition pruning do what indexes and materialized views used to do - without the maintenance overhead.
Why Star Schemas Work in Snowflake
Snowflake stores data in micro-partitions (50-500MB of uncompressed data each, stored columnar). When you run a query that filters on WHERE order_date BETWEEN '2026-01-01' AND '2026-01-31', Snowflake checks the min/max metadata for the order_date column in each micro-partition and skips any partition that doesn't contain January 2026 data. This is automatic - no indexes to create, no VACUUM to run.
In a star schema, your fact table is naturally ordered by time (because transactions happen in time order). This means the order_date column in your fact table is naturally clustered within micro-partitions, giving you excellent pruning on date-range queries without any explicit clustering keys. Dimension tables are small enough (usually under 1M rows) that Snowflake broadcast-joins them efficiently.
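If you want to check how well that natural clustering is holding up, Snowflake exposes clustering metadata through a system function, and very large fact tables can optionally get an explicit clustering key. A quick sketch, assuming the fct_sales table defined later in this article:
-- Inspect micro-partition overlap on the date key; lower average depth
-- in the output means better pruning.
SELECT SYSTEM$CLUSTERING_INFORMATION('fct_sales', '(date_sk)');

-- Usually only worthwhile for very large tables (1TB+): an explicit
-- clustering key keeps pruning effective as the table grows.
ALTER TABLE fct_sales CLUSTER BY (date_sk);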
Designing Dimension Tables
SCD Type 1: Overwrite
Type 1 is simple: when a dimension attribute changes, you overwrite the old value. The customer's address changed? Update the row. You lose history, but you keep things simple. Use Type 1 for attributes where historical tracking doesn't matter (email addresses, phone numbers, default settings).
MERGE INTO dim_customer AS target
USING stg_customer AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    target.customer_name = source.customer_name,
    target.email = source.email,
    target.address = source.address,
    target.updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (
    -- customer_sk is omitted: the AUTOINCREMENT default assigns it
    customer_id, customer_name, email, address, created_at, updated_at
) VALUES (
    source.customer_id, source.customer_name,
    source.email, source.address, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP()
);
SCD Type 2: Versioned Rows
Type 2 keeps history by creating a new row for each change, with effective_from, effective_to, and is_current columns. When a customer moves from New York to Chicago, the old row gets effective_to = '2026-02-26' and is_current = FALSE, and a new row is inserted with effective_from = '2026-02-27' and is_current = TRUE. This lets analysts answer questions like "what was the customer's region when they placed order #12345?"
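Here's a minimal sketch of that pattern in Snowflake SQL, assuming a staging table stg_customer that holds the latest snapshot per customer_id (dbt snapshots automate the same logic):
-- Step 1: expire the current row for customers whose tracked attributes changed.
MERGE INTO dim_customer AS d
USING stg_customer AS s
    ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.city IS DISTINCT FROM s.city
               OR d.state IS DISTINCT FROM s.state
               OR d.region IS DISTINCT FROM s.region)
    THEN UPDATE SET d.effective_to = CURRENT_TIMESTAMP(), d.is_current = FALSE;

-- Step 2: insert a fresh current row for changed customers (just expired above)
-- and for brand-new customers. effective_from, effective_to, and is_current
-- fall back to the column defaults shown in the DDL below.
INSERT INTO dim_customer (customer_id, customer_name, email, city, state, region, segment)
SELECT s.customer_id, s.customer_name, s.email, s.city, s.state, s.region, s.segment
FROM stg_customer s
LEFT JOIN dim_customer d
    ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;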
Designing Fact Tables
Define the Grain First
The grain is the most important decision in your fact table design. It defines what one row represents. For a sales analytics star schema, the grain might be "one line item per order per product." Writing this down explicitly prevents scope creep and keeps the table useful.
Wrong grain: "One row per order." This hides product-level detail and forces aggregation at query time. Right grain: "One row per order line item." This gives analysts the flexibility to aggregate by product, category, customer, or any combination.
Handling Late-Arriving Facts
Facts don't always arrive on time. A sales transaction from yesterday might not hit your pipeline until tomorrow. Use MERGE with a unique key (typically the source system's transaction ID) so late-arriving facts get inserted without creating duplicates. In dbt, this is the incremental materialization with unique_key.
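A minimal sketch of that dbt configuration; the model name stg_sales and the key column order_line_id are illustrative:
-- models/marts/fct_sales.sql (sketch)
{{ config(
    materialized='incremental',
    unique_key='order_line_id'
) }}

SELECT
    order_line_id,
    order_number,
    quantity,
    unit_price,
    line_total,
    loaded_at
FROM {{ ref('stg_sales') }}
{% if is_incremental() %}
-- Look back a few days so late-arriving rows are still picked up; on
-- Snowflake, dbt's default merge strategy with unique_key prevents duplicates.
WHERE loaded_at >= (SELECT DATEADD('day', -3, MAX(loaded_at)) FROM {{ this }})
{% endif %}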
Degenerate Dimensions
Some dimension attributes live directly in the fact table because they don't justify their own dimension table. Order number is the classic example - it's a dimension (you filter and group by it), but there are no additional attributes to store in a separate table. Keep it in the fact table as a degenerate dimension.
Concrete Example: Sales Analytics Star Schema
Here's a complete example with customer, product, date, and store dimensions:
CREATE TABLE dim_date (
    date_sk INT PRIMARY KEY,
    full_date DATE NOT NULL,
    year INT,
    quarter INT,
    month INT,
    month_name VARCHAR(20),
    week_of_year INT,
    day_of_week INT,
    day_name VARCHAR(20),
    is_weekend BOOLEAN,
    is_holiday BOOLEAN
);
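The date dimension is generated, not loaded from a source. One way to populate it in Snowflake (the start date and 20-year range are arbitrary; the weekend logic assumes the default WEEK_START setting, where DAYOFWEEK returns 0 for Sunday):
INSERT INTO dim_date
SELECT
    TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')) AS date_sk,
    d                                 AS full_date,
    YEAR(d)                           AS year,
    QUARTER(d)                        AS quarter,
    MONTH(d)                          AS month,
    MONTHNAME(d)                      AS month_name,
    WEEKOFYEAR(d)                     AS week_of_year,
    DAYOFWEEK(d)                      AS day_of_week,
    DAYNAME(d)                        AS day_name,
    DAYOFWEEK(d) IN (0, 6)            AS is_weekend,
    FALSE                             AS is_holiday  -- populate from a holiday calendar later
FROM (
    SELECT DATEADD('day', ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1, '2016-01-01'::DATE) AS d
    FROM TABLE(GENERATOR(ROWCOUNT => 7305))  -- ~20 years of dates
);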
CREATE TABLE dim_customer (
    customer_sk INT AUTOINCREMENT PRIMARY KEY,
    customer_id VARCHAR(50) NOT NULL,
    customer_name VARCHAR(200),
    email VARCHAR(200),
    city VARCHAR(100),
    state VARCHAR(50),
    region VARCHAR(50),
    segment VARCHAR(50),
    effective_from TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    effective_to TIMESTAMP_NTZ DEFAULT '9999-12-31'::TIMESTAMP_NTZ,
    is_current BOOLEAN DEFAULT TRUE
);
CREATE TABLE dim_product (
    product_sk INT AUTOINCREMENT PRIMARY KEY,
    product_id VARCHAR(50) NOT NULL,
    product_name VARCHAR(300),
    category VARCHAR(100),
    subcategory VARCHAR(100),
    brand VARCHAR(100),
    unit_cost DECIMAL(10,2),
    effective_from TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    effective_to TIMESTAMP_NTZ DEFAULT '9999-12-31'::TIMESTAMP_NTZ,
    is_current BOOLEAN DEFAULT TRUE
);
CREATE TABLE dim_store (
    store_sk INT AUTOINCREMENT PRIMARY KEY,
    store_id VARCHAR(50) NOT NULL,
    store_name VARCHAR(200),
    city VARCHAR(100),
    state VARCHAR(50),
    region VARCHAR(50),
    store_type VARCHAR(50),
    open_date DATE
);
CREATE TABLE fct_sales (
    sale_sk INT AUTOINCREMENT PRIMARY KEY,
    -- Foreign keys to dimensions
    date_sk INT NOT NULL REFERENCES dim_date(date_sk),
    customer_sk INT NOT NULL REFERENCES dim_customer(customer_sk),
    product_sk INT NOT NULL REFERENCES dim_product(product_sk),
    store_sk INT NOT NULL REFERENCES dim_store(store_sk),
    -- Degenerate dimension
    order_number VARCHAR(50),
    -- Measures
    quantity INT,
    unit_price DECIMAL(10,2),
    discount_pct DECIMAL(5,2),
    line_total DECIMAL(12,2),
    -- Metadata
    loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
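Two notes on this DDL. Snowflake parses but does not enforce PRIMARY KEY and REFERENCES constraints (only NOT NULL is enforced), so they act as documentation for people and BI tools - referential integrity is checked with dbt tests instead (see below). And a typical query against this schema joins the fact table to whichever dimensions the question needs, for example:
-- Monthly revenue by product category and store region.
SELECT
    d.year,
    d.month,
    p.category,
    s.region,
    SUM(f.line_total) AS revenue
FROM fct_sales f
JOIN dim_date    d ON f.date_sk    = d.date_sk
JOIN dim_product p ON f.product_sk = p.product_sk
JOIN dim_store   s ON f.store_sk   = s.store_sk
WHERE d.year = 2026
GROUP BY d.year, d.month, p.category, s.region
ORDER BY d.year, d.month;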
Building with dbt: The Layer Pattern
The standard dbt project structure for a star schema uses three layers. This is the pattern we use on every project, and it maps directly to how dbt works with Snowflake:
- Staging layer (models/staging/): One model per source table. Renames columns, casts types, filters out test/deleted records. Materialized as views. Examples: stg_orders.sql, stg_customers.sql.
- Intermediate layer (models/intermediate/): Business logic, joins, SCD logic, deduplication. Example: int_customer_scd2.sql handles the Type 2 versioning logic.
- Mart layer (models/marts/): Final dimension and fact tables consumed by BI tools. Materialized as tables. Examples: dim_customer.sql, fct_sales.sql.
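For example, a staging model might look like this (the source name crm and the raw column names are illustrative):
-- models/staging/stg_customers.sql (sketch)
{{ config(materialized='view') }}

SELECT
    id::VARCHAR(50) AS customer_id,
    TRIM(full_name) AS customer_name,
    LOWER(email)    AS email,
    city,
    state,
    _loaded_at      AS loaded_at
FROM {{ source('crm', 'customers') }}
WHERE deleted_at IS NULL            -- drop soft-deleted records
  AND NOT COALESCE(is_test, FALSE)  -- drop test records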
Testing Your Model with dbt
Every star schema should have these dbt tests at minimum:
- Referential integrity: Test that every foreign key in the fact table has a matching row in the dimension table. Use dbt's relationships test.
- Primary key uniqueness: Test that surrogate keys are unique in every dimension table and fact table. Use unique and not_null tests.
- Freshness checks: Configure sources.yml with loaded_at_field and freshness to alert when source data is stale.
- Accepted values: Test that categorical columns (like segment or region) only contain expected values.
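The relationships, unique, not_null, and accepted_values tests are declared in the model's YAML file. The same referential-integrity check can also be written as a singular SQL test, which makes the semantics explicit - a dbt singular test passes when the query returns zero rows (the file name is illustrative):
-- tests/assert_fct_sales_customers_exist.sql (sketch)
-- Every customer_sk in the fact table must resolve to a dimension row.
SELECT f.customer_sk
FROM {{ ref('fct_sales') }} f
LEFT JOIN {{ ref('dim_customer') }} d
    ON f.customer_sk = d.customer_sk
WHERE d.customer_sk IS NULL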
Common Mistakes
- Over-normalization for a columnar warehouse. Snowflake is columnar - it only reads the columns you query. Having a few extra columns in a dimension table costs almost nothing in scan time. Don't create separate dim_city, dim_state, and dim_region tables when those attributes can all live in dim_store. Over-normalization forces unnecessary joins that hurt performance more than the storage savings help.
- Too many small dimension tables. If your dimension has 5 rows (like a status lookup table with "Active", "Inactive", "Pending"), it probably doesn't need its own table. Use a degenerate dimension in the fact table or a simple CASE expression in the dbt model.
- Not defining the grain clearly. If two team members disagree about what one row in the fact table represents, your model is broken. Write the grain statement in the dbt model's YAML description and make it unambiguous.
- Skipping surrogate keys. Natural keys from source systems change format, get recycled, or contain PII. Always use an integer surrogate key (AUTOINCREMENT) for dimension primary keys and fact table foreign keys.
- Ignoring late-arriving dimensions. When a fact arrives referencing a customer_id that doesn't exist in dim_customer yet, what happens? If you don't handle this, you'll get referential integrity failures. Insert a placeholder dimension row (with "Unknown" values) and backfill it later - see the sketch after this list.
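A sketch of that placeholder pattern, assuming incoming facts land in a staging table stg_sales that carries the source customer_id:
-- Insert an "Unknown" row for any customer referenced by incoming facts but
-- missing from the dimension; effective_from, effective_to, and is_current
-- come from the column defaults. Backfill when the real record arrives.
INSERT INTO dim_customer (customer_id, customer_name, city, state, region, segment)
SELECT DISTINCT s.customer_id, 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown'
FROM stg_sales s
LEFT JOIN dim_customer d
    ON d.customer_id = s.customer_id
WHERE d.customer_id IS NULL;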
Key Takeaways
- Star schemas perform well in Snowflake because columnar storage and micro-partition pruning replace the need for indexes.
- Define your fact table grain first - everything else flows from that decision.
- Use SCD Type 1 for most dimensions, Type 2 only where historical tracking matters to the business.
- Organize dbt models into staging (views), intermediate (logic), and mart (tables) layers.
- Test referential integrity, key uniqueness, and data freshness as part of every dbt run.
- Don't over-normalize. Snowflake's columnar engine makes wide tables cheap to scan. Extra joins are expensive.
Frequently Asked Questions
Q: Do star schemas still make sense in Snowflake?
Yes. Snowflake's columnar storage and micro-partition pruning make star schemas perform exceptionally well. Denormalized fact tables with dimension foreign keys allow the query optimizer to prune irrelevant partitions efficiently. Clean dimensional models are also easier for analysts to understand and query.
Q: Does Snowflake need indexes for star schema performance?
No. Snowflake doesn't support traditional indexes. Instead, it uses automatic micro-partition pruning based on column min/max metadata. For very large tables (1TB+), you can add clustering keys on frequently filtered columns, but most star schemas perform well without them.
Q: Should I use SCD Type 1 or Type 2 for dimension tables?
Use Type 1 (overwrite) when you don't need historical tracking and want simplicity. Use Type 2 (versioned rows with effective dates) when analysts need to see what a dimension value was at the time of a transaction. Many teams use Type 1 for most dimensions and Type 2 only for business-critical ones.
Q: How do I handle late-arriving facts in a star schema?
Use MERGE (upsert) with a unique key so late-arriving facts get inserted without creating duplicates. In dbt, use the incremental materialization with a unique_key configuration to handle this pattern automatically.
