Star Schema in Snowflake: Why It Still Works and How to Build One Right
Last updated: February 2025
Quick answer: Star schemas work exceptionally well in Snowflake. The columnar storage and micro-partition pruning make denormalized dimension joins fast without indexes. Define your fact table grain first, build dimension tables with surrogate keys, use MERGE for SCD Type 2 updates, and organize it all in dbt with staging/intermediate/mart layers.
Star Schema Isn't Dead. It's the Default.
Every few years, someone declares dimensional modeling dead. First it was "schema-on-read will replace everything." Then it was "just dump it in the data lake." Neither worked out. Analysts still need to understand the data, and nothing beats a well-designed star schema for that. Snowflake's architecture actually makes star schemas perform better than they did in traditional row-based databases, because columnar storage and automatic micro-partition pruning do what indexes and materialized views used to do - without the maintenance overhead.
Why Star Schemas Work in Snowflake
Snowflake stores data in micro-partitions (50-500MB of uncompressed data each, stored columnar). When you run a query that filters on WHERE order_date BETWEEN '2026-01-01' AND '2026-01-31', Snowflake checks the min/max metadata for the order_date column in each micro-partition and skips any partition that doesn't contain January 2026 data. This is automatic - no indexes to create, no VACUUM to run.
In a star schema, your fact table is naturally ordered by time (because transactions happen in time order). This means the order_date column in your fact table is naturally clustered within micro-partitions, giving you excellent pruning on date-range queries without any explicit clustering keys. Dimension tables are small enough (usually under 1M rows) that Snowflake broadcast-joins them efficiently.
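If you want to check how well that natural clustering is holding up, Snowflake exposes clustering metadata through a system function, and very large fact tables can optionally get an explicit clustering key. A quick sketch, assuming the fct_sales table defined later in this article:
-- Inspect micro-partition overlap on the date key; lower average depth
-- in the output means better pruning.
SELECT SYSTEM$CLUSTERING_INFORMATION('fct_sales', '(date_sk)');

-- Usually only worthwhile for very large tables (1TB+): an explicit
-- clustering key keeps pruning effective as the table grows.
ALTER TABLE fct_sales CLUSTER BY (date_sk);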
Designing Dimension Tables
SCD Type 1: Overwrite
Type 1 is simple: when a dimension attribute changes, you overwrite the old value. The customer's address changed? Update the row. You lose history, but you keep things simple. Use Type 1 for attributes where historical tracking doesn't matter (email addresses, phone numbers, default settings).
MERGE INTO dim_customer AS target
USING stg_customer AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    target.customer_name = source.customer_name,
    target.email = source.email,
    target.address = source.address,
    target.updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (
    -- customer_sk is omitted: the AUTOINCREMENT default assigns it
    customer_id, customer_name, email, address, created_at, updated_at
) VALUES (
    source.customer_id, source.customer_name,
    source.email, source.address, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP()
);
SCD Type 2: Versioned Rows
Type 2 keeps history by creating a new row for each change, with effective_from, effective_to, and is_current columns. When a customer moves from New York to Chicago, the old row gets effective_to = '2026-02-26' and is_current = FALSE, and a new row is inserted with effective_from = '2026-02-27' and is_current = TRUE. This lets analysts answer questions like "what was the customer's region when they placed order #12345?"
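Here's a minimal sketch of that pattern in Snowflake SQL, assuming a staging table stg_customer that holds the latest snapshot per customer_id (dbt snapshots automate the same logic):
-- Step 1: expire the current row for customers whose tracked attributes changed.
MERGE INTO dim_customer AS d
USING stg_customer AS s
    ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.city IS DISTINCT FROM s.city
               OR d.state IS DISTINCT FROM s.state
               OR d.region IS DISTINCT FROM s.region)
    THEN UPDATE SET d.effective_to = CURRENT_TIMESTAMP(), d.is_current = FALSE;

-- Step 2: insert a fresh current row for changed customers (just expired above)
-- and for brand-new customers. effective_from, effective_to, and is_current
-- fall back to the column defaults shown in the DDL below.
INSERT INTO dim_customer (customer_id, customer_name, email, city, state, region, segment)
SELECT s.customer_id, s.customer_name, s.email, s.city, s.state, s.region, s.segment
FROM stg_customer s
LEFT JOIN dim_customer d
    ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;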
Designing Fact Tables
Define the Grain First
The grain is the most important decision in your fact table design. It defines what one row represents. For a sales analytics star schema, the grain might be "one line item per order per product." Writing this down explicitly prevents scope creep and keeps the table useful.
Wrong grain: "One row per order." This hides product-level detail and forces aggregation at query time. Right grain: "One row per order line item." This gives analysts the flexibility to aggregate by product, category, customer, or any combination.
Handling Late-Arriving Facts
Facts don't always arrive on time. A sales transaction from yesterday might not hit your pipeline until tomorrow. Use MERGE with a unique key (typically the source system's transaction ID) so late-arriving facts get inserted without creating duplicates. In dbt, this is the incremental materialization with unique_key.
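A minimal sketch of that dbt configuration; the model name stg_sales and the key column order_line_id are illustrative:
-- models/marts/fct_sales.sql (sketch)
{{ config(
    materialized='incremental',
    unique_key='order_line_id'
) }}

SELECT
    order_line_id,
    order_number,
    quantity,
    unit_price,
    line_total,
    loaded_at
FROM {{ ref('stg_sales') }}
{% if is_incremental() %}
-- Look back a few days so late-arriving rows are still picked up; on
-- Snowflake, dbt's default merge strategy with unique_key prevents duplicates.
WHERE loaded_at >= (SELECT DATEADD('day', -3, MAX(loaded_at)) FROM {{ this }})
{% endif %}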
Degenerate Dimensions
Some dimension attributes live directly in the fact table because they don't justify their own dimension table. Order number is the classic example - it's a dimension (you filter and group by it), but there are no additional attributes to store in a separate table. Keep it in the fact table as a degenerate dimension.
Concrete Example: Sales Analytics Star Schema
Here's a complete example with customer, product, date, and store dimensions:
CREATE TABLE dim_date (
    date_sk INT PRIMARY KEY,
    full_date DATE NOT NULL,
    year INT,
    quarter INT,
    month INT,
    month_name VARCHAR(20),
    week_of_year INT,
    day_of_week INT,
    day_name VARCHAR(20),
    is_weekend BOOLEAN,
    is_holiday BOOLEAN
);
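The date dimension is generated, not loaded from a source. One way to populate it in Snowflake (the start date and 20-year range are arbitrary; the weekend logic assumes the default WEEK_START setting, where DAYOFWEEK returns 0 for Sunday):
INSERT INTO dim_date
SELECT
    TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')) AS date_sk,
    d                                 AS full_date,
    YEAR(d)                           AS year,
    QUARTER(d)                        AS quarter,
    MONTH(d)                          AS month,
    MONTHNAME(d)                      AS month_name,
    WEEKOFYEAR(d)                     AS week_of_year,
    DAYOFWEEK(d)                      AS day_of_week,
    DAYNAME(d)                        AS day_name,
    DAYOFWEEK(d) IN (0, 6)            AS is_weekend,
    FALSE                             AS is_holiday  -- populate from a holiday calendar later
FROM (
    SELECT DATEADD('day', ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1, '2016-01-01'::DATE) AS d
    FROM TABLE(GENERATOR(ROWCOUNT => 7305))  -- ~20 years of dates
);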
CREATE TABLE dim_customer (
    customer_sk INT AUTOINCREMENT PRIMARY KEY,
    customer_id VARCHAR(50) NOT NULL,
    customer_name VARCHAR(200),
    email VARCHAR(200),
    city VARCHAR(100),
    state VARCHAR(50),
    region VARCHAR(50),
    segment VARCHAR(50),
    effective_from TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    effective_to TIMESTAMP_NTZ DEFAULT '9999-12-31'::TIMESTAMP_NTZ,
    is_current BOOLEAN DEFAULT TRUE
);
CREATE TABLE dim_product (
    product_sk INT AUTOINCREMENT PRIMARY KEY,
    product_id VARCHAR(50) NOT NULL,
    product_name VARCHAR(300),
    category VARCHAR(100),
    subcategory VARCHAR(100),
    brand VARCHAR(100),
    unit_cost DECIMAL(10,2),
    effective_from TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    effective_to TIMESTAMP_NTZ DEFAULT '9999-12-31'::TIMESTAMP_NTZ,
    is_current BOOLEAN DEFAULT TRUE
);
CREATE TABLE dim_store (
    store_sk INT AUTOINCREMENT PRIMARY KEY,
    store_id VARCHAR(50) NOT NULL,
    store_name VARCHAR(200),
    city VARCHAR(100),
    state VARCHAR(50),
    region VARCHAR(50),
    store_type VARCHAR(50),
    open_date DATE
);
CREATE TABLE fct_sales (
    sale_sk INT AUTOINCREMENT PRIMARY KEY,
    -- Foreign keys to dimensions
    date_sk INT NOT NULL REFERENCES dim_date(date_sk),
    customer_sk INT NOT NULL REFERENCES dim_customer(customer_sk),
    product_sk INT NOT NULL REFERENCES dim_product(product_sk),
    store_sk INT NOT NULL REFERENCES dim_store(store_sk),
    -- Degenerate dimension
    order_number VARCHAR(50),
    -- Measures
    quantity INT,
    unit_price DECIMAL(10,2),
    discount_pct DECIMAL(5,2),
    line_total DECIMAL(12,2),
    -- Metadata
    loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
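Two notes on this DDL. Snowflake parses but does not enforce PRIMARY KEY and REFERENCES constraints (only NOT NULL is enforced), so they act as documentation for people and BI tools - referential integrity is checked with dbt tests instead (see below). And a typical query against this schema joins the fact table to whichever dimensions the question needs, for example:
-- Monthly revenue by product category and store region.
SELECT
    d.year,
    d.month,
    p.category,
    s.region,
    SUM(f.line_total) AS revenue
FROM fct_sales f
JOIN dim_date    d ON f.date_sk    = d.date_sk
JOIN dim_product p ON f.product_sk = p.product_sk
JOIN dim_store   s ON f.store_sk   = s.store_sk
WHERE d.year = 2026
GROUP BY d.year, d.month, p.category, s.region
ORDER BY d.year, d.month;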
Building with dbt: The Layer Pattern
The standard dbt project structure for a star schema uses three layers. This is the pattern we use on every project, and it maps directly to how dbt works with Snowflake:
- Staging layer (models/staging/): One model per source table. Renames columns, casts types, filters out test/deleted records. Materialized as views. Examples: stg_orders.sql, stg_customers.sql.
- Intermediate layer (models/intermediate/): Business logic, joins, SCD logic, deduplication. Example: int_customer_scd2.sql handles the Type 2 versioning logic.
- Mart layer (models/marts/): Final dimension and fact tables consumed by BI tools. Materialized as tables. Examples: dim_customer.sql, fct_sales.sql.
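For example, a staging model might look like this (the source name crm and the raw column names are illustrative):
-- models/staging/stg_customers.sql (sketch)
{{ config(materialized='view') }}

SELECT
    id::VARCHAR(50) AS customer_id,
    TRIM(full_name) AS customer_name,
    LOWER(email)    AS email,
    city,
    state,
    _loaded_at      AS loaded_at
FROM {{ source('crm', 'customers') }}
WHERE deleted_at IS NULL            -- drop soft-deleted records
  AND NOT COALESCE(is_test, FALSE)  -- drop test records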
Testing Your Model with dbt
Every star schema should have these dbt tests at minimum:
- Referential integrity: Test that every foreign key in the fact table has a matching row in the dimension table. Use dbt's relationships test.
- Primary key uniqueness: Test that surrogate keys are unique in every dimension table and fact table. Use unique and not_null tests.
- Freshness checks: Configure sources.yml with loaded_at_field and freshness to alert when source data is stale.
- Accepted values: Test that categorical columns (like segment or region) only contain expected values.
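The relationships, unique, not_null, and accepted_values tests are declared in the model's YAML file. The same referential-integrity check can also be written as a singular SQL test, which makes the semantics explicit - a dbt singular test passes when the query returns zero rows (the file name is illustrative):
-- tests/assert_fct_sales_customers_exist.sql (sketch)
-- Every customer_sk in the fact table must resolve to a dimension row.
SELECT f.customer_sk
FROM {{ ref('fct_sales') }} f
LEFT JOIN {{ ref('dim_customer') }} d
    ON f.customer_sk = d.customer_sk
WHERE d.customer_sk IS NULL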
Common Mistakes
- Over-normalization for a columnar warehouse. Snowflake is columnar - it only reads the columns you query. Having a few extra columns in a dimension table costs almost nothing in scan time. Don't create separate dim_city, dim_state, and dim_region tables when those attributes can all live in dim_store. Over-normalization forces unnecessary joins that hurt performance more than the storage savings help.
- Too many small dimension tables. If your dimension has 5 rows (like a status lookup table with "Active", "Inactive", "Pending"), it probably doesn't need its own table. Use a degenerate dimension in the fact table or a simple CASE expression in the dbt model.
- Not defining the grain clearly. If two team members disagree about what one row in the fact table represents, your model is broken. Write the grain statement in the dbt model's YAML description and make it unambiguous.
- Skipping surrogate keys. Natural keys from source systems change format, get recycled, or contain PII. Always use an integer surrogate key (AUTOINCREMENT) for dimension primary keys and fact table foreign keys.
- Ignoring late-arriving dimensions. When a fact arrives referencing a customer_id that doesn't exist in dim_customer yet, what happens? If you don't handle this, you'll get referential integrity failures. Insert a placeholder dimension row (with "Unknown" values) and backfill it later - see the sketch after this list.
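A sketch of that placeholder pattern, assuming incoming facts land in a staging table stg_sales that carries the source customer_id:
-- Insert an "Unknown" row for any customer referenced by incoming facts but
-- missing from the dimension; effective_from, effective_to, and is_current
-- come from the column defaults. Backfill when the real record arrives.
INSERT INTO dim_customer (customer_id, customer_name, city, state, region, segment)
SELECT DISTINCT s.customer_id, 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown'
FROM stg_sales s
LEFT JOIN dim_customer d
    ON d.customer_id = s.customer_id
WHERE d.customer_id IS NULL;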
Key Takeaways
- Star schemas perform well in Snowflake because columnar storage and micro-partition pruning replace the need for indexes.
- Define your fact table grain first - everything else flows from that decision.
- Use SCD Type 1 for most dimensions, Type 2 only where historical tracking matters to the business.
- Organize dbt models into staging (views), intermediate (logic), and mart (tables) layers.
- Test referential integrity, key uniqueness, and data freshness as part of every dbt run.
- Don't over-normalize. Snowflake's columnar engine makes wide tables cheap to scan. Extra joins are expensive.
Frequently Asked Questions
Q: Do star schemas still make sense in Snowflake?
Yes. Snowflake's columnar storage and micro-partition pruning make star schemas perform exceptionally well. Denormalized fact tables with dimension foreign keys allow the query optimizer to prune irrelevant partitions efficiently. Clean dimensional models are also easier for analysts to understand and query.
Q: Does Snowflake need indexes for star schema performance?
No. Snowflake doesn't support traditional indexes. Instead, it uses automatic micro-partition pruning based on column min/max metadata. For very large tables (1TB+), you can add clustering keys on frequently filtered columns, but most star schemas perform well without them.
Q: Should I use SCD Type 1 or Type 2 for dimension tables?
Use Type 1 (overwrite) when you don't need historical tracking and want simplicity. Use Type 2 (versioned rows with effective dates) when analysts need to see what a dimension value was at the time of a transaction. Many teams use Type 1 for most dimensions and Type 2 only for business-critical ones.
Q: How do I handle late-arriving facts in a star schema?
Use MERGE (upsert) with a unique key so late-arriving facts get inserted without creating duplicates. In dbt, use the incremental materialization with a unique_key configuration to handle this pattern automatically.
