Star Schema in Snowflake: Why It Still Works and How to Build One Right

Celestinfo Software Solutions Pvt. Ltd. Jan 16, 2025

Last updated: February 2025

Quick answer: Star schemas work exceptionally well in Snowflake. The columnar storage and micro-partition pruning make denormalized dimension joins fast without indexes. Define your fact table grain first, build dimension tables with surrogate keys, use MERGE for SCD Type 2 updates, and organize it all in dbt with staging/intermediate/mart layers.

Star Schema Isn't Dead. It's the Default.

Every few years, someone declares dimensional modeling dead. First it was "schema-on-read will replace everything." Then it was "just dump it in the data lake." Neither worked out. Analysts still need to understand the data, and nothing beats a well-designed star schema for that. Snowflake's architecture actually makes star schemas perform better than they did in traditional row-based databases, because columnar storage and automatic micro-partition pruning do what indexes and materialized views used to do - without the maintenance overhead.


Why Star Schemas Work in Snowflake


Snowflake stores data in micro-partitions (50-500 MB of uncompressed data each, stored in columnar form). When you run a query that filters on WHERE order_date BETWEEN '2026-01-01' AND '2026-01-31', Snowflake checks the min/max metadata for the order_date column in each micro-partition and skips any partition that doesn't contain January 2026 data. This is automatic - no indexes to create, no VACUUM to run.


In a star schema, your fact table is naturally ordered by time (because transactions happen in time order). This means the order_date column in your fact table is naturally clustered within micro-partitions, giving you excellent pruning on date-range queries without any explicit clustering keys. Dimension tables are small enough (usually under 1M rows) that Snowflake broadcast-joins them efficiently.
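As an illustration, here is a typical star-join query against the sales schema built later in this article. The date predicate drives micro-partition pruning on the fact table, and the small dimensions are joined cheaply:

```sql
-- Monthly revenue by product category; the date filter enables partition pruning
SELECT
    d.month_name,
    p.category,
    SUM(f.line_total) AS revenue
FROM fct_sales AS f
JOIN dim_date AS d    ON f.date_sk = d.date_sk
JOIN dim_product AS p ON f.product_sk = p.product_sk
WHERE d.full_date BETWEEN '2026-01-01' AND '2026-01-31'
GROUP BY d.month_name, p.category
ORDER BY revenue DESC;
```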


Designing Dimension Tables


SCD Type 1: Overwrite


Type 1 is simple: when a dimension attribute changes, you overwrite the old value. The customer's address changed? Update the row. You lose history, but you keep things simple. Use Type 1 for attributes where historical tracking doesn't matter (email addresses, phone numbers, default settings).


SQL - SCD Type 1 with MERGE
MERGE INTO dim_customer AS target
USING stg_customer AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    customer_name = source.customer_name,
    email = source.email,
    address = source.address,
    updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (
    customer_sk, customer_id, customer_name, email, address, created_at, updated_at
) VALUES (
    dim_customer_seq.NEXTVAL, source.customer_id, source.customer_name,
    source.email, source.address, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP()
);

SCD Type 2: Versioned Rows


Type 2 keeps history by creating a new row for each change, with effective_from, effective_to, and is_current columns. When a customer moves from New York to Chicago, the old row gets effective_to = '2026-02-26' and is_current = FALSE, and a new row is inserted with effective_from = '2026-02-27' and is_current = TRUE. This lets analysts answer questions like "what was the customer's region when they placed order #12345?"
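A sketch of the Type 2 pattern against the dim_customer table defined later in this article. A single MERGE can't both close an old row and insert its replacement for the same source row, so a common approach is two statements; the staging table stg_customer and the choice of tracked columns (city, region) are assumptions:

```sql
-- Step 1: close out current rows whose tracked attributes changed
MERGE INTO dim_customer AS target
USING stg_customer AS source
   ON target.customer_id = source.customer_id
  AND target.is_current = TRUE
WHEN MATCHED AND (target.city <> source.city OR target.region <> source.region)
THEN UPDATE SET
    effective_to = CURRENT_TIMESTAMP(),
    is_current   = FALSE;

-- Step 2: insert a fresh current row for new customers and for those just closed out
INSERT INTO dim_customer
    (customer_id, customer_name, email, city, state, region, segment,
     effective_from, effective_to, is_current)
SELECT s.customer_id, s.customer_name, s.email, s.city, s.state, s.region, s.segment,
       CURRENT_TIMESTAMP(), '9999-12-31', TRUE
FROM stg_customer AS s
LEFT JOIN dim_customer AS d
       ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;  -- no current row remains: customer is new or was just versioned
```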


Designing Fact Tables


Define the Grain First


The grain is the most important decision in your fact table design. It defines what one row represents. For a sales analytics star schema, the grain might be "one line item per order per product." Writing this down explicitly prevents scope creep and keeps the table useful.


Wrong grain: "One row per order." This hides product-level detail and forces aggregation at query time. Right grain: "One row per order line item." This gives analysts the flexibility to aggregate by product, category, customer, or any combination.
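One payoff of the line-item grain: order-level numbers fall out of a simple GROUP BY rather than being baked into the table. A sketch against the fct_sales table defined later in this article:

```sql
-- Roll line items up to order level on demand
SELECT
    order_number,
    SUM(quantity)   AS units,
    SUM(line_total) AS order_total
FROM fct_sales
GROUP BY order_number;
```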


Handling Late-Arriving Facts


Facts don't always arrive on time. A sales transaction from yesterday might not hit your pipeline until tomorrow. Use MERGE with a unique key (typically the source system's transaction ID) so late-arriving facts get inserted without creating duplicates. In dbt, this is the incremental materialization with unique_key.
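In dbt terms, a sketch of that incremental pattern; the order_line_id key and the stg_sales staging model are assumptions:

```sql
-- models/marts/fct_sales.sql
{{ config(materialized='incremental', unique_key='order_line_id') }}

SELECT
    order_line_id,      -- the source system's transaction ID: the dedupe key
    order_number,
    quantity,
    unit_price,
    line_total,
    loaded_at
FROM {{ ref('stg_sales') }}
{% if is_incremental() %}
-- Only scan rows that arrived since the last run; the unique_key upserts
-- late arrivals without duplicating earlier loads
WHERE loaded_at > (SELECT MAX(loaded_at) FROM {{ this }})
{% endif %}
```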


Degenerate Dimensions


Some dimension attributes live directly in the fact table because they don't justify their own dimension table. Order number is the classic example - it's a dimension (you filter and group by it), but there are no additional attributes to store in a separate table. Keep it in the fact table as a degenerate dimension.


Concrete Example: Sales Analytics Star Schema


Here's a complete example with customer, product, date, and store dimensions:


SQL - Dimension tables
CREATE TABLE dim_date (
    date_sk         INT PRIMARY KEY,
    full_date       DATE NOT NULL,
    year            INT,
    quarter         INT,
    month           INT,
    month_name      VARCHAR(20),
    week_of_year    INT,
    day_of_week     INT,
    day_name        VARCHAR(20),
    is_weekend      BOOLEAN,
    is_holiday      BOOLEAN
);

CREATE TABLE dim_customer (
    customer_sk     INT AUTOINCREMENT PRIMARY KEY,
    customer_id     VARCHAR(50) NOT NULL,
    customer_name   VARCHAR(200),
    email           VARCHAR(200),
    city            VARCHAR(100),
    state           VARCHAR(50),
    region          VARCHAR(50),
    segment         VARCHAR(50),
    effective_from  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    effective_to    TIMESTAMP_NTZ DEFAULT '9999-12-31',
    is_current      BOOLEAN DEFAULT TRUE
);

CREATE TABLE dim_product (
    product_sk      INT AUTOINCREMENT PRIMARY KEY,
    product_id      VARCHAR(50) NOT NULL,
    product_name    VARCHAR(300),
    category        VARCHAR(100),
    subcategory     VARCHAR(100),
    brand           VARCHAR(100),
    unit_cost       DECIMAL(10,2),
    is_current      BOOLEAN DEFAULT TRUE
);

CREATE TABLE dim_store (
    store_sk        INT AUTOINCREMENT PRIMARY KEY,
    store_id        VARCHAR(50) NOT NULL,
    store_name      VARCHAR(200),
    city            VARCHAR(100),
    state           VARCHAR(50),
    region          VARCHAR(50),
    store_type      VARCHAR(50),
    open_date       DATE
);

SQL - Fact table
CREATE TABLE fct_sales (
    sale_sk         INT AUTOINCREMENT PRIMARY KEY,
    -- Foreign keys to dimensions
    date_sk         INT NOT NULL REFERENCES dim_date(date_sk),
    customer_sk     INT NOT NULL REFERENCES dim_customer(customer_sk),
    product_sk      INT NOT NULL REFERENCES dim_product(product_sk),
    store_sk        INT NOT NULL REFERENCES dim_store(store_sk),
    -- Degenerate dimension
    order_number    VARCHAR(50),
    -- Measures
    quantity        INT,
    unit_price      DECIMAL(10,2),
    discount_pct    DECIMAL(5,2),
    line_total      DECIMAL(12,2),
    -- Metadata
    loaded_at       TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

Building with dbt: The Layer Pattern


The standard dbt project structure for a star schema uses three layers. This is the pattern we use on every project, and it maps directly to how dbt works with Snowflake:

- Staging (stg_): one model per source table. Rename columns, cast types, and deduplicate - nothing else. Usually materialized as views.
- Intermediate (int_): joins and business logic that prepare data for the marts, such as SCD change detection.
- Marts (dim_/fct_): the final dimension and fact tables analysts query, materialized as tables or incremental models.
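A minimal staging model, as a sketch; the source names (pos, sales_transactions) and raw column names are hypothetical:

```sql
-- models/staging/stg_sales.sql: rename, cast, light cleaning; no joins or business logic
SELECT
    src.transaction_id                   AS order_line_id,
    src.order_no                         AS order_number,
    CAST(src.qty AS INT)                 AS quantity,
    CAST(src.price AS DECIMAL(10,2))     AS unit_price,
    CAST(src.sold_at AS TIMESTAMP_NTZ)   AS loaded_at
FROM {{ source('pos', 'sales_transactions') }} AS src
```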



Testing Your Model with dbt


Every star schema should have these dbt tests at minimum:

- unique and not_null on every surrogate key (customer_sk, product_sk, and so on).
- not_null on each foreign key column in the fact table.
- relationships from each fact foreign key to its dimension, so orphaned facts are caught on every run.
- accepted_values on low-cardinality attributes such as segment or store_type.
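dbt's generic tests cover keys and relationships; a singular test can also assert referential integrity directly. A sketch, where the file name is hypothetical:

```sql
-- tests/fct_sales_no_orphan_customers.sql
-- dbt fails the test if this query returns any rows
SELECT f.sale_sk
FROM {{ ref('fct_sales') }} AS f
LEFT JOIN {{ ref('dim_customer') }} AS d
       ON f.customer_sk = d.customer_sk
WHERE d.customer_sk IS NULL
```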



Common Mistakes

- Choosing too coarse a grain ("one row per order") and losing line-item detail.
- Applying SCD Type 2 to every dimension when most attributes only need Type 1.
- Adding clustering keys before measuring pruning; most star schemas don't need them.
- Loading facts with plain INSERTs, so late-arriving records create duplicates.



Key Takeaways

- Snowflake's columnar micro-partitions make star schemas fast without indexes or manual tuning.
- Define the fact table grain first and write it down; everything else follows from it.
- Use surrogate keys on dimensions, and reserve SCD Type 2 for dimensions that truly need history.
- Load facts with MERGE (or dbt incremental models with a unique_key) to absorb late arrivals.
- Organize the build as staging, intermediate, and mart layers in dbt, with tests on every key.


Frequently Asked Questions

Q: Do star schemas still make sense in Snowflake?

Yes. Snowflake's columnar storage and micro-partition pruning make star schemas perform exceptionally well. Denormalized fact tables with dimension foreign keys allow the query optimizer to prune irrelevant partitions efficiently. Clean dimensional models are also easier for analysts to understand and query.

Q: Does Snowflake need indexes for star schema performance?

No. Snowflake doesn't support traditional indexes. Instead, it uses automatic micro-partition pruning based on column min/max metadata. For very large tables (1TB+), you can add clustering keys on frequently filtered columns, but most star schemas perform well without them.

Q: Should I use SCD Type 1 or Type 2 for dimension tables?

Use Type 1 (overwrite) when you don't need historical tracking and want simplicity. Use Type 2 (versioned rows with effective dates) when analysts need to see what a dimension value was at the time of a transaction. Many teams use Type 1 for most dimensions and Type 2 only for business-critical ones.

Q: How do I handle late-arriving facts in a star schema?

Use MERGE (upsert) with a unique key so late-arriving facts get inserted without creating duplicates. In dbt, use the incremental materialization with a unique_key configuration to handle this pattern automatically.

Pranay Vatsal, Founder & CEO

Pranay Vatsal is the Founder & CEO of CelestInfo with deep expertise in Snowflake, data architecture, and building production-grade data systems for global enterprises.
