Working with JSON, Parquet, and Avro in Snowflake: The VARIANT Column Guide
Last updated: April 2025
Quick answer: Snowflake's VARIANT data type stores JSON, Parquet, and Avro data natively without requiring a predefined schema. You query it with dot notation, bracket notation, and FLATTEN for arrays. Snowflake auto-detects repeated paths and stores them in a columnar-optimized format, so performance is closer to relational queries than you'd expect.
The VARIANT Data Type
VARIANT is Snowflake's universal container for semi-structured data. A single VARIANT column can hold a JSON object, a nested array, or any combination of types. Internally, Snowflake doesn't store it as a raw JSON string. It parses the structure at load time, detects frequently repeated key paths, and stores those paths in a columnar format - the same way it stores relational columns. Less-common paths get stored in a flexible row-based format.
This hybrid approach means that querying a well-structured JSON document's top-level keys is nearly as fast as querying regular columns. Deeply nested, rarely-accessed paths are slower, but they're still queryable without any schema definition upfront.
Loading Semi-Structured Data
The standard approach is COPY INTO with a file format that matches your source files. Here's how to load each format:
Loading JSON
```sql
-- Create a file format for JSON
CREATE OR REPLACE FILE FORMAT json_format
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE
  STRIP_NULL_VALUES = TRUE;

-- Create a table with a VARIANT column
CREATE OR REPLACE TABLE raw.api_responses (
  loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
  payload VARIANT
);

-- Load from a stage
COPY INTO raw.api_responses (payload)
FROM @my_s3_stage/api_data/
FILE_FORMAT = (FORMAT_NAME = 'json_format');
```
STRIP_OUTER_ARRAY = TRUE is important when your JSON file contains an array of objects at the top level. Without it, Snowflake loads the entire array as a single row. With it, each array element becomes its own row.
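The effect is easy to see without staging any files. FLATTEN produces one row per array element, which is the same shape STRIP_OUTER_ARRAY gives you at load time (illustrative query only, using an inline array):

```sql
-- Each element of the outer array becomes its own row,
-- mirroring what STRIP_OUTER_ARRAY = TRUE does during COPY INTO
SELECT value AS row_payload
FROM TABLE(FLATTEN(input => PARSE_JSON('[{"id": 1}, {"id": 2}]')));
-- returns two rows: {"id": 1} and {"id": 2}
```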
Loading Parquet
```sql
CREATE OR REPLACE FILE FORMAT parquet_format
  TYPE = 'PARQUET';

COPY INTO raw.api_responses (payload)
FROM @my_s3_stage/events/
FILE_FORMAT = (FORMAT_NAME = 'parquet_format');
```
Parquet files already carry schema metadata. Snowflake reads it automatically and maps all columns into the VARIANT structure. You can also load Parquet directly into typed columns using INFER_SCHEMA (covered below).
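Once Parquet rows land in the VARIANT column, each Parquet column becomes a key in the object, so the same path syntax applies. A sketch, assuming hypothetical Parquet columns named event_type and event_ts:

```sql
-- event_type and event_ts are assumed column names from the Parquet schema;
-- substitute the columns your files actually carry
SELECT
  payload:event_type::STRING AS event_type,
  payload:event_ts::TIMESTAMP_NTZ AS event_ts
FROM raw.api_responses;
```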
Loading Avro
```sql
CREATE OR REPLACE FILE FORMAT avro_format
  TYPE = 'AVRO';

COPY INTO raw.api_responses (payload)
FROM @my_s3_stage/kafka_events/
FILE_FORMAT = (FORMAT_NAME = 'avro_format');
```
Querying VARIANT Columns
You've got three ways to pull values out of a VARIANT column:
Dot Notation
```sql
SELECT
  payload:order_id::INT AS order_id,
  payload:customer.name::STRING AS customer_name,
  payload:customer.email::STRING AS customer_email,
  payload:total_amount::NUMBER(10,2) AS total
FROM raw.api_responses;
```
Dot notation uses the colon (:) to access the top-level key, then dots for nested keys. The ::TYPE suffix casts the VARIANT value to a native Snowflake type. Without the cast, everything comes back as VARIANT, which breaks comparisons and aggregations.
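You can check what you're working with before casting by inspecting the underlying VARIANT type with TYPEOF - handy when a feed sends numbers as strings in some documents:

```sql
-- TYPEOF reports the type stored inside the VARIANT (e.g. 'INTEGER', 'VARCHAR')
SELECT
  payload:total_amount AS raw_value,            -- still VARIANT
  TYPEOF(payload:total_amount) AS variant_type,
  payload:total_amount::NUMBER(10,2) AS typed_value
FROM raw.api_responses;
```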
Bracket Notation
```sql
-- Useful when key names have special characters or spaces
SELECT
  payload['order-id']::INT AS order_id,
  payload['line items'][0]['product_name']::STRING AS first_product
FROM raw.api_responses;
```
Bracket notation works for keys containing hyphens, spaces, or other characters that dot notation can't handle. It also gives you array index access with [0], [1], etc.
FLATTEN for Arrays
This is where most people hit their first wall. If your JSON contains an array - say, a list of line items in an order - you need FLATTEN to explode it into rows.
```sql
SELECT
  payload:order_id::INT AS order_id,
  item.value:product_id::INT AS product_id,
  item.value:product_name::STRING AS product_name,
  item.value:quantity::INT AS quantity,
  item.value:unit_price::NUMBER(10,2) AS unit_price
FROM raw.api_responses,
  LATERAL FLATTEN(input => payload:line_items) item;
```
LATERAL FLATTEN takes an array-valued expression and produces one row per element. The item.value reference gives you the individual JSON object for each array element. If a row has 5 line items, you get 5 output rows.
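One flag worth knowing: by default, FLATTEN drops rows whose array is missing or empty. Pass OUTER => TRUE to keep them with NULL element values - a sketch against the same line_items array:

```sql
-- OUTER => TRUE keeps orders even when line_items is absent or empty;
-- item.value is NULL for those rows
SELECT
  payload:order_id::INT AS order_id,
  item.value:product_id::INT AS product_id
FROM raw.api_responses,
  LATERAL FLATTEN(input => payload:line_items, OUTER => TRUE) item;
```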
Nested FLATTEN
For deeply nested structures - an array inside an array - you chain FLATTEN calls:
```sql
SELECT
  payload:order_id::INT AS order_id,
  item.value:product_name::STRING AS product,
  tag.value::STRING AS tag
FROM raw.api_responses,
  LATERAL FLATTEN(input => payload:line_items) item,
  LATERAL FLATTEN(input => item.value:tags) tag;
```
Handling Null vs Missing Keys
There's a subtle but important difference. A JSON key with a null value ("email": null) returns a VARIANT null, while a missing key (the key doesn't exist in the document at all) returns SQL NULL. On the raw VARIANT you can still tell them apart - payload:email IS NULL is TRUE only for the missing key, and IS_NULL_VALUE(payload:email) is TRUE only for the JSON null - but once you cast with ::STRING, both collapse to SQL NULL and a simple IS NULL check can no longer distinguish them.
To check if a key actually exists in the document:
```sql
-- This returns TRUE if the key exists, even with a null value
SELECT ARRAY_CONTAINS('email'::VARIANT, OBJECT_KEYS(payload))
FROM raw.api_responses;
```
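IS_NULL_VALUE gives you the complementary check, distinguishing a JSON null from a missing key on the raw VARIANT (note that STRIP_NULL_VALUES = TRUE at load time would have removed the JSON nulls already):

```sql
-- payload:email IS NULL        -> TRUE only when the key is absent (SQL NULL)
-- IS_NULL_VALUE(payload:email) -> TRUE only when the key holds a JSON null
SELECT
  payload:email IS NULL AS key_missing,
  IS_NULL_VALUE(payload:email) AS is_json_null
FROM raw.api_responses;
```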
Schema Detection with INFER_SCHEMA
Instead of guessing what's inside your Parquet or Avro files, use INFER_SCHEMA to have Snowflake read the file metadata and return the column definitions:
```sql
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@my_s3_stage/events/',
    FILE_FORMAT => 'parquet_format'
  )
);

-- Use the result to auto-create a typed table
CREATE OR REPLACE TABLE raw.events
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(INFER_SCHEMA(
    LOCATION => '@my_s3_stage/events/',
    FILE_FORMAT => 'parquet_format'
  ))
);
```
This is particularly valuable for Parquet and Avro files that carry embedded schemas. For JSON, INFER_SCHEMA samples the files and infers types from the data - useful but less reliable if your JSON has inconsistent types across documents.
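After creating the typed table with USING TEMPLATE, you can load it directly by matching Parquet column names to the generated table columns - a sketch reusing the same stage and file format:

```sql
-- MATCH_BY_COLUMN_NAME maps Parquet columns to the typed columns
-- that INFER_SCHEMA generated, regardless of column order
COPY INTO raw.events
FROM @my_s3_stage/events/
FILE_FORMAT = (FORMAT_NAME = 'parquet_format')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```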
Creating Views Over Semi-Structured Data
Analysts shouldn't need to write dot notation and FLATTEN queries. Build views that present VARIANT data as clean relational columns:
```sql
CREATE OR REPLACE VIEW analytics.v_orders AS
SELECT
  payload:order_id::INT AS order_id,
  payload:customer.name::STRING AS customer_name,
  payload:order_date::DATE AS order_date,
  payload:total_amount::NUMBER(10,2) AS total_amount,
  payload:status::STRING AS status
FROM raw.api_responses
WHERE payload:order_id IS NOT NULL;
```
Analysts query the view like any other table. Schema changes in the JSON don't break the view - new fields simply aren't exposed until you update the view definition. Removed fields return null.
When to Use VARIANT vs Flattened Relational Columns
| Use VARIANT when... | Use relational columns when... |
|---|---|
| Schema changes frequently (new fields arrive weekly) | Schema is stable and well-defined |
| You need to preserve the original document exactly | Query performance on specific fields is critical |
| Prototyping or exploring new data sources | Analysts need simple, familiar SQL access |
| Documents have highly variable structures | You need to enforce data types and constraints |
A common hybrid approach: land data into a VARIANT column for raw storage, then create a downstream table or view with explicit relational columns for analytics. You get the flexibility of schema-on-read and the performance of schema-on-write.
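A minimal sketch of that hybrid pattern, reusing the raw landing table from earlier (the column names follow the examples above; adapt them to your payload):

```sql
-- Downstream typed table built from the VARIANT landing zone;
-- keep the original document alongside the extracted columns
CREATE OR REPLACE TABLE analytics.orders AS
SELECT
  payload:order_id::INT AS order_id,
  payload:customer.name::STRING AS customer_name,
  payload:total_amount::NUMBER(10,2) AS total_amount,
  payload AS raw_payload
FROM raw.api_responses
WHERE payload:order_id IS NOT NULL;
```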
Performance Considerations and Gotchas
- 16 MB per-row limit: A single VARIANT value can't exceed 16 MB compressed. If your API returns a massive JSON array in one response, use STRIP_OUTER_ARRAY to split it into individual rows at load time.
- Always cast VARIANT values: Comparison operators on raw VARIANT values produce unexpected results. WHERE payload:amount > 100 compares under VARIANT ordering rules, not numerically. Use WHERE payload:amount::NUMBER > 100.
- Deep nesting is slow: Accessing payload:a.b.c.d.e.f six levels deep works, but forces Snowflake to traverse the structure for every row. For frequently queried deep paths, extract them into materialized columns.
- Clustering on VARIANT paths: You can cluster a table on a VARIANT expression like payload:customer_id::INT. This dramatically improves query performance when you filter on that path frequently.
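The clustering idea looks like this in practice (a sketch; check SYSTEM$CLUSTERING_INFORMATION before and after on a real workload to confirm it pays off):

```sql
-- Cluster the landing table on the customer_id path so partition pruning
-- kicks in for filters like WHERE payload:customer_id::INT = 42
ALTER TABLE raw.api_responses
  CLUSTER BY (payload:customer_id::INT);
```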
Key Takeaways
- VARIANT stores JSON, Parquet, and Avro data natively. Snowflake optimizes frequently-accessed paths into columnar storage automatically.
- Use dot notation for simple access, bracket notation for special characters, and LATERAL FLATTEN for arrays.
- Always cast VARIANT values to native types (::STRING, ::NUMBER, ::BOOLEAN) before filtering or aggregating.
- INFER_SCHEMA reads file metadata to auto-generate table definitions - invaluable for Parquet and Avro.
- Build views over VARIANT columns to give analysts clean relational access without requiring them to learn semi-structured syntax.
- Watch the 16 MB per-row limit, and remember that null values and missing keys are indistinguishable without explicit key existence checks.
Frequently Asked Questions
Q: What is the VARIANT data type in Snowflake?
VARIANT is a universal semi-structured data type that can hold JSON, Avro, ORC, Parquet, or XML data natively. Snowflake stores VARIANT data in an optimized columnar format, automatically detecting repeated paths for efficient query performance.
Q: What is the maximum size of a VARIANT value in Snowflake?
A single VARIANT value can hold up to 16 MB of compressed data per row. If your JSON documents exceed this limit, you need to split them before loading or use STRIP_OUTER_ARRAY to load array elements as individual rows.
Q: When should I use VARIANT vs flattened relational columns?
Use VARIANT when your schema changes frequently, when you need to preserve the original document structure, or when you're prototyping. Use flattened relational columns when query performance matters, when analysts need simple SQL access, or when the schema is stable and well-understood.
