Working with JSON, Parquet, and Avro in Snowflake: The VARIANT Column Guide
Last updated: April 2025
Quick answer: Snowflake's VARIANT data type stores JSON, Parquet, and Avro data natively without requiring a predefined schema. You query it with dot notation, bracket notation, and FLATTEN for arrays. Snowflake auto-detects repeated paths and stores them in a columnar-optimized format, so performance is closer to relational queries than you'd expect.
The VARIANT Data Type
VARIANT is Snowflake's universal container for semi-structured data. A single VARIANT column can hold a JSON object, a nested array, or any combination of types. Internally, Snowflake doesn't store it as a raw JSON string. It parses the structure at load time, detects frequently repeated key paths, and stores those paths in a columnar format - the same way it stores relational columns. Less-common paths get stored in a flexible row-based format.
This hybrid approach means that querying a well-structured JSON document's top-level keys is nearly as fast as querying regular columns. Deeply nested, rarely-accessed paths are slower, but they're still queryable without any schema definition upfront.
Loading Semi-Structured Data
The standard approach is COPY INTO with a file format that matches your source files. Here's how to load each format:
Loading JSON
```sql
-- Create a file format for JSON
CREATE OR REPLACE FILE FORMAT json_format
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE
  STRIP_NULL_VALUES = TRUE;

-- Create a table with a VARIANT column
CREATE OR REPLACE TABLE raw.api_responses (
  loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
  payload VARIANT
);

-- Load from a stage
COPY INTO raw.api_responses (payload)
FROM @my_s3_stage/api_data/
FILE_FORMAT = (FORMAT_NAME = 'json_format');
```
STRIP_OUTER_ARRAY = TRUE is important when your JSON file contains an array of objects at the top level. Without it, Snowflake loads the entire array as a single row. With it, each array element becomes its own row.
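The effect is easy to see without staging any files. FLATTEN produces one row per array element, which is the same shape STRIP_OUTER_ARRAY gives you at load time (illustrative query only, using an inline array):

```sql
-- Each element of the outer array becomes its own row,
-- mirroring what STRIP_OUTER_ARRAY = TRUE does during COPY INTO
SELECT value AS row_payload
FROM TABLE(FLATTEN(input => PARSE_JSON('[{"id": 1}, {"id": 2}]')));
-- returns two rows: {"id": 1} and {"id": 2}
```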
Loading Parquet
```sql
CREATE OR REPLACE FILE FORMAT parquet_format
  TYPE = 'PARQUET';

COPY INTO raw.api_responses (payload)
FROM @my_s3_stage/events/
FILE_FORMAT = (FORMAT_NAME = 'parquet_format');
```
Parquet files already carry schema metadata. Snowflake reads it automatically and maps all columns into the VARIANT structure. You can also load Parquet directly into typed columns using INFER_SCHEMA (covered below).
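Once Parquet rows land in the VARIANT column, each Parquet column becomes a key in the object, so the same path syntax applies. A sketch, assuming hypothetical Parquet columns named event_type and event_ts:

```sql
-- event_type and event_ts are assumed column names from the Parquet schema;
-- substitute the columns your files actually carry
SELECT
  payload:event_type::STRING AS event_type,
  payload:event_ts::TIMESTAMP_NTZ AS event_ts
FROM raw.api_responses;
```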
Loading Avro
```sql
CREATE OR REPLACE FILE FORMAT avro_format
  TYPE = 'AVRO';

COPY INTO raw.api_responses (payload)
FROM @my_s3_stage/kafka_events/
FILE_FORMAT = (FORMAT_NAME = 'avro_format');
```
Querying VARIANT Columns
You've got three ways to pull values out of a VARIANT column:
Dot Notation
```sql
SELECT
  payload:order_id::INT AS order_id,
  payload:customer.name::STRING AS customer_name,
  payload:customer.email::STRING AS customer_email,
  payload:total_amount::NUMBER(10,2) AS total
FROM raw.api_responses;
```
Dot notation uses the colon (:) to access the top-level key, then dots for nested keys. The ::TYPE suffix casts the VARIANT value to a native Snowflake type. Without the cast, everything comes back as VARIANT, which breaks comparisons and aggregations.
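You can check what you're working with before casting by inspecting the underlying VARIANT type with TYPEOF - handy when a feed sends numbers as strings in some documents:

```sql
-- TYPEOF reports the type stored inside the VARIANT (e.g. 'INTEGER', 'VARCHAR')
SELECT
  payload:total_amount AS raw_value,            -- still VARIANT
  TYPEOF(payload:total_amount) AS variant_type,
  payload:total_amount::NUMBER(10,2) AS typed_value
FROM raw.api_responses;
```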
Bracket Notation
```sql
-- Useful when key names have special characters or spaces
SELECT
  payload['order-id']::INT AS order_id,
  payload['line items'][0]['product_name']::STRING AS first_product
FROM raw.api_responses;
```
Bracket notation works for keys containing hyphens, spaces, or other characters that dot notation can't handle. It also gives you array index access with [0], [1], etc.
FLATTEN for Arrays
This is where most people hit their first wall. If your JSON contains an array - say, a list of line items in an order - you need FLATTEN to explode it into rows.
```sql
SELECT
  payload:order_id::INT AS order_id,
  item.value:product_id::INT AS product_id,
  item.value:product_name::STRING AS product_name,
  item.value:quantity::INT AS quantity,
  item.value:unit_price::NUMBER(10,2) AS unit_price
FROM raw.api_responses,
  LATERAL FLATTEN(input => payload:line_items) item;
```
LATERAL FLATTEN takes an array-valued expression and produces one row per element. The item.value reference gives you the individual JSON object for each array element. If a row has 5 line items, you get 5 output rows.
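One flag worth knowing: by default, FLATTEN drops rows whose array is missing or empty. Pass OUTER => TRUE to keep them with NULL element values - a sketch against the same line_items array:

```sql
-- OUTER => TRUE keeps orders even when line_items is absent or empty;
-- item.value is NULL for those rows
SELECT
  payload:order_id::INT AS order_id,
  item.value:product_id::INT AS product_id
FROM raw.api_responses,
  LATERAL FLATTEN(input => payload:line_items, OUTER => TRUE) item;
```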
Nested FLATTEN
For deeply nested structures - an array inside an array - you chain FLATTEN calls:
```sql
SELECT
  payload:order_id::INT AS order_id,
  item.value:product_name::STRING AS product,
  tag.value::STRING AS tag
FROM raw.api_responses,
  LATERAL FLATTEN(input => payload:line_items) item,
  LATERAL FLATTEN(input => item.value:tags) tag;
```
Handling Null vs Missing Keys
There's a subtle but important difference. A JSON key with a null value ("email": null) returns a VARIANT null, while a missing key (the key doesn't exist in the document at all) returns SQL NULL. On the raw VARIANT you can still tell them apart - payload:email IS NULL is TRUE only for the missing key, and IS_NULL_VALUE(payload:email) is TRUE only for the JSON null - but once you cast with ::STRING, both collapse to SQL NULL and a simple IS NULL check can no longer distinguish them.
To check if a key actually exists in the document:
```sql
-- This returns TRUE if the key exists, even with a null value
SELECT ARRAY_CONTAINS('email'::VARIANT, OBJECT_KEYS(payload))
FROM raw.api_responses;
```
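IS_NULL_VALUE gives you the complementary check, distinguishing a JSON null from a missing key on the raw VARIANT (note that STRIP_NULL_VALUES = TRUE at load time would have removed the JSON nulls already):

```sql
-- payload:email IS NULL        -> TRUE only when the key is absent (SQL NULL)
-- IS_NULL_VALUE(payload:email) -> TRUE only when the key holds a JSON null
SELECT
  payload:email IS NULL AS key_missing,
  IS_NULL_VALUE(payload:email) AS is_json_null
FROM raw.api_responses;
```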
Schema Detection with INFER_SCHEMA
Instead of guessing what's inside your Parquet or Avro files, use INFER_SCHEMA to have Snowflake read the file metadata and return the column definitions:
```sql
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@my_s3_stage/events/',
    FILE_FORMAT => 'parquet_format'
  )
);

-- Use the result to auto-create a typed table
CREATE OR REPLACE TABLE raw.events
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(INFER_SCHEMA(
    LOCATION => '@my_s3_stage/events/',
    FILE_FORMAT => 'parquet_format'
  ))
);
```
This is particularly valuable for Parquet and Avro files that carry embedded schemas. For JSON, INFER_SCHEMA samples the files and infers types from the data - useful but less reliable if your JSON has inconsistent types across documents.
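After creating the typed table with USING TEMPLATE, you can load it directly by matching Parquet column names to the generated table columns - a sketch reusing the same stage and file format:

```sql
-- MATCH_BY_COLUMN_NAME maps Parquet columns to the typed columns
-- that INFER_SCHEMA generated, regardless of column order
COPY INTO raw.events
FROM @my_s3_stage/events/
FILE_FORMAT = (FORMAT_NAME = 'parquet_format')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```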
Creating Views Over Semi-Structured Data
Analysts shouldn't need to write dot notation and FLATTEN queries. Build views that present VARIANT data as clean relational columns:
```sql
CREATE OR REPLACE VIEW analytics.v_orders AS
SELECT
  payload:order_id::INT AS order_id,
  payload:customer.name::STRING AS customer_name,
  payload:order_date::DATE AS order_date,
  payload:total_amount::NUMBER(10,2) AS total_amount,
  payload:status::STRING AS status
FROM raw.api_responses
WHERE payload:order_id IS NOT NULL;
```
Analysts query the view like any other table. Schema changes in the JSON don't break the view - new fields simply aren't exposed until you update the view definition. Removed fields return null.
When to Use VARIANT vs Flattened Relational Columns
| Use VARIANT when... | Use relational columns when... |
|---|---|
| Schema changes frequently (new fields arrive weekly) | Schema is stable and well-defined |
| You need to preserve the original document exactly | Query performance on specific fields is critical |
| Prototyping or exploring new data sources | Analysts need simple, familiar SQL access |
| Documents have highly variable structures | You need to enforce data types and constraints |
A common hybrid approach: land data into a VARIANT column for raw storage, then create a downstream table or view with explicit relational columns for analytics. You get the flexibility of schema-on-read and the performance of schema-on-write.
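A minimal sketch of that hybrid pattern, reusing the raw landing table from earlier (the column names follow the examples above; adapt them to your payload):

```sql
-- Downstream typed table built from the VARIANT landing zone;
-- keep the original document alongside the extracted columns
CREATE OR REPLACE TABLE analytics.orders AS
SELECT
  payload:order_id::INT AS order_id,
  payload:customer.name::STRING AS customer_name,
  payload:total_amount::NUMBER(10,2) AS total_amount,
  payload AS raw_payload
FROM raw.api_responses
WHERE payload:order_id IS NOT NULL;
```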
Performance Considerations and Gotchas
- 16 MB per-row limit: A single VARIANT value can't exceed 16 MB compressed. If your API returns a massive JSON array in one response, use STRIP_OUTER_ARRAY to split it into individual rows at load time.
- Always cast VARIANT values: Comparison operators on raw VARIANT values produce unexpected results. WHERE payload:amount > 100 compares under VARIANT ordering rules, not numerically. Use WHERE payload:amount::NUMBER > 100.
- Deep nesting is slow: Accessing payload:a.b.c.d.e.f six levels deep works, but forces Snowflake to traverse the structure for every row. For frequently queried deep paths, extract them into materialized columns.
- Clustering on VARIANT paths: You can cluster a table on a VARIANT expression like payload:customer_id::INT. This dramatically improves query performance when you filter on that path frequently.
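The clustering idea looks like this in practice (a sketch; check SYSTEM$CLUSTERING_INFORMATION before and after on a real workload to confirm it pays off):

```sql
-- Cluster the landing table on the customer_id path so partition pruning
-- kicks in for filters like WHERE payload:customer_id::INT = 42
ALTER TABLE raw.api_responses
  CLUSTER BY (payload:customer_id::INT);
```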
Key Takeaways
- VARIANT stores JSON, Parquet, and Avro data natively. Snowflake optimizes frequently-accessed paths into columnar storage automatically.
- Use dot notation for simple access, bracket notation for special characters, and LATERAL FLATTEN for arrays.
- Always cast VARIANT values to native types (::STRING, ::NUMBER, ::BOOLEAN) before filtering or aggregating.
- INFER_SCHEMA reads file metadata to auto-generate table definitions - invaluable for Parquet and Avro.
- Build views over VARIANT columns to give analysts clean relational access without requiring them to learn semi-structured syntax.
- Watch the 16 MB per-row limit, and remember that null values and missing keys are indistinguishable without explicit key existence checks.
Frequently Asked Questions
Q: What is the VARIANT data type in Snowflake?
VARIANT is a universal semi-structured data type that can hold JSON, Avro, ORC, Parquet, or XML data natively. Snowflake stores VARIANT data in an optimized columnar format, automatically detecting repeated paths for efficient query performance.
Q: What is the maximum size of a VARIANT value in Snowflake?
A single VARIANT value can hold up to 16 MB of compressed data per row. If your JSON documents exceed this limit, you need to split them before loading or use STRIP_OUTER_ARRAY to load array elements as individual rows.
Q: When should I use VARIANT vs flattened relational columns?
Use VARIANT when your schema changes frequently, when you need to preserve the original document structure, or when you're prototyping. Use flattened relational columns when query performance matters, when analysts need simple SQL access, or when the schema is stable and well-understood.
