Snowflake External Tables: Query Your Data Lake Without Loading It
Last updated: April 2025
Quick answer: External tables let Snowflake query files in S3, GCS, or ADLS without copying them into Snowflake storage. You define a metadata layer over the external files, and Snowflake reads them at query time. Queries typically run 2-5x slower than on native tables, but you avoid data duplication and Snowflake storage costs for rarely-accessed datasets.
What External Tables Are
A regular Snowflake table stores data in Snowflake-managed micro-partitions. An external table doesn't store any data. It's a metadata layer that tells Snowflake where to find files in an external cloud storage location (S3 bucket, GCS bucket, or ADLS container), what format they're in, and how to parse them.
When you query an external table, Snowflake reads the files from the external storage at query time. The data never gets copied into Snowflake. This makes external tables ideal for data lake architectures where data needs to be accessible from multiple compute engines - Snowflake, Databricks, Spark, Presto - without duplicating it for each one.
Setting Up: Storage Integration and External Stage
Before creating an external table, you need two things: a storage integration (for authentication) and an external stage (pointing to the file location). Here's the S3 setup:
-- Create a storage integration (ACCOUNTADMIN required)
CREATE OR REPLACE STORAGE INTEGRATION s3_data_lake_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-access'
ENABLED = TRUE
STORAGE_ALLOWED_LOCATIONS = ('s3://my-data-lake/events/');
-- Get the AWS IAM info to configure the trust policy
DESC STORAGE INTEGRATION s3_data_lake_int;
-- Create an external stage pointing at the bucket path
CREATE OR REPLACE STAGE ext_events_stage
URL = 's3://my-data-lake/events/'
STORAGE_INTEGRATION = s3_data_lake_int
FILE_FORMAT = (TYPE = 'PARQUET');
Creating an External Table with Partitioning
The key to usable external table performance is partitioning. If your S3 files are organized by date (e.g., s3://bucket/events/year=2026/month=02/day=27/), you can define partition columns that let Snowflake prune files at query time instead of scanning everything.
CREATE OR REPLACE EXTERNAL TABLE analytics.ext_events (
-- Partition column derived from the file path. Assumes Hive-style paths
-- like year=2026/month=02/day=27/file.parquet relative to the stage URL;
-- the inner SPLIT_PART strips the "year="/"month="/"day=" key prefixes.
event_date DATE AS (
  TO_DATE(
    SPLIT_PART(SPLIT_PART(metadata$filename, '/', 1), '=', 2) || '-' ||
    SPLIT_PART(SPLIT_PART(metadata$filename, '/', 2), '=', 2) || '-' ||
    SPLIT_PART(SPLIT_PART(metadata$filename, '/', 3), '=', 2),
    'YYYY-MM-DD'
  )
),
-- Data columns from Parquet files
event_id VARCHAR AS (value:event_id::VARCHAR),
user_id INT AS (value:user_id::INT),
event_type VARCHAR AS (value:event_type::VARCHAR),
properties VARIANT AS (value:properties::VARIANT),
event_timestamp TIMESTAMP_NTZ AS (value:event_timestamp::TIMESTAMP_NTZ)
)
PARTITION BY (event_date)
LOCATION = @ext_events_stage
AUTO_REFRESH = TRUE
FILE_FORMAT = (TYPE = 'PARQUET');
The PARTITION BY (event_date) clause tells Snowflake to use the event_date column for partition pruning. A query with WHERE event_date = '2026-02-27' only reads files from that day's folder, skipping everything else.
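For example, a rollup over the table defined above that benefits from pruning (illustrative query; table and column names come from the earlier definition):

```sql
-- Only files under year=2026/month=02/day=27/ are read
SELECT event_type, COUNT(*) AS events
FROM analytics.ext_events
WHERE event_date = '2026-02-27'
GROUP BY event_type;
```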
Auto-Refresh: Keeping Metadata Current
With AUTO_REFRESH = TRUE, Snowflake uses cloud event notifications to detect new files:
- S3: SQS (Simple Queue Service) notifications triggered by new object creation.
- GCS: Pub/Sub notifications.
- ADLS: Azure Event Grid notifications.
When a new file lands in the S3 bucket, Snowflake picks up the SQS event, updates the external table's metadata, and the file becomes queryable. This process is eventually consistent - expect a 1-2 minute delay between file arrival and query availability.
For the S3 setup, Snowflake provides the SQS queue ARN after you create the external table. You configure the S3 bucket's event notification to send to that queue:
-- Get the notification channel details
DESC EXTERNAL TABLE analytics.ext_events;
If auto-refresh isn't configured, you'll need to manually refresh the metadata:
ALTER EXTERNAL TABLE analytics.ext_events REFRESH;
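To verify which files the external table's metadata currently includes, you can query the registered file list (a sketch using the INFORMATION_SCHEMA table function; exact output columns may vary by Snowflake version):

```sql
-- List the files currently registered in the external table's metadata
SELECT *
FROM TABLE(
  INFORMATION_SCHEMA.EXTERNAL_TABLE_FILES(TABLE_NAME => 'analytics.ext_events')
);
```

If a file that landed in S3 is missing from this list, the event notification pipeline is the first place to look.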
Performance: External vs Native Tables
External tables are fundamentally slower than native Snowflake tables. Here's why:
- No micro-partition optimization: Snowflake can't apply its columnar compression, min/max pruning, or clustering optimization to external files.
- Network access on every query: External table queries read data over the network from S3/GCS/ADLS each time. Native tables benefit from the warehouse's local SSD cache on repeated scans.
- No result caching: Snowflake's result cache doesn't apply to external table queries.
In practice, expect external table queries to run 2-5x slower than the same query on a native table. Parquet files perform better than CSV because Snowflake can push down column pruning into the file format. Properly partitioned external tables with narrow date-filtered queries can approach native performance.
Materialized Views Over External Tables
If you need better query performance on frequently-accessed external data, create a materialized view over the external table. Two definition limitations to keep in mind: materialized views can't reference non-deterministic functions such as CURRENT_DATE(), so rolling date filters belong in the queries against the view rather than in its definition, and DISTINCT can't be combined with aggregate functions, so APPROX_COUNT_DISTINCT is the usual substitute for COUNT(DISTINCT):
CREATE OR REPLACE MATERIALIZED VIEW analytics.mv_daily_events AS
SELECT
  event_date,
  event_type,
  COUNT(*) AS event_count,
  APPROX_COUNT_DISTINCT(user_id) AS approx_unique_users
FROM analytics.ext_events
GROUP BY event_date, event_type;
The materialized view stores precomputed results in Snowflake's native storage, so queries against it run at native table speed. Snowflake refreshes the MV automatically when it detects changes in the external table's metadata, though with a delay. Note that materialized views require Enterprise Edition or higher.
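A query against the MV might look like this (illustrative; rolling date filters go here, in the query, rather than in the view definition):

```sql
-- Served from the MV's precomputed native storage, not the external files
SELECT event_date, SUM(event_count) AS total_events
FROM analytics.mv_daily_events
WHERE event_date >= DATEADD('day', -7, CURRENT_DATE())
GROUP BY event_date
ORDER BY event_date;
```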
When to Use External Tables
Good Use Cases
- Data lake federation: Query data that's shared between Snowflake, Databricks, and Spark without duplicating it.
- Cost savings for cold data: Keep rarely-queried historical data in cheaper S3 storage instead of paying Snowflake storage rates.
- Compliance requirements: When data must remain in a specific storage location for regulatory reasons.
- ELT staging: Use external tables as a read-only view of landing zone files before running COPY INTO to load them into native tables.
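The ELT staging pattern might look like the following sketch. The native target table analytics.events_native is hypothetical, and MATCH_BY_COLUMN_NAME assumes the Parquet column names match the table's:

```sql
-- Create a native target table for fast repeated access
CREATE TABLE IF NOT EXISTS analytics.events_native (
  event_id VARCHAR,
  user_id INT,
  event_type VARCHAR,
  properties VARIANT,
  event_timestamp TIMESTAMP_NTZ
);

-- Load the landing-zone Parquet files from the stage
COPY INTO analytics.events_native
FROM @ext_events_stage
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```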
Bad Use Cases
- High-frequency queries: If the same table gets queried 100+ times per day, the performance penalty adds up. Load it natively.
- Interactive dashboards: Dashboard users expect sub-second response times, which external tables generally can't deliver.
- DML requirements: You can't INSERT, UPDATE, or DELETE on external tables. If you need write access, use native tables.
- Complex joins: Joining external tables with native tables forces Snowflake to read the external data for every query execution.
The Partition Pruning Gotcha
File-level pruning on external tables only works if your partition columns are properly defined and your queries filter on them. If you define an external table without partition columns, Snowflake scans every file in the stage for every query - even if you only need yesterday's data.
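Two queries against the table defined earlier illustrate the difference:

```sql
-- Prunes: filters on the partition column, reads only one day's files
SELECT COUNT(*) FROM analytics.ext_events
WHERE event_date = '2026-02-27';

-- Scans everything: event_timestamp is a regular data column,
-- so Snowflake can't skip any files even though the intent is the same
SELECT COUNT(*) FROM analytics.ext_events
WHERE event_timestamp >= '2026-02-27'::TIMESTAMP_NTZ;
```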
The metadata refresh is eventually consistent. After a new file arrives in S3, there's a 1-2 minute window where Snowflake doesn't know about it. If you have a pipeline that writes a file and immediately queries the external table, you might miss the new data. Build in a short delay (2-3 minutes) or use a manual REFRESH before querying.
Key Takeaways
- External tables are a metadata layer over S3/GCS/ADLS files. Data stays in external storage and is read at query time.
- Performance is 2-5x slower than native tables. Use partitioning and Parquet files to minimize the gap.
- Auto-refresh uses cloud event notifications (SQS, Pub/Sub, Event Grid) and is eventually consistent with a 1-2 minute delay.
- Materialized views over external tables give you native query speed for frequently-accessed aggregations.
- Use external tables for data lake federation, cold data querying, and compliance scenarios. Don't use them for interactive dashboards or high-frequency queries.
- Always define partition columns. Without them, Snowflake scans every file for every query.
Frequently Asked Questions
Q: How fast are Snowflake external tables compared to native tables?
External tables are typically 2-5x slower than native tables for the same query, because Snowflake must read data from the external storage (S3/GCS/ADLS) rather than from its own optimized micro-partition storage. The gap narrows with Parquet files and proper partitioning.
Q: Can I run DML on Snowflake external tables?
No. External tables are read-only in Snowflake. You can SELECT from them and use them in joins, but you cannot INSERT, UPDATE, or DELETE rows. To modify the data, you must update the files in the external storage directly.
Q: How does auto-refresh work for Snowflake external tables?
Auto-refresh uses cloud event notifications (SQS for S3, GCS Pub/Sub, or Azure Event Grid) to detect new files in the external stage. When a new file lands, Snowflake automatically updates the external table's metadata. This is eventually consistent - there can be a 1-2 minute delay.
Q: When should I use external tables instead of loading data into Snowflake?
Use external tables for rarely-queried data, data lake federation across Snowflake and Spark/Databricks, cost savings when you don't want to pay for Snowflake storage, or when the data must stay in its original location for compliance or multi-tool access.
