Building a Data Lake on S3: Architecture Patterns That Scale
Last updated: April 2025
Quick answer: A well-designed S3 data lake uses three zones (raw/curated/consumption), Hive-style partitioning by date, Parquet files sized between 128MB and 1GB, the AWS Glue Data Catalog for metadata, encryption at rest with KMS, and lifecycle policies to move cold data to Glacier. The biggest mistake teams make is dumping millions of tiny files with no partitioning and no catalog.
The Three-Zone Architecture
Every S3 data lake we've built follows the same three-zone pattern. It's not original, but it works at scale, and deviating from it creates messes that are expensive to clean up later.
- Raw zone (landing/bronze): Data arrives exactly as the source system produces it. No transformations, no type casting, no deduplication. CSV from SFTP? Lands here as CSV. JSON from a webhook? Lands here as JSON. This is your audit trail. You can always reprocess from raw if anything downstream breaks.
- Curated zone (processed/silver): Data has been cleaned, deduplicated, typed, and converted to Parquet. This is where schema enforcement happens. Column names are standardized, timestamps are in UTC, and null handling follows a consistent policy. Most ETL jobs read from raw and write to curated.
- Consumption zone (analytics/gold): Aggregated, denormalized tables optimized for specific analytical workloads. Think pre-computed KPIs, dimensional models, and ML feature stores. Athena, Redshift Spectrum, and Snowflake external tables point at this zone.
s3://company-datalake-raw/
  orders/
    year=2026/month=02/day=15/
      orders_20260215_001.json
      orders_20260215_002.json
  clickstream/
    year=2026/month=02/day=15/hour=14/
      clicks_20260215_140000.json.gz

s3://company-datalake-curated/
  orders/
    year=2026/month=02/day=15/
      part-00000.snappy.parquet
      part-00001.snappy.parquet
  clickstream/
    year=2026/month=02/day=15/
      part-00000.snappy.parquet

s3://company-datalake-consumption/
  daily_revenue/
    year=2026/month=02/
      daily_revenue_202602.snappy.parquet
  customer_360/
    customer_360_latest.snappy.parquet
Partitioning Strategy: Get This Right Early
Partitioning determines how query engines like Athena and Spark find your data. Get it wrong and your queries scan everything. Use Hive-style partitioning (year=2026/month=02/day=15/) because it's recognized automatically by AWS Glue, Athena, Spark, and Presto.
Partition by how people query the data. If 90% of queries filter on date, partition by year/month/day. If queries also filter by region, add a region partition: year=2026/month=02/region=us-east/. But don't over-partition. If you partition by year/month/day/hour/minute, you'll end up with millions of directories each containing a single tiny file. That's worse than no partitioning at all.
A good rule of thumb: each partition should contain at least a few hundred megabytes of data. If your daily partition only has 5MB, consider partitioning by month instead.
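To make the layout concrete, here's a minimal sketch (plain Python, no AWS dependencies; the table name and helper are illustrative, not part of any AWS SDK) of how an ingestion job might derive a Hive-style partition prefix from an event timestamp:

```python
from datetime import datetime, timezone

def partition_prefix(table: str, ts: datetime) -> str:
    """Build a Hive-style year=/month=/day= S3 prefix for one event."""
    # Normalize to UTC so partition boundaries are unambiguous.
    ts = ts.astimezone(timezone.utc)
    return f"{table}/year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/"

print(partition_prefix("orders", datetime(2026, 2, 15, 14, 30, tzinfo=timezone.utc)))
# orders/year=2026/month=02/day=15/
```

Because the prefix is derived from the event time rather than the arrival time, late-arriving records still land in the partition that queries will filter on.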
File Format Choices: Parquet, Avro, or JSON?
This isn't a matter of preference - different formats serve different purposes:
- Parquet: Columnar storage, excellent compression (typically 5-10x vs CSV), supports predicate pushdown. Use this for your curated and consumption zones. Athena queries against Parquet are 3-5x cheaper than the same query against CSV because less data gets scanned. Pair it with Snappy compression for the best read-performance-to-compression ratio.
- Avro: Row-based, schema embedded in each file, excellent for schema evolution. Use this for streaming ingestion (Kafka, Kinesis) where records arrive one at a time and you need fast writes. You can always convert Avro to Parquet in the curated zone.
- JSON: Human-readable, flexible schema. Fine for the raw zone as a landing format. Terrible for analytics at scale - no columnar pruning, poor compression, slow to parse. If your raw data is JSON, convert to Parquet in the curated zone.
- CSV: Avoid for anything beyond small reference files. No schema information, no compression by default, quoting and delimiter issues are a constant headache. If a source system gives you CSV, land it in raw and immediately convert to Parquet.
File Sizing: The 128MB-1GB Sweet Spot
File size matters more than most teams realize. Every file adds overhead: S3 LIST responses return at most 1,000 keys per request, and each file a query touches requires its own GET plus a read of the Parquet footer. If you've got 10 million files at 100KB each, an Athena query has to list and open 10 million objects. That's slow and expensive.
Aim for Parquet files between 128MB and 1GB. If your ETL produces smaller files (common with streaming pipelines), run a compaction job that combines small files into larger ones. Spark's coalesce() or repartition() works well for this. Schedule compaction as a nightly job against the curated zone.
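The grouping logic behind a compaction job can be sketched without Spark. This toy planner (pure Python; the file names and the 512MB target are illustrative) bins small files into batches whose combined size lands in the sweet spot; a real job would then read each batch and rewrite it as one Parquet file via coalesce() or repartition():

```python
def plan_compaction(files, target_bytes=512 * 1024 * 1024):
    """Group (s3_key, size_in_bytes) pairs into batches of roughly target_bytes.

    Returns a list of lists; each inner list is the input set for one output file.
    """
    groups, current, current_size = [], [], 0
    # Largest-first keeps batch sizes close to the target.
    for key, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Ten 100MB micro-batch files collapse into two ~500MB outputs.
small = [(f"part-{i:05d}.parquet", 100 * 1024 * 1024) for i in range(10)]
print(plan_compaction(small))
```

The same idea scales down to per-partition compaction: run the planner over one day's partition at a time so the rewrite never mixes partition keys.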
Metadata Management: Glue Data Catalog vs Hive Metastore
A data lake without a catalog is just a folder full of files. You need something that tracks table schemas, partition locations, and file formats.
AWS Glue Data Catalog is the default choice if you're on AWS. It's serverless, integrates natively with Athena, Redshift Spectrum, and EMR, and supports automatic schema discovery via Glue Crawlers. The first million objects stored are free; after that, it's $1 per 100,000 objects per month. For most data lakes, the catalog cost is negligible.
Apache Hive Metastore makes sense if you're running self-managed Spark or need multi-cloud portability. It runs on a relational database (usually MySQL or PostgreSQL) and requires you to manage uptime, backups, and scaling. Most AWS-native teams don't need it.
One gotcha with Glue Crawlers: they're great for initial discovery but can be unpredictable with schema evolution. If a new column appears in the data, the crawler might create a new table version or merge it incorrectly. For production pipelines, register partitions explicitly with ALTER TABLE ADD PARTITION in Athena instead of relying on crawlers.
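Explicit registration is easy to script. A minimal sketch (stdlib Python; the table name and S3 location are hypothetical) that renders the ALTER TABLE ADD PARTITION statement you would then submit to Athena, for example through boto3's start_query_execution:

```python
def add_partition_ddl(table: str, location: str, year: int, month: int, day: int) -> str:
    """Render an Athena DDL statement registering one day's partition."""
    part = f"year='{year:04d}', month='{month:02d}', day='{day:02d}'"
    path = f"{location}/year={year:04d}/month={month:02d}/day={day:02d}/"
    # IF NOT EXISTS makes the statement safe to re-run after pipeline retries.
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({part}) LOCATION '{path}'"
    )

print(add_partition_ddl("orders", "s3://company-datalake-curated/orders", 2026, 2, 15))
```

Running this at the end of each ETL batch makes the new partition queryable immediately, with no crawler involved.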
Security: Bucket Policies + IAM + KMS
Data lake security has three layers, and you need all three:
- S3 bucket policies: Control which AWS accounts and services can access the bucket. Block all public access (this should be the default, but verify it). Restrict cross-account access to specific IAM roles.
- IAM roles and policies: Grant least-privilege access. The ETL pipeline gets read/write access to the raw and curated zones. Analysts get read-only access to the curated and consumption zones. Nobody gets s3:* on the entire bucket.
- Encryption at rest with KMS: Enable SSE-KMS on all data lake buckets. Use customer-managed keys (CMKs) if your compliance requirements mandate key rotation control. SSE-S3 is simpler but gives you less control over key management.
For cross-account access (common in multi-team organizations), use S3 bucket policies combined with IAM role assumption. The consuming account assumes a role in the data lake account, and that role has specific S3 permissions. Don't share access keys - use role-based access exclusively.
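As a sketch, a bucket policy granting a consuming account's role read-only access to the consumption zone might look like this (the account ID, role name, and bucket name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAnalyticsRoleRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::222222222222:role/analytics-readonly"
      },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::company-datalake-consumption",
        "arn:aws:s3:::company-datalake-consumption/*"
      ]
    }
  ]
}
```

The role in the consuming account still needs a matching IAM policy allowing the same actions; cross-account S3 access requires both sides to grant it.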
Lifecycle Policies: Automate Cost Management
S3 Standard storage is $0.023/GB/month. Glacier is $0.004/GB/month. That's an 80% cost reduction for data you rarely access. Set up lifecycle policies to move data automatically:
- Raw zone data older than 90 days → S3 Glacier Instant Retrieval
- Raw zone data older than 1 year → S3 Glacier Deep Archive
- Temp/staging data older than 7 days → Delete
- Curated zone data older than 6 months → S3 Infrequent Access
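Those rules translate almost directly into an S3 lifecycle configuration. A sketch for the raw bucket (the prefixes and rule IDs are illustrative; GLACIER_IR and DEEP_ARCHIVE are the API names for Glacier Instant Retrieval and Glacier Deep Archive):

```json
{
  "Rules": [
    {
      "ID": "raw-archive",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [
        {"Days": 90, "StorageClass": "GLACIER_IR"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
      ]
    },
    {
      "ID": "temp-expire",
      "Status": "Enabled",
      "Filter": {"Prefix": "temp/"},
      "Expiration": {"Days": 7}
    }
  ]
}
```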
One thing people miss: lifecycle policies apply to the S3 storage class, not to the Glue catalog. Your Athena table definitions still point to the data even after it moves to Glacier. If someone runs a query against archived data, they'll get an error. Update your table definitions or use separate tables for hot vs. cold data.
Schema Registry: Plan for Schema Evolution
Source systems change their schemas. A new column gets added, a field gets renamed, a type changes from integer to string. If you don't plan for this, your ETL pipeline breaks at 3am and your morning dashboards are empty.
Use the AWS Glue Schema Registry (or Confluent Schema Registry if you're using Kafka) to track schema versions. Enforce backward compatibility: new fields can be added, but existing fields can't be removed or have their types changed. Parquet handles additive schema changes well - new columns get null values in old files, old columns stay intact in new files.
Anti-Patterns: What Not to Do
We've seen every data lake mistake in the book. Here are the ones that cause the most pain:
- The small file problem: Millions of files under 1MB each. Usually caused by streaming pipelines that write one file per micro-batch. Athena queries against 5 million small files take 10x longer than the same data in 50 properly-sized Parquet files. Fix it with a compaction job.
- No partitioning: Dumping all files into a flat directory. Every query does a full scan. Partitioning by date typically eliminates 90%+ of unnecessary data scanning.
- Using CSV for everything: No compression, no columnar pruning, delimiter hell. We've seen a data lake where converting from gzipped CSV to Snappy Parquet reduced storage by 70% and query costs by 85%.
- No data catalog: If the only way to know what's in your data lake is to read the S3 path and guess, you've got a data swamp, not a data lake. Register everything in Glue Data Catalog from day one.
- Giving everyone s3:* permissions: One accidental aws s3 rm --recursive and your entire curated zone is gone. Use IAM policies with least-privilege access and enable S3 versioning on critical buckets.
Conclusion
A well-architected S3 data lake isn't complicated, but it does require deliberate design decisions upfront. The three-zone structure, Hive-style partitioning, Parquet format, and Glue Data Catalog form the foundation. Layer on KMS encryption, lifecycle policies, and a schema registry, and you've got a data lake that scales to petabytes without becoming unmanageable. The teams that struggle are the ones that skip these decisions early and try to retrofit structure onto a data swamp 18 months later. Don't be that team.
