Building a Data Lake on S3: Architecture Patterns That Scale

Celestinfo Software Solutions Pvt. Ltd. Apr 03, 2025

Last updated: April 2025

Quick answer: A well-designed S3 data lake uses three zones (raw/curated/consumption), Hive-style partitioning by date, Parquet files sized between 128MB and 1GB, the AWS Glue Data Catalog for metadata, encryption at rest with KMS, and lifecycle policies to move cold data to Glacier. The biggest mistake teams make is dumping millions of tiny files with no partitioning and no catalog.

The Three-Zone Architecture

Every S3 data lake we've built follows the same three-zone pattern: raw holds source data exactly as it arrives, curated holds cleaned and validated data converted to Parquet, and consumption holds aggregated tables built for dashboards and reporting. It's not original, but it works at scale, and deviating from it creates messes that are expensive to clean up later.

S3 bucket structure example
s3://company-datalake-raw/
  orders/
    year=2026/month=02/day=15/
      orders_20260215_001.json
      orders_20260215_002.json
  clickstream/
    year=2026/month=02/day=15/hour=14/
      clicks_20260215_140000.json.gz

s3://company-datalake-curated/
  orders/
    year=2026/month=02/day=15/
      part-00000.snappy.parquet
      part-00001.snappy.parquet
  clickstream/
    year=2026/month=02/day=15/
      part-00000.snappy.parquet

s3://company-datalake-consumption/
  daily_revenue/
    year=2026/month=02/
      daily_revenue_202602.snappy.parquet
  customer_360/
      customer_360_latest.snappy.parquet

Partitioning Strategy: Get This Right Early

Partitioning determines how query engines like Athena and Spark find your data. Get it wrong and your queries scan everything. Use Hive-style partitioning (year=2026/month=02/day=15/) because it's recognized automatically by AWS Glue, Athena, Spark, and Presto.

Partition by how people query the data. If 90% of queries filter on date, partition by year/month/day. If queries also filter by region, add a region partition: year=2026/month=02/region=us-east/. But don't over-partition. If you partition by year/month/day/hour/minute, you'll end up with millions of directories each containing a single tiny file. That's worse than no partitioning at all.

A good rule of thumb: each partition should contain at least a few hundred megabytes of data. If your daily partition only has 5MB, consider partitioning by month instead.
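As a concrete illustration, here's a minimal PySpark sketch of a curation job that writes Hive-style date partitions. The bucket paths follow the layout above; the order_date column and the ISO date format are assumptions about the source data, not a prescription.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-curation").getOrCreate()

# Read one day's raw JSON drop (paths follow the bucket layout shown earlier)
raw = spark.read.json("s3://company-datalake-raw/orders/year=2026/month=02/day=15/")

# Derive partition columns from an order_date field (assumed to be an ISO date string)
curated = (raw
    .withColumn("order_date", F.to_date("order_date"))
    .withColumn("year",  F.date_format("order_date", "yyyy"))
    .withColumn("month", F.date_format("order_date", "MM"))
    .withColumn("day",   F.date_format("order_date", "dd")))

# partitionBy produces year=.../month=.../day=... directories that Glue, Athena,
# Spark, and Presto all recognize automatically
(curated.write
    .mode("append")
    .partitionBy("year", "month", "day")
    .parquet("s3://company-datalake-curated/orders/"))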

File Format Choices: Parquet, Avro, or JSON?

This isn't a matter of preference; different formats serve different jobs:

Parquet for the curated and consumption zones: columnar, compressed, and supports predicate pushdown, so analytics engines read only the columns and row groups a query needs.
Avro for streaming ingestion into the raw zone: row-based, fast to write, and designed for schema evolution.
JSON and CSV only as landing formats: keep whatever the source system emits in the raw zone, but convert to Parquet before anyone queries it at scale.
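To make the predicate-pushdown point concrete, here's a small pyarrow sketch that reads from the curated zone. It's illustrative only; the column names order_id and order_total are assumptions.

import pyarrow.dataset as ds

# Discover the Hive-partitioned Parquet dataset in the curated zone
dataset = ds.dataset(
    "s3://company-datalake-curated/orders/",
    format="parquet",
    partitioning="hive",
)

# Column pruning plus partition filtering: only two columns are read,
# and only files under year=2026/month=02/ are touched at all
table = dataset.to_table(
    columns=["order_id", "order_total"],
    filter=(ds.field("year") == 2026) & (ds.field("month") == 2),
)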

File Sizing: The 128MB-1GB Sweet Spot

File size matters more than most teams realize. Every object adds overhead: it has to show up in S3 LIST responses, be opened with its own GET request, and be tracked by the query engine's planner. If you've got 10 million files at 100KB each, an Athena query has to open 10 million files. That's slow and expensive.

Aim for Parquet files between 128MB and 1GB. If your ETL produces smaller files (common with streaming pipelines), run a compaction job that combines small files into larger ones. Spark's coalesce() or repartition() works well for this. Schedule compaction as a nightly job against the curated zone.
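A compaction pass can be as simple as the sketch below; the staging prefix and the output file count of 8 are assumptions to tune against your real partition sizes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-compaction").getOrCreate()

# Rewrite one day's partition in the curated zone as a handful of large Parquet files
source = "s3://company-datalake-curated/clickstream/year=2026/month=02/day=15/"
staging = "s3://company-datalake-curated/_staging/clickstream/year=2026/month=02/day=15/"

df = spark.read.parquet(source)

# coalesce() merges the many small input files; pick the count so each output file
# lands in the 128MB-1GB range (8 is a placeholder, not a recommendation)
df.coalesce(8).write.mode("overwrite").parquet(staging)

# Once the staged write succeeds, swap it into place (or repoint the table's partition
# LOCATION) rather than overwriting the live prefix while queries may be running.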

Metadata Management: Glue Data Catalog vs Hive Metastore

A data lake without a catalog is just a folder full of files. You need something that tracks table schemas, partition locations, and file formats.

AWS Glue Data Catalog is the default choice if you're on AWS. It's serverless, integrates natively with Athena, Redshift Spectrum, and EMR, and supports automatic schema discovery via Glue Crawlers. The first million objects stored are free; after that, it's $1 per 100,000 objects per month. For most data lakes, the catalog cost is negligible.

Apache Hive Metastore makes sense if you're running self-managed Spark or need multi-cloud portability. It runs on a relational database (usually MySQL or PostgreSQL) and requires you to manage uptime, backups, and scaling. Most AWS-native teams don't need it.

One gotcha with Glue Crawlers: they're great for initial discovery but can be unpredictable with schema evolution. If a new column appears in the data, the crawler might create a new table version or merge it incorrectly. For production pipelines, register partitions explicitly with ALTER TABLE ADD PARTITION in Athena instead of relying on crawlers.
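For reference, here's what explicit partition registration can look like with boto3; the database, table, and results-bucket names are placeholders.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Register a newly landed partition explicitly instead of re-running a crawler
ddl = """
ALTER TABLE orders ADD IF NOT EXISTS
PARTITION (year='2026', month='02', day='15')
LOCATION 's3://company-datalake-curated/orders/year=2026/month=02/day=15/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "curated"},
    ResultConfiguration={"OutputLocation": "s3://company-datalake-athena-results/"},
)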

Security: Bucket Policies + IAM + KMS

Data lake security has three layers, and you need all three:

Bucket policies control which AWS accounts and principals can reach each bucket at all; this is also where cross-account access is granted.
IAM roles and policies scope what each pipeline, analyst, and service can actually read or write, ideally down to the zone or prefix level.
KMS encryption at rest protects the data itself and, with CloudTrail, gives you an audit trail of who used which key.

For cross-account access (common in multi-team organizations), use S3 bucket policies combined with IAM role assumption. The consuming account assumes a role in the data lake account, and that role has specific S3 permissions. Don't share access keys - use role-based access exclusively.
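A rough sketch of that role-assumption flow with boto3 (the role ARN, session name, and account ID are placeholders):

import boto3

# The consuming account assumes a role that lives in the data lake account
sts = boto3.client("sts")

creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/datalake-consumer-readonly",
    RoleSessionName="analytics-team-read",
)["Credentials"]

# Temporary credentials from the role, not long-lived access keys, do the reading
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

response = s3.list_objects_v2(
    Bucket="company-datalake-curated",
    Prefix="orders/year=2026/month=02/",
)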

Lifecycle Policies: Automate Cost Management

S3 Standard storage is $0.023/GB/month. Glacier is $0.004/GB/month. That's an 80%+ cost reduction for data you rarely access. Set up lifecycle policies to move data automatically:

Raw zone: transition to Glacier after 90 days; once data has been curated, the raw copies are rarely re-read.
Curated zone: transition to S3 Infrequent Access for data older than about 6 months.
Temp and staging prefixes: expire (delete) after 7 days.
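Here's an illustrative boto3 version of the raw-zone rules; the tmp/ prefix is an assumption about where your jobs write scratch data.

import boto3

s3 = boto3.client("s3")

# Lifecycle rules for the raw bucket: archive everything after 90 days,
# expire scratch data after 7 days
s3.put_bucket_lifecycle_configuration(
    Bucket="company-datalake-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-to-glacier-after-90-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # whole bucket
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            },
            {
                "ID": "expire-temp-after-7-days",
                "Status": "Enabled",
                "Filter": {"Prefix": "tmp/"},
                "Expiration": {"Days": 7},
            },
        ]
    },
)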

One thing people miss: lifecycle policies apply to the S3 storage class, not to the Glue catalog. Your Athena table definitions still point to the data even after it moves to Glacier. If someone runs a query against archived data, it will fail or quietly return incomplete results, depending on the engine. Update your table definitions or use separate tables for hot vs. cold data.

Schema Registry: Plan for Schema Evolution

Source systems change their schemas. A new column gets added, a field gets renamed, a type changes from integer to string. If you don't plan for this, your ETL pipeline breaks at 3am and your morning dashboards are empty.

Use the AWS Glue Schema Registry (or Confluent Schema Registry if you're using Kafka) to track schema versions. Enforce backward compatibility: new fields can be added, but existing fields can't be removed or have their types changed. Parquet handles additive schema changes well - new columns get null values in old files, old columns stay intact in new files.
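As a sketch of how that enforcement might look with the Glue Schema Registry and boto3 (registry name, schema name, and the Avro fields are all placeholders):

import boto3

glue = boto3.client("glue")

# Create a schema with backward compatibility enforced: new optional fields are fine,
# removing fields or changing types gets rejected at registration time
glue.create_schema(
    RegistryId={"RegistryName": "datalake-schemas"},
    SchemaName="orders-value",
    DataFormat="AVRO",
    Compatibility="BACKWARD",
    SchemaDefinition="""{
        "type": "record",
        "name": "Order",
        "fields": [
            {"name": "order_id", "type": "string"},
            {"name": "order_total", "type": "double"},
            {"name": "coupon_code", "type": ["null", "string"], "default": null}
        ]
    }""",
)

# Later versions go through register_schema_version, which checks compatibility
# against the existing versions before accepting the change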

Anti-Patterns: What Not to Do

We've seen every data lake mistake in the book. Here are the ones that cause the most pain:

The small file problem: millions of tiny JSON or CSV objects with no compaction, so every query pays per-file overhead.
No catalog: data lands in S3 but nothing tracks schemas or partitions, and the lake degrades into a folder full of files nobody trusts.
No partitioning, or over-partitioning: either every query scans everything, or year/month/day/hour/minute partitions scatter one tiny file per directory.
Leaving raw CSV and JSON as the query format instead of converting to Parquet in the curated zone.
Relying on Glue Crawlers for production schema management instead of registering partitions and schemas explicitly.
Sharing long-lived access keys across teams instead of using role-based, cross-account access.

Conclusion

A well-architected S3 data lake isn't complicated, but it does require deliberate design decisions upfront. The three-zone structure, Hive-style partitioning, Parquet format, and Glue Data Catalog form the foundation. Layer on KMS encryption, lifecycle policies, and a schema registry, and you've got a data lake that scales to petabytes without becoming unmanageable. The teams that struggle are the ones that skip these decisions early and try to retrofit structure onto a data swamp 18 months later. Don't be that team.

Chakri, Cloud Solutions Architect

Chakri is a Cloud Solutions Architect at CelestInfo with hands-on experience across AWS, Azure, GCP, and Snowflake cloud infrastructure.

Burning Questions

Which file format should I use for an S3 data lake?
Parquet for analytics (columnar, compressed, supports predicate pushdown). Avro for streaming ingestion (row-based, fast writes, schema evolution). Avoid CSV and raw JSON for large-scale analytics.

How should I partition the data?
Use Hive-style partitioning (year=2026/month=02/day=15) based on your most common query filters. Each partition should contain at least a few hundred MB of data. Over-partitioning creates the small file problem.

What is the small file problem?
When your data lake has millions of files under 1MB each. Every file adds its own request, listing, and planning overhead. Fix it with compaction jobs that merge small files into 128MB-1GB Parquet files.

Should I rely on Glue Crawlers in production?
Glue Crawlers work for initial discovery, but can be unpredictable with schema evolution. For production pipelines, register partitions explicitly with ALTER TABLE ADD PARTITION in Athena for better control.

How do I keep storage costs under control?
Set up lifecycle policies: move raw data to Glacier after 90 days, delete temp data after 7 days, and use S3 Infrequent Access for curated data older than 6 months. Converting from CSV to Parquet alone can reduce storage by 70%.
