Building a Data Lake on S3: Architecture Patterns That Scale
Last updated: April 2025
Quick answer: A well-designed S3 data lake uses three zones (raw/curated/consumption), Hive-style partitioning by date, Parquet files sized between 128MB and 1GB, the AWS Glue Data Catalog for metadata, encryption at rest with KMS, and lifecycle policies to move cold data to Glacier. The biggest mistake teams make is dumping millions of tiny files with no partitioning and no catalog.
The Three-Zone Architecture
Every S3 data lake we've built follows the same three-zone pattern. It's not original, but it works at scale, and deviating from it creates messes that are expensive to clean up later.
- Raw zone (landing/bronze): Data arrives exactly as the source system produces it. No transformations, no type casting, no deduplication. CSV from SFTP? Lands here as CSV. JSON from a webhook? Lands here as JSON. This is your audit trail. You can always reprocess from raw if anything downstream breaks.
- Curated zone (processed/silver): Data has been cleaned, deduplicated, typed, and converted to Parquet. This is where schema enforcement happens. Column names are standardized, timestamps are in UTC, and null handling follows a consistent policy. Most ETL jobs read from raw and write to curated.
- Consumption zone (analytics/gold): Aggregated, denormalized tables optimized for specific analytical workloads. Think pre-computed KPIs, dimensional models, and ML feature stores. Athena, Redshift Spectrum, and Snowflake external tables point at this zone.
s3://company-datalake-raw/
  orders/
    year=2026/month=02/day=15/
      orders_20260215_001.json
      orders_20260215_002.json
  clickstream/
    year=2026/month=02/day=15/hour=14/
      clicks_20260215_140000.json.gz

s3://company-datalake-curated/
  orders/
    year=2026/month=02/day=15/
      part-00000.snappy.parquet
      part-00001.snappy.parquet
  clickstream/
    year=2026/month=02/day=15/
      part-00000.snappy.parquet

s3://company-datalake-consumption/
  daily_revenue/
    year=2026/month=02/
      daily_revenue_202602.snappy.parquet
  customer_360/
    customer_360_latest.snappy.parquet
Partitioning Strategy: Get This Right Early
Partitioning determines how query engines like Athena and Spark find your data. Get it wrong and your queries scan everything. Use Hive-style partitioning (year=2026/month=02/day=15/) because it's recognized automatically by AWS Glue, Athena, Spark, and Presto.
Partition by how people query the data. If 90% of queries filter on date, partition by year/month/day. If queries also filter by region, add a region partition: year=2026/month=02/region=us-east/. But don't over-partition. If you partition by year/month/day/hour/minute, you'll end up with millions of directories each containing a single tiny file. That's worse than no partitioning at all.
A good rule of thumb: each partition should contain at least a few hundred megabytes of data. If your daily partition only has 5MB, consider partitioning by month instead.
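To make the layout concrete, here's a minimal sketch (plain Python, no AWS dependencies; the table name and helper are illustrative, not part of any AWS SDK) of how an ingestion job might derive a Hive-style partition prefix from an event timestamp:

```python
from datetime import datetime, timezone

def partition_prefix(table: str, ts: datetime) -> str:
    """Build a Hive-style year=/month=/day= S3 prefix for one event."""
    # Normalize to UTC so partition boundaries are unambiguous.
    ts = ts.astimezone(timezone.utc)
    return f"{table}/year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/"

print(partition_prefix("orders", datetime(2026, 2, 15, 14, 30, tzinfo=timezone.utc)))
# orders/year=2026/month=02/day=15/
```

Because the prefix is derived from the event time rather than the arrival time, late-arriving records still land in the partition that queries will filter on.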
File Format Choices: Parquet, Avro, or JSON?
This isn't a matter of preference - different formats serve different purposes:
- Parquet: Columnar storage, excellent compression (typically 5-10x vs CSV), supports predicate pushdown. Use this for your curated and consumption zones. Athena queries against Parquet are 3-5x cheaper than the same query against CSV because less data gets scanned. Pair it with Snappy compression for the best read-performance-to-compression ratio.
- Avro: Row-based, schema embedded in each file, excellent for schema evolution. Use this for streaming ingestion (Kafka, Kinesis) where records arrive one at a time and you need fast writes. You can always convert Avro to Parquet in the curated zone.
- JSON: Human-readable, flexible schema. Fine for the raw zone as a landing format. Terrible for analytics at scale - no columnar pruning, poor compression, slow to parse. If your raw data is JSON, convert to Parquet in the curated zone.
- CSV: Avoid for anything beyond small reference files. No schema information, no compression by default, quoting and delimiter issues are a constant headache. If a source system gives you CSV, land it in raw and immediately convert to Parquet.
File Sizing: The 128MB-1GB Sweet Spot
File size matters more than most teams realize. Every file adds overhead: S3 LIST responses return at most 1,000 keys per request, and each file a query touches requires its own GET plus a read of the Parquet footer. If you've got 10 million files at 100KB each, an Athena query has to list and open 10 million objects. That's slow and expensive.
Aim for Parquet files between 128MB and 1GB. If your ETL produces smaller files (common with streaming pipelines), run a compaction job that combines small files into larger ones. Spark's coalesce() or repartition() works well for this. Schedule compaction as a nightly job against the curated zone.
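The grouping logic behind a compaction job can be sketched without Spark. This toy planner (pure Python; the file names and the 512MB target are illustrative) bins small files into batches whose combined size lands in the sweet spot; a real job would then read each batch and rewrite it as one Parquet file via coalesce() or repartition():

```python
def plan_compaction(files, target_bytes=512 * 1024 * 1024):
    """Group (s3_key, size_in_bytes) pairs into batches of roughly target_bytes.

    Returns a list of lists; each inner list is the input set for one output file.
    """
    groups, current, current_size = [], [], 0
    # Largest-first keeps batch sizes close to the target.
    for key, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Ten 100MB micro-batch files collapse into two ~500MB outputs.
small = [(f"part-{i:05d}.parquet", 100 * 1024 * 1024) for i in range(10)]
print(plan_compaction(small))
```

The same idea scales down to per-partition compaction: run the planner over one day's partition at a time so the rewrite never mixes partition keys.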
Metadata Management: Glue Data Catalog vs Hive Metastore
A data lake without a catalog is just a folder full of files. You need something that tracks table schemas, partition locations, and file formats.
AWS Glue Data Catalog is the default choice if you're on AWS. It's serverless, integrates natively with Athena, Redshift Spectrum, and EMR, and supports automatic schema discovery via Glue Crawlers. The first million objects stored are free; after that, it's $1 per 100,000 objects per month. For most data lakes, the catalog cost is negligible.
Apache Hive Metastore makes sense if you're running self-managed Spark or need multi-cloud portability. It runs on a relational database (usually MySQL or PostgreSQL) and requires you to manage uptime, backups, and scaling. Most AWS-native teams don't need it.
One gotcha with Glue Crawlers: they're great for initial discovery but can be unpredictable with schema evolution. If a new column appears in the data, the crawler might create a new table version or merge it incorrectly. For production pipelines, register partitions explicitly with ALTER TABLE ADD PARTITION in Athena instead of relying on crawlers.
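Explicit registration is easy to script. A minimal sketch (stdlib Python; the table name and S3 location are hypothetical) that renders the ALTER TABLE ADD PARTITION statement you would then submit to Athena, for example through boto3's start_query_execution:

```python
def add_partition_ddl(table: str, location: str, year: int, month: int, day: int) -> str:
    """Render an Athena DDL statement registering one day's partition."""
    part = f"year='{year:04d}', month='{month:02d}', day='{day:02d}'"
    path = f"{location}/year={year:04d}/month={month:02d}/day={day:02d}/"
    # IF NOT EXISTS makes the statement safe to re-run after pipeline retries.
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({part}) LOCATION '{path}'"
    )

print(add_partition_ddl("orders", "s3://company-datalake-curated/orders", 2026, 2, 15))
```

Running this at the end of each ETL batch makes the new partition queryable immediately, with no crawler involved.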
Security: Bucket Policies + IAM + KMS
Data lake security has three layers, and you need all three:
- S3 bucket policies: Control which AWS accounts and services can access the bucket. Block all public access (this should be the default, but verify it). Restrict cross-account access to specific IAM roles.
- IAM roles and policies: Grant least-privilege access. The ETL pipeline gets read/write access to the raw and curated zones. Analysts get read-only access to the curated and consumption zones. Nobody gets s3:* on the entire bucket.
- Encryption at rest with KMS: Enable SSE-KMS on all data lake buckets. Use customer-managed keys (CMKs) if your compliance requirements mandate key rotation control. SSE-S3 is simpler but gives you less control over key management.
For cross-account access (common in multi-team organizations), use S3 bucket policies combined with IAM role assumption. The consuming account assumes a role in the data lake account, and that role has specific S3 permissions. Don't share access keys - use role-based access exclusively.
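As a sketch, a bucket policy granting a consuming account's role read-only access to the consumption zone might look like this (the account ID, role name, and bucket name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAnalyticsRoleRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::222222222222:role/analytics-readonly"
      },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::company-datalake-consumption",
        "arn:aws:s3:::company-datalake-consumption/*"
      ]
    }
  ]
}
```

The role in the consuming account still needs a matching IAM policy allowing the same actions; cross-account S3 access requires both sides to grant it.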
Lifecycle Policies: Automate Cost Management
S3 Standard storage is $0.023/GB/month. Glacier is $0.004/GB/month. That's an 80% cost reduction for data you rarely access. Set up lifecycle policies to move data automatically:
- Raw zone data older than 90 days → S3 Glacier Instant Retrieval
- Raw zone data older than 1 year → S3 Glacier Deep Archive
- Temp/staging data older than 7 days → Delete
- Curated zone data older than 6 months → S3 Infrequent Access
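Those rules translate almost directly into an S3 lifecycle configuration. A sketch for the raw bucket (the prefixes and rule IDs are illustrative; GLACIER_IR and DEEP_ARCHIVE are the API names for Glacier Instant Retrieval and Glacier Deep Archive):

```json
{
  "Rules": [
    {
      "ID": "raw-archive",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [
        {"Days": 90, "StorageClass": "GLACIER_IR"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
      ]
    },
    {
      "ID": "temp-expire",
      "Status": "Enabled",
      "Filter": {"Prefix": "temp/"},
      "Expiration": {"Days": 7}
    }
  ]
}
```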
One thing people miss: lifecycle policies apply to the S3 storage class, not to the Glue catalog. Your Athena table definitions still point to the data even after it moves to Glacier. If someone runs a query against archived data, they'll get an error. Update your table definitions or use separate tables for hot vs. cold data.
Schema Registry: Plan for Schema Evolution
Source systems change their schemas. A new column gets added, a field gets renamed, a type changes from integer to string. If you don't plan for this, your ETL pipeline breaks at 3am and your morning dashboards are empty.
Use the AWS Glue Schema Registry (or Confluent Schema Registry if you're using Kafka) to track schema versions. Enforce backward compatibility: new fields can be added, but existing fields can't be removed or have their types changed. Parquet handles additive schema changes well - new columns get null values in old files, old columns stay intact in new files.
Anti-Patterns: What Not to Do
We've seen every data lake mistake in the book. Here are the ones that cause the most pain:
- The small file problem: Millions of files under 1MB each. Usually caused by streaming pipelines that write one file per micro-batch. Athena queries against 5 million small files take 10x longer than the same data in 50 properly-sized Parquet files. Fix it with a compaction job.
- No partitioning: Dumping all files into a flat directory. Every query does a full scan. Partitioning by date typically eliminates 90%+ of unnecessary data scanning.
- Using CSV for everything: No compression, no columnar pruning, delimiter hell. We've seen a data lake where converting from gzipped CSV to Snappy Parquet reduced storage by 70% and query costs by 85%.
- No data catalog: If the only way to know what's in your data lake is to read the S3 path and guess, you've got a data swamp, not a data lake. Register everything in Glue Data Catalog from day one.
- Giving everyone s3:* permissions: One accidental aws s3 rm --recursive and your entire curated zone is gone. Use IAM policies with least-privilege access and enable S3 versioning on critical buckets.
Conclusion
A well-architected S3 data lake isn't complicated, but it does require deliberate design decisions upfront. The three-zone structure, Hive-style partitioning, Parquet format, and Glue Data Catalog form the foundation. Layer on KMS encryption, lifecycle policies, and a schema registry, and you've got a data lake that scales to petabytes without becoming unmanageable. The teams that struggle are the ones that skip these decisions early and try to retrofit structure onto a data swamp 18 months later. Don't be that team.
