5 Data Integration Patterns Every Engineer Should Know: ETL, ELT, CDC, Streaming, and Reverse ETL
Quick answer: The 5 core data integration patterns are ETL (transform before loading), ELT (load then transform in-warehouse), CDC (capture incremental changes from database logs), Streaming (sub-second event processing via Kafka/Kinesis), and Reverse ETL (push warehouse data back to SaaS tools). Most modern stacks combine ELT + CDC + Reverse ETL. Only add streaming if you genuinely need sub-second latency - batch works fine for 90% of use cases.
Last updated: July 2025
Introduction
Five years ago, data integration meant one thing: ETL. Extract data from sources, transform it on a dedicated server, load it into a warehouse. That's still a valid pattern, but it's now one of five. The rise of cloud warehouses, event streaming platforms, and the "operational analytics" movement created four more patterns that solve different problems. Picking the wrong one costs you either in over-engineering (streaming when batch would work) or under-engineering (batch when you actually need real-time).
This guide covers all five patterns with specific tools, trade-offs, and a decision framework so you can choose the right one for your use case. For a deeper look at how to manage compute workloads for ETL vs analytics, see our dedicated guide.
Pattern 1: ETL (Extract, Transform, Load)
How It Works
Data is extracted from source systems, transformed on a dedicated ETL server (cleaning, joining, aggregating), and then loaded into the target data warehouse. The transformation happens before the data reaches the warehouse.
When to Use It
- Source data requires heavy cleanup before it's useful (deduplication, format standardization, PII masking).
- You're loading into an on-premises warehouse where compute is expensive and you want to minimize what gets stored.
- Compliance requirements mandate that raw data never enters your analytics environment.
Tools
Talend, Informatica PowerCenter, SSIS (SQL Server Integration Services), Apache NiFi.
Pros and Cons
- Pro: Only clean, validated data enters your warehouse. Smaller storage footprint.
- Pro: Fine-grained control over transformation logic before loading.
- Con: Requires dedicated ETL server infrastructure that you have to provision and maintain.
- Con: Slower development cycle - schema changes in the source require ETL pipeline updates before you can query the new data.
Real example: A hospital system extracts patient records from 12 different EMR systems, standardizes medical codes (ICD-10 mapping), masks SSNs, and loads the cleaned records into their analytics warehouse. Raw patient data with SSNs never enters the warehouse.
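The transform-before-load step can be sketched in a few lines. This is a minimal illustration, not a real pipeline: the record fields (`patient_id`, `visit_date`, `ssn`) and the hashing approach are hypothetical stand-ins for whatever masking and deduplication rules your compliance team actually requires.

```python
import hashlib

def transform(records):
    """Clean records BEFORE load: mask PII and drop duplicates,
    so raw SSNs never reach the warehouse."""
    seen = set()
    cleaned = []
    for rec in records:
        rec = dict(rec)
        # Mask PII with an irreversible hash (illustrative; real pipelines
        # may use tokenization or format-preserving encryption instead).
        rec["ssn"] = hashlib.sha256(rec["ssn"].encode()).hexdigest()[:12]
        # Deduplicate on a natural key.
        key = (rec["patient_id"], rec["visit_date"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

raw = [
    {"patient_id": 1, "visit_date": "2025-07-01", "ssn": "123-45-6789"},
    {"patient_id": 1, "visit_date": "2025-07-01", "ssn": "123-45-6789"},  # duplicate
]
clean = transform(raw)
print(len(clean))  # only the load step sees this cleaned output
```

The key property of ETL is visible here: by the time anything is loaded, the sensitive raw values are already gone.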
Pattern 2: ELT (Extract, Load, Transform)
How It Works
Data is extracted from sources and loaded raw into the warehouse. Transformation happens inside the warehouse using SQL, typically orchestrated by a tool like dbt. The warehouse's compute engine handles all the heavy lifting.
When to Use It
- Your warehouse has elastic compute (Snowflake, BigQuery, Redshift Serverless) where processing is cheap and scalable.
- You want analysts to have access to raw data for ad-hoc exploration alongside transformed models.
- You need fast iteration - load the data first, figure out transformations later.
Tools
dbt + Snowflake, dbt + BigQuery, dbt + Redshift. Ingestion via Fivetran, Airbyte, Hevo Data, or Stitch.
Pros and Cons
- Pro: No separate ETL server to manage. Transformations run on warehouse compute that auto-scales.
- Pro: Raw data is always available for re-transformation when business logic changes.
- Con: Raw data sits in your warehouse, increasing storage costs.
- Con: If your warehouse pricing is compute-heavy, large transformations get expensive.
Real example: An e-commerce company loads raw Shopify order data, Stripe payment events, and Google Analytics sessions into Snowflake via Fivetran. dbt models transform the raw data into customer lifetime value tables, cohort analyses, and revenue attribution models inside Snowflake.
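The load-first, transform-later flow can be shown with `sqlite3` standing in for the warehouse; in practice the target would be Snowflake or BigQuery and the second step would live in a dbt model. Table and column names are illustrative.

```python
import sqlite3

# sqlite3 stands in for the cloud warehouse in this sketch.
wh = sqlite3.connect(":memory:")

# Step 1: extract + load raw data as-is, with no cleanup.
wh.execute("CREATE TABLE raw_orders (customer TEXT, amount REAL)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?)",
               [("a", 40), ("a", 60), ("b", 25)])

# Step 2: transform INSIDE the warehouse with SQL, after loading.
# This is the part a dbt model would express.
wh.execute("""
    CREATE TABLE customer_ltv AS
    SELECT customer, SUM(amount) AS lifetime_value
    FROM raw_orders
    GROUP BY customer
""")
print(dict(wh.execute("SELECT * FROM customer_ltv")))  # {'a': 100.0, 'b': 25.0}
```

Because `raw_orders` is still sitting in the warehouse, the `customer_ltv` model can be dropped and rebuilt at any time when business logic changes, which is the core advantage over ETL.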
Pattern 3: CDC (Change Data Capture)
How It Works
CDC reads the database transaction log (binlog in MySQL, WAL in PostgreSQL, redo log in Oracle) and captures only the rows that changed - inserts, updates, and deletes. These changes are streamed to the target system incrementally, without querying the source tables directly.
When to Use It
- You need near-real-time sync from a transactional database to your warehouse (5-second to 5-minute latency).
- The source table is large (100M+ rows) and full reloads would take hours or put unacceptable load on the source.
- You need to capture deletes - batch extracts can't reliably detect deleted rows without soft-delete columns.
Tools
Debezium (open source, Kafka-based), Hevo CDC, Fivetran (built-in CDC mode), AWS DMS, Talend CDC, Qlik Replicate.
Pros and Cons
- Pro: Near-real-time sync without full table scans on the source. Minimal impact on source database performance.
- Pro: Captures deletes and updates, not just inserts.
- Con: Requires access to database transaction logs, which some managed databases restrict.
- Con: Schema changes (DDL) in the source can break CDC pipelines. You need monitoring and alerting.
Real example: A banking platform uses Debezium to capture every transaction from their PostgreSQL core banking system. Changes flow through Kafka to the analytics warehouse within 30 seconds, enabling near-real-time fraud detection dashboards without putting any additional query load on the production database.
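Applying a stream of log-derived changes to a target looks roughly like the sketch below. The event shape loosely mirrors a Debezium envelope (`op` codes `c`/`u`/`d` for insert, update, delete), but treat the field names as illustrative rather than the exact wire format.

```python
# Debezium-style change events: op 'c' = insert, 'u' = update, 'd' = delete.
events = [
    {"op": "c", "key": 1, "after": {"id": 1, "balance": 100}},
    {"op": "u", "key": 1, "after": {"id": 1, "balance": 250}},
    {"op": "c", "key": 2, "after": {"id": 2, "balance": 75}},
    {"op": "d", "key": 2, "after": None},
]

def apply_changes(target, events):
    """Replay log-derived changes onto a target table (dict keyed by PK)."""
    for ev in events:
        if ev["op"] == "d":
            # Deletes are captured explicitly - something batch extracts
            # can't see without soft-delete columns.
            target.pop(ev["key"], None)
        else:
            target[ev["key"]] = ev["after"]  # insert and update both upsert
    return target

table = apply_changes({}, events)
print(table)  # only row 1 survives, with its latest balance
```

Note that only four small events crossed the wire here; a batch extract would have re-read every row to reach the same end state.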
Pattern 4: Streaming
How It Works
Events are produced to a streaming platform (Kafka, Kinesis, Pulsar) and consumed by stream processing engines (Spark Structured Streaming, Flink, Kafka Streams) that transform and route them in real time. Data is processed event-by-event or in micro-batches (sub-second to seconds).
When to Use It
- You need sub-second latency: real-time fraud detection, live pricing engines, IoT sensor monitoring.
- The data is naturally event-shaped (clickstream, sensor readings, financial ticks) rather than table-shaped.
- You need to react to events immediately, not just analyze them later.
Tools
Apache Kafka + Kafka Streams, Amazon Kinesis + Lambda, Apache Flink, Spark Structured Streaming on Databricks, Confluent Cloud.
Pros and Cons
- Pro: Lowest possible latency. Data is available for processing within milliseconds of being produced.
- Pro: Natural fit for event-driven architectures and microservices communication.
- Con: Significantly more complex to operate. Kafka clusters, schema registries, consumer group management, exactly-once semantics - this is a full-time job for 1–2 engineers.
- Con: Debugging is harder. There's no "re-run the job from yesterday" button like batch pipelines have.
Real example: A ride-sharing company processes GPS events from 50,000 active drivers through Kafka. Spark Structured Streaming calculates surge pricing zones every 500 milliseconds based on driver density and ride request volume. A batch job couldn't react fast enough - pricing would always be 15 minutes stale.
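The surge-pricing logic above amounts to a tumbling-window aggregation. Here is a pure-Python sketch of what Spark Structured Streaming or Flink would do continuously over a Kafka topic; the zone names, threshold, and window size are made up for illustration.

```python
from collections import defaultdict

def surge_zones(gps_events, window_ms=500, threshold=3):
    """Tumbling-window count of events per zone; flag zones over threshold.

    A real stream processor does this incrementally as events arrive;
    this batch version just shows the windowing logic.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ev in gps_events:
        window = ev["ts_ms"] // window_ms  # assign event to a tumbling window
        windows[window][ev["zone"]] += 1
    return {w: {z for z, n in zones.items() if n >= threshold}
            for w, zones in windows.items()}

events = [
    {"ts_ms": 10, "zone": "downtown"}, {"ts_ms": 120, "zone": "downtown"},
    {"ts_ms": 300, "zone": "downtown"}, {"ts_ms": 450, "zone": "airport"},
    {"ts_ms": 600, "zone": "airport"},  # lands in the next 500 ms window
]
print(surge_zones(events))
```

The hard parts that make streaming "a full-time job" - late-arriving events, state checkpointing, exactly-once delivery - are exactly what this toy version leaves out and what Flink or Spark handle for you.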
Pattern 5: Reverse ETL
How It Works
Reverse ETL syncs modeled, enriched data from your data warehouse back to operational SaaS tools (Salesforce, HubSpot, Zendesk, Google Ads). You define a SQL model in your warehouse, map its columns to fields in the destination, and schedule syncs. For a deeper dive, see our dedicated Reverse ETL guide.
When to Use It
- Your sales team needs lead scores, but they live in Salesforce. Your warehouse has the best lead scoring model.
- Marketing wants to build ad audiences from warehouse segments, pushed to Google Ads or Meta.
- Support needs customer health scores in Zendesk, derived from product usage data in your warehouse.
Tools
Census, Hightouch, Polytomic, built-in Reverse ETL features in Fivetran and Hevo.
Pros and Cons
- Pro: Operational teams get enriched data where they already work, without learning SQL or logging into a BI tool.
- Pro: The warehouse becomes the single source of truth - no more manually uploading CSVs to Salesforce.
- Con: API rate limits in destination tools constrain sync volume and frequency.
- Con: You need well-modeled "sync-ready" data marts. Reverse ETL doesn't fix messy data - it amplifies it.
Real example: A B2B SaaS company calculates customer health scores in Snowflake using product usage data, support ticket volume, and billing history. Census pushes those scores to Salesforce nightly. When a CSM opens an account, they see the health score right on the Salesforce record - no dashboard hopping required.
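The mechanics of a Reverse ETL sync reduce to column mapping plus batching for API rate limits. This sketch mirrors what tools like Census or Hightouch configure in their UI; the Salesforce field names are hypothetical examples of custom fields.

```python
def to_destination_batches(rows, field_map, batch_size=200):
    """Map warehouse columns to destination fields and chunk the result
    so each API request stays under the destination's rate limits."""
    mapped = [{dest: row[src] for src, dest in field_map.items()}
              for row in rows]
    return [mapped[i:i + batch_size] for i in range(0, len(mapped), batch_size)]

# Rows from a "sync-ready" warehouse model.
rows = [{"account_id": "001", "health_score": 82},
        {"account_id": "002", "health_score": 47}]

# Warehouse column -> destination field (hypothetical Salesforce custom fields).
field_map = {"account_id": "Salesforce_Account_Id__c",
             "health_score": "Health_Score__c"}

batches = to_destination_batches(rows, field_map, batch_size=1)
print(len(batches), batches[0])
```

Each batch would then be sent to the destination's bulk API, with retries and backoff - which is most of what you pay a Reverse ETL vendor to get right.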
Decision Framework: How to Choose
Four factors drive the decision:
| Factor | ETL | ELT | CDC | Streaming | Reverse ETL |
|---|---|---|---|---|---|
| Latency need | Hours | Hours | Seconds–minutes | Milliseconds | Minutes–hours |
| Data volume | Any | Any | High (incremental) | High throughput | Low–medium |
| Team skill | ETL tools | SQL + dbt | DB internals | Distributed systems | SQL + SaaS APIs |
| Complexity | Medium | Low | Medium | High | Low |
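The table above can be collapsed into a toy decision function. The latency thresholds below are illustrative cut-offs, not hard rules - real decisions also weigh team skills and complexity budget.

```python
def pick_pattern(latency_s, direction="into_warehouse", raw_allowed=True):
    """Toy encoding of the decision table; thresholds are illustrative."""
    if direction == "to_saas":
        return "Reverse ETL"   # warehouse -> operational SaaS tools
    if latency_s < 1:
        return "Streaming"     # genuine sub-second need only
    if latency_s < 300:
        return "CDC"           # seconds-to-minutes incremental sync
    # Batch territory: ELT unless compliance bars raw data from the warehouse.
    return "ELT" if raw_allowed else "ETL"

print(pick_pattern(3600))                    # -> 'ELT' (hourly reports)
print(pick_pattern(0.2))                     # -> 'Streaming' (fraud detection)
print(pick_pattern(3600, raw_allowed=False)) # -> 'ETL' (PII must be masked first)
```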
The common mistake: choosing streaming when daily batch would work fine. We've seen teams spend 6 months building Kafka-based streaming pipelines for data that powers weekly reports. If nobody checks the dashboard more than once a day, hourly batch ELT is simpler, cheaper, and just as effective. Save streaming for problems that genuinely need sub-second responses.
The modern default: most production stacks today use ELT (Fivetran/Hevo + dbt + Snowflake) as the core pattern, CDC for high-volume transactional tables that need near-real-time sync, and Reverse ETL to push warehouse data back to operational tools. Streaming is reserved for genuine real-time use cases like fraud detection, live personalization, or IoT processing.
Key Takeaways
- ETL transforms before loading - use it when raw data shouldn't enter the warehouse (compliance, PII concerns).
- ELT loads raw data first and transforms in-warehouse - the default pattern for cloud-native stacks with elastic compute.
- CDC captures incremental changes from database logs - the right choice for large transactional tables that need near-real-time sync.
- Streaming handles sub-second event processing - powerful but operationally expensive. Don't over-engineer.
- Reverse ETL pushes warehouse data to SaaS tools - bridges the gap between analytics and operations.
- Most modern stacks combine ELT + CDC + Reverse ETL. Only add streaming when you've confirmed batch won't work.
Frequently Asked Questions
Q: What is the difference between ETL and ELT?
ETL transforms data on a dedicated server before loading it into the warehouse. ELT loads raw data first and transforms it inside the warehouse using its compute power. ELT is cheaper and more flexible with modern cloud warehouses like Snowflake and BigQuery because you avoid maintaining separate transformation infrastructure.
Q: When should I use CDC instead of batch ETL?
Use CDC when you need near-real-time data sync without full table reloads. CDC captures only the rows that changed (inserts, updates, deletes) from the source database transaction log. It's ideal for large transactional tables where reloading millions of rows every hour would be too slow or expensive.
Q: What is Reverse ETL and why does it matter?
Reverse ETL syncs modeled data from your data warehouse back to operational tools like Salesforce, HubSpot, and Zendesk. It matters because your warehouse has the most complete, cleaned, and enriched customer data, but the teams who need it work in SaaS tools that can't query SQL directly.
Q: Is streaming always better than batch processing?
No. Streaming adds significant operational complexity (Kafka clusters, schema registries, exactly-once semantics). If your business can tolerate 15-minute or hourly latency, batch or micro-batch CDC is simpler, cheaper, and easier to debug. Only choose streaming when you genuinely need sub-second latency.