Processing 100M+ Rows in Talend Without Running Out of Memory
Last updated: July 2025
Quick answer: Talend's default JVM heap is too small for 100M+ rows. Increase it to 4-8GB with -Xmx8g. Replace row-by-row inserts with bulk operations (tBulkExec). Use tMap's "Reload at each row" for large lookups instead of loading everything into memory. Parallelize independent sub-jobs with tParallelize, and stage intermediate data to disk with tFileOutputDelimited instead of holding it in memory.
Introduction
Every Talend developer hits the same wall eventually: the job works fine on 1 million rows, then you point it at the production dataset with 100 million rows and get java.lang.OutOfMemoryError: Java heap space. The default Talend configuration isn't built for large datasets. This guide covers every optimization we've used across dozens of production Talend implementations to process datasets with 100M+ rows reliably. For the specific heap memory error and its quick fixes, see our heap memory troubleshooting guide.
Why Talend Runs Out of Memory
Talend jobs run on the JVM. The default heap size in Talend Studio is typically 1-2GB. That sounds like a lot until you realize what's happening in memory during a tMap operation.
When tMap loads a lookup, it reads every row from the lookup source into a Java HashMap. Each row becomes a HashMap entry with the key column(s) as the key and the entire row as the value. A 50-million-row lookup table with 10 columns of average width? That's easily 4-6GB of heap. Your 2GB JVM doesn't stand a chance.
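The multi-gigabyte figure comes from simple per-row arithmetic. A minimal sketch of that estimate, assuming rough JVM overheads (~48 bytes per HashMap entry and object headers, 2 bytes per UTF-16 String character) - the real numbers vary with JVM version and column types, and actual usage is often higher:

```java
// Back-of-envelope heap estimate for a tMap "Load once" lookup.
// The overhead constants are assumed ballpark JVM figures, not Talend-specific.
public class LookupHeapEstimate {

    static final long ENTRY_OVERHEAD_BYTES = 48; // HashMap.Entry + object headers (assumed)
    static final long BYTES_PER_CHAR = 2;        // Java String chars are UTF-16

    // rows: lookup row count; columns: column count; avgColumnWidth: avg chars per column
    public static long estimateBytes(long rows, int columns, int avgColumnWidth) {
        long perRowData = (long) columns * avgColumnWidth * BYTES_PER_CHAR;
        return rows * (perRowData + ENTRY_OVERHEAD_BYTES);
    }

    public static void main(String[] args) {
        // 50M rows, 10 columns, ~10 chars per column: far beyond a 2GB default heap
        long bytes = estimateBytes(50_000_000L, 10, 10);
        System.out.printf("Estimated lookup heap: %.1f GB%n", bytes / 1e9);
    }
}
```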
The other common culprit is row-by-row database operations. tDBOutput with the "Insert" action sends one INSERT statement per row by default. On 100M rows, that's 100M individual INSERT statements, 100M network round-trips, and the JDBC driver holding uncommitted transaction state in memory between commits. This isn't just slow - it's a memory leak waiting to happen.
Increasing JVM Heap Size
The first fix is the simplest: give the JVM more memory. In Talend Studio, go to Run > Advanced Settings > JVM Settings and add:
-Xmx8g -Xms4g -XX:+UseG1GC
- -Xmx8g sets the maximum heap to 8GB. Go higher if your server has the RAM.
- -Xms4g sets the initial heap to 4GB, avoiding the overhead of incremental allocation.
- -XX:+UseG1GC switches to the G1 garbage collector, which handles large heaps better than the default collector. It reduces pause times significantly on heaps above 4GB.
For exported jobs (running outside Studio), edit the generated shell script. Find the java command and modify the -Xmx parameter. On Talend Cloud or TAC, set these in the job execution server configuration.
Important: increasing heap doesn't fix the underlying problem - it just raises the ceiling. If your job's memory usage grows linearly with data volume, you'll hit the wall again at 200M rows. The real fixes are below.
tMap Lookup Strategies
tMap is the biggest memory consumer in most Talend jobs. It has 3 lookup loading strategies, and picking the right one is critical. For a deeper dive into tMap troubleshooting, see our Talend performance tuning guide.
"Load Once" (Default)
The default behavior: tMap reads the entire lookup table into memory before processing the first main row. Fast for small lookups (under 1M rows). Catastrophic for large ones. A 50M-row lookup loaded this way will consume 4-6GB of heap, leaving nothing for the actual transformation.
"Reload at Each Row"
tMap re-executes the lookup query for every row in the main flow. This uses almost zero memory because only 1 lookup result is in memory at a time. The trade-off: it fires a separate SQL query for every main row. On 100M main rows, that's 100M queries. You need an index on the lookup join column or this will be slower than loading everything into memory.
When to use it: when the lookup table is large (10M+ rows) but the main flow has a reasonably selective join. Add an index on the join column and the per-row query executes in microseconds.
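The trade-off is easy to see in miniature. In the sketch below, an in-memory map stands in for the indexed lookup table (a real tMap lookup would run a parameterized SELECT per row); the point is that the query count equals the main row count:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "Reload at each row" trade-off: one lookup query per main row.
// The in-memory map is a stand-in for an indexed database table so the
// query count is observable without a database.
public class ReloadAtEachRow {

    static long queriesFired = 0;

    // Stands in for "SELECT ... WHERE join_col = ?" against an indexed column.
    static String lookupQuery(Map<Integer, String> indexedTable, int key) {
        queriesFired++;
        return indexedTable.get(key);
    }

    public static void main(String[] args) {
        Map<Integer, String> lookup = new HashMap<>();
        for (int i = 0; i < 1_000; i++) lookup.put(i, "region-" + (i % 8));

        int mainRows = 10_000;
        for (int row = 0; row < mainRows; row++) {
            lookupQuery(lookup, row % 1_000); // one query per main row
        }
        System.out.println("Queries fired: " + queriesFired); // == mainRows
    }
}
```

With an index, each of those queries is microseconds; without one, each is a table scan - which is why the index is non-negotiable.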
"Reload at Each Row" with Cache
A hybrid approach available in some Talend versions. It caches recent lookup results in a small LRU cache. If the same key appears multiple times in the main flow, it serves from cache instead of re-querying. Useful when main flow rows frequently hit the same lookup keys.
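The caching behavior is essentially an LRU map. A minimal sketch of that idea using the JDK's LinkedHashMap in access order (this is illustrative - Talend's internal cache implementation is not documented as this exact structure):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch for "Reload at each row with cache":
// a cache hit serves the lookup result without re-firing the SQL query;
// the least-recently-used entry is evicted once capacity is exceeded.
public class LookupLruCache<K, V> extends LinkedHashMap<K, V> {

    private final int capacity;

    public LookupLruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true -> LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the LRU entry beyond capacity
    }

    public static void main(String[] args) {
        LookupLruCache<Integer, String> cache = new LookupLruCache<>(2);
        cache.put(1, "a");
        cache.put(2, "b");
        cache.get(1);      // touch key 1, so key 2 becomes least recently used
        cache.put(3, "c"); // evicts key 2
        System.out.println(cache.keySet()); // [1, 3]
    }
}
```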
tHashInput and tHashOutput for Memory-Efficient Lookups
The tHashInput/tHashOutput pair lets you store intermediate data in a temporary hash storage that can spill to disk when memory is tight. Instead of holding a full dataset in a tMap lookup, you write it to a tHashOutput first, then reference it from tHashInput in a subsequent sub-job. This gives you more control over when data is loaded and released from memory.
The pattern: Sub-job 1 reads the lookup data and writes it to tHashOutput. Sub-job 2 reads the main flow and uses tHashInput as the lookup source in tMap. Because sub-job 1 completes before sub-job 2 starts, Talend can manage memory more efficiently.
Bulk Loading Instead of Row-by-Row
This is probably the single biggest performance improvement you can make. The difference between tDBOutput (row-by-row) and tBulkExec (bulk load) on 100M rows is the difference between 8 hours and 15 minutes. It's not a small optimization - it's an architectural change.
For Snowflake, use tSnowflakeBulkExec which stages data to cloud storage and runs a COPY INTO command. For MySQL, use tMysqlBulkExec with LOAD DATA INFILE. For PostgreSQL, tPostgresqlBulkExec uses the COPY command. For a detailed comparison between output components, see our tDBOutput vs tDBOutputBulk vs tDBBulkExec guide.
tFileInputDelimited --> tMap --> tFileOutputDelimited (stage to file) --> tSnowflakeBulkExec (COPY INTO)
The pattern: transform your data and write it to a delimited file using tFileOutputDelimited. Then use tSnowflakeBulkExec (or the database-specific bulk exec component) to load that file in one operation. The file write is sequential and memory-efficient. The bulk load uses the database's optimized import path.
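A sketch of the two halves of that pattern: a delimited stage-file write, and the COPY INTO statement a component like tSnowflakeBulkExec would effectively run. The stage name (@my_stage) and table name (target_table) are hypothetical, and the SQL is composed but not executed here:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Stage-then-bulk-load sketch: write rows to a delimited file, then hand the
// whole file to the database's bulk import path in one statement.
public class StageAndBulkLoad {

    // Write rows to a delimited stage file (what tFileOutputDelimited produces).
    public static Path writeStageFile(List<String[]> rows, Path file) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(file)) {
            for (String[] row : rows) {
                w.write(String.join(";", row)); // ';' is Talend's default field separator
                w.newLine();
            }
        }
        return file;
    }

    // Compose the bulk-load statement (illustrative Snowflake COPY INTO syntax).
    public static String copyIntoSql(String table, String stageFileName) {
        return "COPY INTO " + table
             + " FROM @my_stage/" + stageFileName
             + " FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ';')";
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("stage", ".csv");
        writeStageFile(List.of(new String[]{"1", "alice"}, new String[]{"2", "bob"}), tmp);
        System.out.println(copyIntoSql("target_table", tmp.getFileName().toString()));
    }
}
```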
Parallelization with tParallelize and tPartitioner
tParallelize runs independent sub-jobs concurrently. If you're processing 10 input files, tParallelize can process all 10 simultaneously (limited by available CPU cores and memory). Each parallel execution gets its own thread with its own memory allocation.
tPartitioner splits a single data flow into multiple parallel threads based on a partition key. If you're processing a 100M-row table and partition by customer_region (with 8 regions), tPartitioner creates 8 threads, each processing ~12.5M rows. This works well when the downstream processing is CPU-bound rather than I/O-bound.
Gotcha: parallelization multiplies memory usage. 4 parallel threads each using 2GB of heap means you need at least 8GB total. Set your -Xmx accordingly, and don't parallelize more threads than your server has CPU cores.
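The mechanics behind tPartitioner can be sketched in plain Java: hash each row's partition key to one of N buckets, process each bucket on its own thread, and combine the results. The region values and per-row "work" are placeholders:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// tPartitioner-style sketch: route rows to partitions by key hash, then
// process each partition concurrently on its own thread.
public class PartitionSketch {

    public static long processInPartitions(List<String> rows, int partitions)
            throws InterruptedException, ExecutionException {
        // Route each row to a partition by the hash of its partition key.
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) buckets.add(new ArrayList<>());
        for (String row : rows) {
            String key = row.split(";")[1]; // partition key, e.g. customer_region
            buckets.get(Math.floorMod(key.hashCode(), partitions)).add(row);
        }

        // One worker thread per partition; each returns its processed-row count.
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        List<Future<Long>> results = new ArrayList<>();
        for (List<String> bucket : buckets) {
            results.add(pool.submit(() -> (long) bucket.size())); // placeholder "work"
        }
        long total = 0;
        for (Future<Long> f : results) total += f.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        List<String> rows = new ArrayList<>();
        String[] regions = {"north", "south", "east", "west"};
        for (int i = 0; i < 10_000; i++) rows.add(i + ";" + regions[i % 4]);
        System.out.println("Processed: " + processInPartitions(rows, 4)); // 10000
    }
}
```

Note the memory implication from the gotcha above: each bucket (and each thread's working set) lives in the same heap, so N partitions multiply peak usage roughly N-fold.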
tBufferOutput for Sorting Without Memory Overflow
Sorting 100M rows in memory requires loading all rows into a data structure, which is exactly what you're trying to avoid. tBufferOutput writes rows to a temporary buffer that automatically spills to disk when memory is low. Pair it with tBufferInput in the next sub-job to read the buffered data back in sorted order.
For external sorting on really large datasets, write to tFileOutputDelimited first, then use a system-level sort command (or break the file into sorted chunks and merge them). This is more work to set up but handles datasets of any size without memory concerns.
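The chunk-and-merge approach is classic external merge sort. A compact sketch with deliberately tiny chunks (a real job would use chunks of millions of rows and delimited files instead of plain lines):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// External merge sort sketch: sort fixed-size chunks to temp files,
// then k-way merge them with a priority queue holding one line per chunk.
public class ExternalSortSketch {

    public static List<Path> sortChunks(Iterator<String> rows, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buf = new ArrayList<>(chunkSize);
        while (rows.hasNext()) {
            buf.add(rows.next());
            if (buf.size() == chunkSize || !rows.hasNext()) {
                Collections.sort(buf); // in-memory sort of one small chunk
                Path p = Files.createTempFile("chunk", ".txt");
                p.toFile().deleteOnExit();
                Files.write(p, buf);
                chunks.add(p);
                buf.clear();
            }
        }
        return chunks;
    }

    public static List<String> merge(List<Path> chunks) throws IOException {
        // Heap keyed on each chunk's current head line: always pop the global minimum.
        PriorityQueue<AbstractMap.SimpleEntry<String, BufferedReader>> heap =
                new PriorityQueue<>((a, b) -> a.getKey().compareTo(b.getKey()));
        for (Path p : chunks) {
            BufferedReader r = Files.newBufferedReader(p);
            String line = r.readLine();
            if (line != null) heap.add(new AbstractMap.SimpleEntry<>(line, r));
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            AbstractMap.SimpleEntry<String, BufferedReader> e = heap.poll();
            out.add(e.getKey());
            String next = e.getValue().readLine(); // refill from the same chunk
            if (next != null) heap.add(new AbstractMap.SimpleEntry<>(next, e.getValue()));
            else e.getValue().close();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add(String.format("%03d", i));
        Collections.shuffle(data, new java.util.Random(42));
        List<String> sorted = merge(sortChunks(data.iterator(), 10));
        System.out.println(sorted.get(0) + " .. " + sorted.get(99)); // 000 .. 099
    }
}
```

Only one chunk is ever fully in memory during the sort phase, and only one line per chunk during the merge, so the peak heap is independent of total row count.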
Database Connection Management
Two settings that matter a lot at scale:
- Commit interval: The default commit interval for tDBOutput is often 10,000 rows. On 100M rows, that's 10,000 commits. Each commit requires the database to flush its write-ahead log. Increase the commit interval to 100,000 or even 500,000 for large batch loads. This reduces commit overhead by 10-50x.
- Connection pooling: If your job has multiple sub-jobs hitting the same database, use tDBConnection at the beginning to create a shared connection. Without this, each sub-job opens and closes its own connection, which adds connection establishment overhead (200-500ms per connect on most databases) and can exhaust your database's connection pool.
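The commit-interval arithmetic is worth seeing concretely. In this sketch, the commit is just a counter standing in for connection.commit() (and the WAL flush it forces on the database), so the overhead difference is visible without a database:

```java
// Commit-interval batching sketch: accumulate rows and commit every
// COMMIT_INTERVAL rows, the way tDBOutput's commit setting works.
public class CommitIntervalSketch {

    static long commits = 0;

    static void commit() { commits++; } // stands in for connection.commit() + WAL flush

    public static long load(long totalRows, long commitInterval) {
        commits = 0;
        long pending = 0;
        for (long row = 0; row < totalRows; row++) {
            pending++;                  // stands in for stmt.addBatch(...)
            if (pending == commitInterval) {
                commit();
                pending = 0;
            }
        }
        if (pending > 0) commit();      // flush the final partial batch
        return commits;
    }

    public static void main(String[] args) {
        System.out.println("10K interval:  " + load(100_000_000L, 10_000) + " commits");  // 10000
        System.out.println("500K interval: " + load(100_000_000L, 500_000) + " commits"); // 200
    }
}
```

Going from a 10K to a 500K interval cuts commits from 10,000 to 200 on 100M rows - a 50x reduction in commit overhead, matching the range quoted above.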
tFileOutputDelimited as Intermediate Staging
When you have a complex multi-step transformation, don't try to do everything in one continuous flow. Break it into stages with file-based handoffs. Stage 1 reads source data, applies initial transformations, and writes to a temp file. Stage 2 reads the temp file, joins with lookup data, and writes to another temp file. Stage 3 bulk-loads the final file into the target database.
Each stage runs independently, processes data row-by-row (streaming, not buffered), and releases memory when it completes. The disk I/O overhead is minimal compared to the memory savings. On a modern SSD, writing and reading a 100M-row CSV file takes 3-5 minutes. That's nothing compared to the hours you'd lose debugging an out-of-memory crash at row 87 million.
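The staged handoff can be sketched in a few lines: stage 1 streams rows out to a temp file one at a time, stage 2 streams them back in. The transformation is a placeholder; the point is that only one row is ever in memory, so the heap stays flat regardless of row count:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// File-based stage handoff sketch: write row-by-row, read row-by-row,
// never holding the full dataset in memory.
public class StagedPipeline {

    // Stage 1: transform and write one row at a time (like tFileOutputDelimited).
    public static Path stageOut(long rows, Path file) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(file)) {
            for (long i = 0; i < rows; i++) {
                w.write(i + ";" + (i * 2)); // placeholder transformation
                w.newLine();
            }
        }
        return file;
    }

    // Stage 2: stream the staged file back in (like tFileInputDelimited).
    public static long stageIn(Path file) throws IOException {
        long count = 0;
        try (BufferedReader r = Files.newBufferedReader(file)) {
            while (r.readLine() != null) count++; // process one row at a time
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("stage1", ".csv");
        stageOut(100_000, tmp);
        System.out.println("Rows staged and read back: " + stageIn(tmp)); // 100000
        Files.delete(tmp);
    }
}
```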
Production Performance Checklist
- Use bulk operations for all database writes. tBulkExec components are 10-50x faster than tDBOutput for large datasets.
- Increase heap to 4-8GB with -Xmx. Use G1GC for heaps above 4GB.
- External sort via tFileOutputDelimited instead of in-memory sorting. Write to disk, sort on disk, read back.
- Parallelize independent sub-jobs with tParallelize. Don't exceed CPU core count.
- Minimize tLogRow usage in production. tLogRow writes to stdout for every row. On 100M rows, that's 100M console writes. Disable it or replace with tFileOutputDelimited for debugging output.
- Disable statistics monitoring in production runs. The Talend statistics panel (row counts, throughput) adds measurable overhead. In Studio, uncheck "Statistics" before running large jobs. In exported jobs, remove the --stat flag.
- Set commit intervals to 100K+ rows for database output components.
- Use tAdvancedHash join instead of tMap for large lookups. It uses 20-30% less memory than tMap's HashMap implementation.
Gotcha: tMap's "Store Temp Data" Option
tMap has a "Store temp data" checkbox that writes intermediate data to disk instead of holding it in memory. Sounds like a silver bullet for memory issues, and it does prevent out-of-memory errors. But it adds 30-40% processing overhead because every row gets serialized to disk and deserialized back. On 100M rows, that's significant.
Use "Store temp data" as a safety net for jobs that occasionally process large datasets but usually handle smaller ones. If your job always processes 100M+ rows, it's better to redesign the pipeline (bulk loads, file staging, partitioning) than to rely on temp data spilling.
Key Takeaways
- Talend's default configuration can't handle 100M+ rows. You need to change JVM settings, component choices, and pipeline architecture.
- The biggest win is switching from row-by-row inserts to bulk operations. This alone can cut job runtime from hours to minutes.
- tMap lookup strategy matters: "Load once" for small lookups, "Reload at each row" with indexed join columns for large lookups.
- Break complex jobs into file-based stages. Disk I/O is cheap; memory crashes are expensive.
- Parallelization helps CPU-bound jobs but multiplies memory usage. Plan heap accordingly.
- tAdvancedHash join is faster and lighter than tMap for large lookups. Use it whenever the join logic is straightforward.
Frequently Asked Questions
Q: Why does Talend run out of memory on large datasets?
Talend runs on the JVM with a default heap size of 1-2GB. The tMap component loads lookup data entirely into memory by default. With a 50M-row lookup table, the HashMap holding that data can easily exceed available heap, causing an OutOfMemoryError.
Q: How do I increase Talend's JVM heap size?
In Talend Studio, go to Run > Advanced Settings > JVM Settings and set -Xmx4g or -Xmx8g. For exported jobs, edit the shell script and modify the java command's -Xmx parameter. For Talend Cloud/TAC, set heap parameters in the job's execution configuration.
Q: What is the difference between tMap and tAdvancedHash join?
tMap uses a Java HashMap for lookups, which is flexible but memory-intensive. tAdvancedHash uses a more memory-efficient hash join algorithm optimized for large lookups. For lookup tables over 10M rows, tAdvancedHash typically uses 20-30% less memory than tMap.
Q: Should I use tParallelize or tPartitioner for multi-threaded processing?
tParallelize runs independent sub-jobs concurrently (good for processing multiple files in parallel). tPartitioner splits a single data flow into multiple threads based on a partition key. Use tParallelize for independent tasks and tPartitioner for splitting a single large dataset.
