Building a Production Talend-to-Snowflake ETL Pipeline: Architecture and Gotchas
Last updated: February 2025
Quick answer: The fastest production pattern is: parent orchestration job that calls child jobs for extract, transform, and load. Use the Snowflake JDBC driver 3.13.x+ with a fetch size of 10,000+. For bulk loading, stage files to S3 or Azure Blob, then COPY INTO Snowflake -- it's 5-10x faster than row-by-row inserts. Handle errors with tLogCatcher writing to a dedicated error table in Snowflake.
Job Architecture: Parent + Child Pattern
A production Talend-to-Snowflake pipeline needs three components: a parent orchestration job for flow control, child jobs for each table extraction, and a Snowflake bulk load step (stage files, then COPY INTO). This architecture handles millions of records daily with checkpoint recovery, parallel execution, and automated error routing. Below is the exact job structure we use across 30+ production deployments. The parent calls child jobs for each phase: extraction, transformation, and loading. It handles logging, error aggregation, and notifications. Each child job does exactly one thing.
Why not put everything in one job? Because debugging a 200-component job at 2am is miserable. With the parent-child pattern, if the load phase fails, you can re-run just that child job with the already-extracted data. You don't have to re-extract from source systems, which matters when those systems have rate limits or maintenance windows.
Typical structure:
- job_orchestrator - Parent job. Calls children in sequence, handles global error catching via tLogCatcher, sends Slack/email alerts on failure.
- job_extract_orders - Reads from source (MySQL, API, SFTP), writes CSV/Parquet to a staging area (local filesystem or S3).
- job_transform_orders - Reads staged files, applies business rules, deduplication, type conversions. Writes transformed files back to staging.
- job_load_orders - PUT files to Snowflake internal stage or S3 external stage, then COPY INTO target tables.
Snowflake JDBC Connection Setup
Use the Snowflake JDBC driver, not the generic JDBC driver. The Snowflake-specific driver (net.snowflake.client.jdbc.SnowflakeDriver) supports features like multi-statement execution, PUT/GET commands, and Snowflake-specific session parameters. Use version 3.13.x or later - earlier versions have known issues with timezone handling and VARIANT column reads.
In Talend, set this up in a tDBConnection component with these parameters:
jdbc:snowflake://<account>.snowflakecomputing.com/?db=PROD_DB&schema=PUBLIC&warehouse=ETL_WH&role=ETL_ROLE&CLIENT_SESSION_KEEP_ALIVE=true
The CLIENT_SESSION_KEEP_ALIVE=true parameter prevents the session from timing out during long-running jobs. Without it, Snowflake terminates sessions after four hours of inactivity, so a Talend job whose connection sits idle during a long extract or transform phase gets disconnected mid-run.
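If you can't modify the connection URL, the same setting can be applied per session from a tDBRow right after the connection opens; a minimal sketch:

```sql
-- Session-level equivalent of the URL parameter; run once after connecting.
ALTER SESSION SET CLIENT_SESSION_KEEP_ALIVE = TRUE;
```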
Bulk Loading: Stage + COPY INTO
Row-by-row inserts through JDBC are painfully slow for anything over 10,000 rows. The production pattern is: write files to a stage (S3 external stage or Snowflake internal stage), then run COPY INTO. This is 5-10x faster because Snowflake can parallelize the file ingestion across its compute nodes.
In Talend, use tSnowflakeOutputBulk to write CSV files to a local directory, then tSnowflakeBulkExec to PUT and COPY them into Snowflake. Or, write directly to S3 using tFileOutputDelimited with an S3 connection, then execute the COPY INTO via tDBRow.
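The COPY INTO example below assumes the external stage already exists. A one-time setup sketch, where the storage integration name and bucket path are placeholders for your own:

```sql
-- One-time setup for the S3 external stage referenced by COPY INTO.
-- etl_s3_int and the bucket path are assumptions; substitute your own.
CREATE STAGE IF NOT EXISTS prod_db.staging.s3_stage
    URL = 's3://my-etl-bucket/staging/'
    STORAGE_INTEGRATION = etl_s3_int
    FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);
```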
COPY INTO prod_db.staging.orders
FROM @prod_db.staging.s3_stage/orders/
FILE_FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1
NULL_IF = ('NULL', '')
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
)
ON_ERROR = 'CONTINUE'
PURGE = TRUE;
The ON_ERROR = 'CONTINUE' setting loads all valid rows and skips bad ones. After the COPY, check VALIDATE() or the COPY_HISTORY view to find which rows were rejected and why.
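Both checks can run from a tDBRow step right after the load. A sketch using the table names from the COPY above:

```sql
-- Row-level detail for the most recent COPY into this table.
SELECT * FROM TABLE(VALIDATE(prod_db.staging.orders, JOB_ID => '_last'));

-- Per-file summary of loads over the last 24 hours.
SELECT file_name, status, row_count, row_parsed, first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'PROD_DB.STAGING.ORDERS',
    START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())));
```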
Handling Snowflake's VARIANT Type
Snowflake's VARIANT type stores semi-structured data (JSON, XML, Avro). Talend doesn't natively understand VARIANT, so you need a workaround. The cleanest approach: load your JSON as a string into a VARCHAR column in a staging table, then use a Snowflake view or downstream SQL to parse it with Snowflake's PARSE_JSON() function. Alternatively, load the raw JSON file directly into a single-column VARIANT table using COPY INTO with TYPE = 'JSON'.
COPY INTO staging.raw_events (event_data)
FROM @s3_stage/events/
FILE_FORMAT = (TYPE = 'JSON')
ON_ERROR = 'CONTINUE';
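Once the raw events are loaded, a view can expose typed columns to downstream consumers. A sketch, assuming the payload carries event_id, user_id, and ts fields (the field names are illustrative, not from a real schema):

```sql
CREATE OR REPLACE VIEW staging.events AS
SELECT
    event_data:event_id::NUMBER     AS event_id,
    event_data:user_id::VARCHAR(50) AS user_id,
    event_data:ts::TIMESTAMP_NTZ    AS event_ts,
    event_data                      AS raw_payload  -- keep the original for debugging
FROM staging.raw_events;
```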
Error Handling: tLogCatcher + Error Table
Don't just let jobs fail silently. Every parent job should have a tLogCatcher component connected to a tDBOutput that writes to a dedicated error table in Snowflake. Log the job name, component name, error message, error code, timestamp, and - critically - enough context to identify the source row.
CREATE TABLE ops.etl_error_log (
    error_id      NUMBER AUTOINCREMENT,
    job_name      VARCHAR(200),
    component     VARCHAR(200),
    error_code    VARCHAR(50),
    error_message VARCHAR(4000),
    source_row    VARCHAR(4000),
    logged_at     TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
For rejected rows during COPY INTO, query VALIDATE(table_name, job_id => '_last') right after the COPY and insert the rejected rows into your error table. This gives you a complete audit trail of what failed and why.
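In practice this is one INSERT ... SELECT in a tDBRow right after the COPY. A sketch against the error table above (the job name literal is a placeholder, and the VALIDATE output column names should be checked against your Snowflake version):

```sql
INSERT INTO ops.etl_error_log (job_name, component, error_code, error_message, source_row)
SELECT 'job_load_orders', 'COPY INTO', v.code::VARCHAR, v.error, v.rejected_record
FROM TABLE(VALIDATE(prod_db.staging.orders, JOB_ID => '_last')) v;
```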
Parallel Execution with tParallelize
If you're loading 20 independent tables, don't load them sequentially. Talend's tParallelize component runs multiple subjobs concurrently. Group independent loads together and run them in parallel. Just make sure each parallel branch uses its own Snowflake connection - sharing a single connection across parallel threads causes locking issues.
We typically set parallelism to 4-6 concurrent loads. Higher than that and you risk saturating the Snowflake warehouse or hitting the Snowflake session limit (which defaults to 500 per account but can be lower depending on your plan).
Incremental Loads
Full reloads are simple but wasteful. For incremental loads, you've got two options:
- Snowflake's CHANGES clause: If you've enabled change tracking on the source table (ALTER TABLE ... SET CHANGE_TRACKING = TRUE), you can query SELECT * FROM my_table CHANGES(INFORMATION => DEFAULT) AT(OFFSET => -3600) to get rows modified in the last hour. Clean and database-native.
- Timestamp-based filtering in Talend: Use a tDBInput with a WHERE clause that filters on a modified_at or updated_at timestamp. Store the last successful extraction timestamp in a context variable or a control table. On the next run, pick up where you left off.
The timestamp-based approach is simpler and works across any source system. Snowflake's CHANGES clause is cleaner but only works when reading from Snowflake itself.
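The control-table variant takes only a few statements per run. A sketch, with illustrative table and column names:

```sql
-- One-time setup: one row per source table, tracking the last good watermark.
CREATE TABLE IF NOT EXISTS ops.etl_watermarks (
    table_name  VARCHAR(200),
    last_loaded TIMESTAMP_NTZ
);

-- At extract time, read the watermark into a Talend context variable and
-- filter the source query with it:
--   SELECT * FROM orders WHERE updated_at > <watermark>
-- After a successful load, advance it. Using MAX(updated_at) from the loaded
-- batch (rather than CURRENT_TIMESTAMP) avoids missing late-committing rows.
UPDATE ops.etl_watermarks
SET last_loaded = (SELECT MAX(updated_at) FROM prod_db.staging.orders)
WHERE table_name = 'orders';
```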
Job Scheduling: TAC vs External Schedulers
Talend Administration Center (TAC) handles basic scheduling: run job X at 6am daily. It works fine for simple schedules. But if you need dependency management (run job B only after job A succeeds), cross-tool orchestration (run a Talend job after a dbt run completes), or dynamic scheduling, you'll outgrow TAC quickly. Airflow is the most common replacement we see. Define your Talend jobs as BashOperators in an Airflow DAG and let Airflow handle the dependency graph.
Gotchas You'll Hit
- Default JDBC fetch size is 1000. For large extracts (millions of rows), this means thousands of round trips between Talend and the source database. Set it to 10,000+ in the tDBConnection's Advanced Settings. We've seen 3x extraction speed improvements just from this one change.
- The Snowflake bulk loader PUT has a 16MB file size limit. If your staged files exceed 16MB, the PUT command in tSnowflakeBulkExec will fail. Split large files before staging, or use S3 directly (which has no file size limit for staging).
- VARCHAR defaults to 16,777,216 bytes in Snowflake. If Talend auto-creates your target table, every VARCHAR column will be 16MB. This doesn't affect storage (Snowflake stores only the actual data length), but it bloats the table metadata and can confuse downstream tools like Power BI that read the column definition and try to allocate memory accordingly. Always define explicit column lengths.
- Timezone handling between Talend and Snowflake is tricky. Talend uses the JVM's default timezone. Snowflake has TIMESTAMP_NTZ (no timezone), TIMESTAMP_LTZ (local timezone), and TIMESTAMP_TZ (with timezone). If you don't align these, you'll get shifted timestamps. Our rule: use TIMESTAMP_NTZ everywhere and ensure the JVM timezone is set to UTC with -Duser.timezone=UTC in the JVM arguments.
- Talend's memory management with large datasets. Talend processes data in memory by default. If you're transforming 50 million rows with tMap, you'll run out of heap space. Use tMap's "Store temp data" option to spill to disk, or redesign to process data in chunks using tFlowToIterate with a batch size.
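For the VARCHAR gotcha above, the fix is to own the DDL instead of letting Talend auto-create the target. A sketch with explicit lengths (column sizes are illustrative):

```sql
CREATE TABLE prod_db.staging.orders (
    order_id    NUMBER(38,0),
    customer_id NUMBER(38,0),
    status      VARCHAR(20),    -- explicit, not the 16 MB default
    notes       VARCHAR(2000),
    updated_at  TIMESTAMP_NTZ
);
```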
Conclusion
A production Talend-to-Snowflake pipeline isn't complicated, but it has a lot of moving parts that need to be configured correctly. Use the parent-child job pattern for maintainability. Stage files to S3 and COPY INTO for performance. Set the JDBC fetch size to 10,000+. Log errors to a dedicated table. And test your timezone handling before it bites you at 3am. Get these fundamentals right and the pipeline will run reliably for years.
