Building a Production Talend-to-Snowflake ETL Pipeline: Architecture and Gotchas

Celestinfo Software Solutions Pvt. Ltd. Jan 30, 2025

Last updated: February 2025

Quick answer: The fastest production pattern is: parent orchestration job that calls child jobs for extract, transform, and load. Use the Snowflake JDBC driver 3.13.x+ with a fetch size of 10,000+. For bulk loading, stage files to S3 or Azure Blob, then COPY INTO Snowflake -- it's 5-10x faster than row-by-row inserts. Handle errors with tLogCatcher writing to a dedicated error table in Snowflake.

Job Architecture: Parent + Child Pattern

A production Talend-to-Snowflake pipeline needs three components: a parent orchestration job for flow control, child jobs for each table extraction, and a Snowflake bulk loader (tSnowflakeOutputBulk + tSnowflakeBulkExec driving COPY INTO). This architecture handles millions of records daily with checkpoint recovery, parallel execution, and automated error routing. Below is the exact job structure we use across 30+ production deployments. The parent job calls child jobs for each phase: extraction, transformation, and loading. It also handles logging, error aggregation, and notifications. Each child job does exactly one thing.

Why not put everything in one job? Because debugging a 200-component job at 2am is miserable. With the parent-child pattern, if the load phase fails, you can re-run just that child job with the already-extracted data. You don't have to re-extract from source systems, which matters when those systems have rate limits or maintenance windows.

Typical structure:

parent_orchestration   (flow control, logging, error aggregation, notifications)
  ├── child: extract    (one job per source table)
  ├── child: transform
  └── child: load       (tSnowflakeOutputBulk + tSnowflakeBulkExec → COPY INTO)

Snowflake JDBC Connection Setup

Use the Snowflake JDBC driver, not the generic JDBC driver. The Snowflake-specific driver (net.snowflake.client.jdbc.SnowflakeDriver) supports features like multi-statement execution, PUT/GET commands, and Snowflake-specific session parameters. Use version 3.13.x or later - earlier versions have known issues with timezone handling and VARIANT column reads.

In Talend, set this up in a tDBConnection component with these parameters:

JDBC connection string
jdbc:snowflake://<account>.snowflakecomputing.com/
  ?db=PROD_DB
  &schema=PUBLIC
  &warehouse=ETL_WH
  &role=ETL_ROLE
  &CLIENT_SESSION_KEEP_ALIVE=true

The CLIENT_SESSION_KEEP_ALIVE=true parameter prevents the session from timing out during long-running jobs. Without it, Snowflake terminates any session that sits idle for more than 4 hours (the default timeout), so a Talend job whose connection sits idle through a long extract or transform phase gets disconnected before the load.

Bulk Loading: Stage + COPY INTO

Row-by-row inserts through JDBC are painfully slow for anything over 10,000 rows. The production pattern is: write files to a stage (S3 external stage or Snowflake internal stage), then run COPY INTO. This is 5-10x faster because Snowflake can parallelize the file ingestion across its compute nodes.

In Talend, use tSnowflakeOutputBulk to write CSV files to a local directory, then tSnowflakeBulkExec to PUT and COPY them into Snowflake. Or write delimited files with tFileOutputDelimited, push them to S3 with tS3Put, and execute the COPY INTO via tDBRow.

SQL - COPY INTO from S3 external stage
COPY INTO prod_db.staging.orders
FROM @prod_db.staging.s3_stage/orders/
  FILE_FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1
    NULL_IF = ('NULL', '')
    ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
  )
  ON_ERROR = 'CONTINUE'
  PURGE = TRUE;

The ON_ERROR = 'CONTINUE' setting loads all valid rows and skips bad ones. After the COPY, check VALIDATE() or the COPY_HISTORY view to find which rows were rejected and why.
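The follow-up check can look like this; the table name comes from the COPY example above, and VALIDATE's JOB_ID => '_last' refers to the most recent COPY run in the current session:

```sql
-- Rows rejected by the most recent COPY in this session
SELECT error, code, line, rejected_record
FROM TABLE(VALIDATE(prod_db.staging.orders, JOB_ID => '_last'));

-- Load history for the table over the past 24 hours
SELECT file_name, status, row_count, row_parsed, first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'PROD_DB.STAGING.ORDERS',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())));
```

Run the VALIDATE query in the same session as the COPY, or it has nothing to refer to.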

Handling Snowflake's VARIANT Type

Snowflake's VARIANT type stores semi-structured data (JSON, XML, Avro). Talend doesn't natively understand VARIANT, so you need a workaround. The cleanest approach: load your JSON as a string into a VARCHAR column in a staging table, then use a Snowflake view or downstream SQL to parse it with Snowflake's PARSE_JSON() function. Alternatively, load the raw JSON file directly into a single-column VARIANT table using COPY INTO with TYPE = 'JSON'.

SQL - Load JSON into VARIANT column
COPY INTO staging.raw_events (event_data)
FROM @s3_stage/events/
  FILE_FORMAT = (TYPE = 'JSON')
  ON_ERROR = 'CONTINUE';
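Downstream, a view can flatten the VARIANT column into typed columns. A minimal sketch, assuming hypothetical field names (event_id, event_type, ts) in the JSON:

```sql
-- Flatten the VARIANT column loaded above into typed columns
CREATE OR REPLACE VIEW staging.events_flat AS
SELECT
  event_data:event_id::NUMBER      AS event_id,    -- path extraction + cast
  event_data:event_type::VARCHAR   AS event_type,
  event_data:ts::TIMESTAMP_NTZ     AS event_ts
FROM staging.raw_events;
```

If you took the VARCHAR-staging route instead, wrap the column in PARSE_JSON() first, e.g. PARSE_JSON(raw_json):event_id::NUMBER.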

Error Handling: tLogCatcher + Error Table

Don't just let jobs fail silently. Every parent job should have a tLogCatcher component connected to a tDBOutput that writes to a dedicated error table in Snowflake. Log the job name, component name, error message, error code, timestamp, and - critically - enough context to identify the source row.

SQL - Error logging table
CREATE TABLE ops.etl_error_log (
  error_id      NUMBER AUTOINCREMENT,
  job_name      VARCHAR(200),
  component     VARCHAR(200),
  error_code    VARCHAR(50),
  error_message VARCHAR(4000),
  source_row    VARCHAR(4000),
  logged_at     TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

For rejected rows during COPY INTO, query VALIDATE(table_name, job_id => '_last') right after the COPY and insert the rejected rows into your error table. This gives you a complete audit trail of what failed and why.
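A sketch of that insert, run via tDBRow immediately after the COPY; the job and component names here are illustrative:

```sql
-- Route rows rejected by the last COPY into the error table
INSERT INTO ops.etl_error_log
  (job_name, component, error_code, error_message, source_row)
SELECT
  'orders_load'          AS job_name,    -- hypothetical job name
  'tSnowflakeBulkExec'   AS component,
  code::VARCHAR          AS error_code,
  error                  AS error_message,
  rejected_record        AS source_row
FROM TABLE(VALIDATE(prod_db.staging.orders, JOB_ID => '_last'));
```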

Parallel Execution with tParallelize

If you're loading 20 independent tables, don't load them sequentially. Talend's tParallelize component runs multiple subjobs concurrently. Group independent loads together and run them in parallel. Just make sure each parallel branch uses its own Snowflake connection - sharing a single connection across parallel threads causes locking issues.

We typically set parallelism to 4-6 concurrent loads. Beyond that you risk saturating the Snowflake warehouse: a warehouse's MAX_CONCURRENCY_LEVEL defaults to 8, so additional statements just queue behind the others and you gain nothing.

Incremental Loads

Full reloads are simple but wasteful. For incremental loads, you've got two options:

- Timestamp-based: store a watermark (the timestamp of the last successful load) and extract only rows modified after it.
- Snowflake's CHANGES clause: let Snowflake's change tracking tell you which rows were inserted, updated, or deleted since a point in time.

The timestamp-based approach is simpler and works across any source system. Snowflake's CHANGES clause is cleaner but only works when reading from Snowflake itself, and requires change tracking to be enabled on the table.
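Sketches of both, assuming a hypothetical ops.etl_watermarks bookkeeping table for the timestamp approach and change tracking already enabled for the CHANGES approach:

```sql
-- Option 1: timestamp watermark (works against any source system)
SELECT *
FROM source_db.sales.orders
WHERE updated_at > (
  SELECT COALESCE(MAX(last_loaded_at), '1970-01-01'::TIMESTAMP_NTZ)
  FROM ops.etl_watermarks          -- hypothetical bookkeeping table
  WHERE table_name = 'ORDERS');

-- Option 2: CHANGES clause (source must be Snowflake, with
-- ALTER TABLE ... SET CHANGE_TRACKING = TRUE already applied)
SELECT *
FROM prod_db.staging.orders
  CHANGES(INFORMATION => DEFAULT)
  AT(TIMESTAMP => DATEADD(hour, -24, CURRENT_TIMESTAMP()));
```

After a successful load, update the watermark row as the last step of the child job, not before, or a mid-job failure silently skips records.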

Job Scheduling: TAC vs External Schedulers

Talend Administration Center (TAC) handles basic scheduling: run job X at 6am daily. It works fine for simple schedules. But if you need dependency management (run job B only after job A succeeds), cross-tool orchestration (run a Talend job after a dbt run completes), or dynamic scheduling, you'll outgrow TAC quickly. Airflow is the most common replacement we see. Define your Talend jobs as BashOperators in an Airflow DAG and let Airflow handle the dependency graph.

Gotchas You'll Hit

A few that come up in nearly every deployment:

- The default JDBC fetch size of 1,000 throttles extraction. Raise it to 10,000+ in the connection's Advanced Settings.
- Sessions that sit idle past Snowflake's 4-hour default timeout get killed unless CLIENT_SESSION_KEEP_ALIVE=true is set.
- Sharing one Snowflake connection across parallel subjobs causes locking issues. Give each branch its own connection.
- JDBC drivers older than 3.13.x mishandle timezones and VARIANT column reads.
- Talend has no native VARIANT support. Stage JSON as VARCHAR or load it with TYPE = 'JSON' and parse downstream.
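The timezone gotcha deserves a code check: before trusting any TIMESTAMP values a job writes, verify what the session timezone actually is.

```sql
-- Snowflake's session default is America/Los_Angeles, not UTC
SHOW PARAMETERS LIKE 'TIMEZONE' IN SESSION;

-- Pin ETL sessions to UTC so TIMESTAMP_LTZ values are unambiguous
ALTER SESSION SET TIMEZONE = 'UTC';
```

You can also set TIMEZONE in the JDBC connection string so every Talend session starts pinned.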

Conclusion

A production Talend-to-Snowflake pipeline isn't complicated, but it has a lot of moving parts that need to be configured correctly. Use the parent-child job pattern for maintainability. Stage files to S3 and COPY INTO for performance. Set the JDBC fetch size to 10,000+. Log errors to a dedicated table. And test your timezone handling before it bites you at 3am. Get these fundamentals right and the pipeline will run reliably for years.

Chandra Sekhar, Senior ETL Engineer

Chandra Sekhar is a Senior ETL Engineer at CelestInfo specializing in Talend, Azure Data Factory, and building high-performance data integration pipelines.

Burning Questions

Simple answers to make things clear.

Should I use Talend's native Snowflake components or a JDBC connection?

Use the Snowflake JDBC driver (3.13.x+) with tDBConnection for more control over connection parameters and bulk loading. The built-in Snowflake components work for simple cases but are limited for production pipelines.

What's the fastest way to load data into Snowflake from Talend?

Stage files to S3 or Azure Blob Storage, then COPY INTO. This is 5-10x faster than row-by-row JDBC inserts. Use tSnowflakeOutputBulk + tSnowflakeBulkExec, or write to S3 and execute COPY INTO via tDBRow.

How should I handle errors in a Talend-to-Snowflake job?

Use tLogCatcher connected to a tDBOutput that writes to a dedicated error table in Snowflake. Log job name, component, error message, and source row context. For COPY INTO rejections, query VALIDATE() after each load.

Why is my Talend job reading from the source database so slowly?

The most common cause is the default JDBC fetch size of 1,000. Increase it to 10,000+ in tDBConnection's Advanced Settings. This reduces round trips between Talend and the source database significantly.

How do I load JSON into a Snowflake VARIANT column?

Load the JSON file directly into a single-column VARIANT table using COPY INTO with TYPE = 'JSON'. Or load JSON as a VARCHAR string into a staging table and parse it downstream with Snowflake's PARSE_JSON() function.
