Telecommunications · 6,000 Employees · 12 Weeks

Redesigning a Telecom's CDR Analytics from Flat Files to a Star Schema on Snowflake

2.1 billion Call Detail Records per month, a Teradata system hitting its license ceiling at $3.2M/year, and 300+ BTEQ scripts nobody understood. Here's how we migrated to Snowflake and cut costs by 85%.

The Challenge

This telecom processes 2.1 billion Call Detail Records every month. Each CDR is a single call event: caller number, recipient number, cell tower, start time, duration, call type (voice, SMS, data), and about 40 additional fields including signal strength, handoff events, and billing codes. Every 15 minutes, the network switch infrastructure dumped these CDRs onto a SAN as fixed-width flat files.

The analytics layer was a Teradata system that had been in place for 8 years. Over that time, the BI team had accumulated 300+ BTEQ scripts that ran in a carefully orchestrated sequence. Some scripts depended on others. Some were duplicates that nobody was sure about. The original architect had left 3 years ago, and documentation was sparse - mostly comments like "DO NOT CHANGE THIS" and "TODO: figure out why this works."

Reports on call quality metrics, network utilization, and customer usage patterns took 4+ hours to run. The regulatory team needed daily churn analysis for number portability compliance, and they were getting it 6 hours late every morning. The network operations center needed near-real-time views of cell tower load to manage capacity, and the best they could get was yesterday's data.

The Teradata license renewal was $3.2M/year. Leadership had been paying it because the migration risk felt too high. But the system was hitting its licensed capacity - adding more data or more users required a license tier upgrade that would push costs past $4.5M. That was the trigger.

What We Built

We designed a proper dimensional model in Snowflake. The core of the schema is fact_cdr - one row per call event, grain at the individual CDR level. The fact table references 6 dimension tables: dim_subscriber (customer demographic and plan information), dim_cell_tower (location, capacity, technology generation), dim_date, dim_time (separate from date for time-of-day analysis), dim_call_type (voice, SMS, data, VoLTE, etc.), and dim_network_element (switches, routers, and their firmware versions).
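To make the star-join pattern concrete, here is a minimal Python sketch of how a fact row resolves surrogate keys against its dimensions. The table contents, key names, and surrogate-key values are invented for illustration; the real model lives in Snowflake and is queried in SQL.

```python
# Illustrative surrogate-key lookup from fact_cdr into two of its dimensions.
# All data below is made up; the point is the star-join shape, not the values.
dim_call_type = {1: "voice", 2: "SMS", 3: "data"}
dim_cell_tower = {101: {"tower": "TWR001", "generation": "5G"}}

fact_cdr = [{"call_type_sk": 1, "tower_sk": 101, "duration_s": 120}]

def denormalize(fact_rows):
    """Resolve surrogate keys to dimension attributes, as a star join would."""
    return [
        {**row,
         "call_type": dim_call_type[row["call_type_sk"]],
         "tower": dim_cell_tower[row["tower_sk"]]["tower"]}
        for row in fact_rows
    ]
```

Keeping the fact table at the individual-CDR grain with integer surrogate keys is what lets 2.1B rows/month join cheaply against small, slowly changing dimensions.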

Talend handles the extraction from CDR file drops. The tricky part was the fixed-width format - each field occupies a specific character position, and different switch vendors use slightly different layouts. We built Talend jobs with configurable column position mappings per vendor, so when a new switch type is added, the team just updates a configuration file rather than modifying ETL code. Data lands in Snowflake's bronze layer as raw, unmodified CDRs.
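The per-vendor layout idea can be sketched in a few lines of Python. The field names and character offsets below are hypothetical stand-ins for the vendor configuration files the Talend jobs read; the real layouts have ~40 fields each.

```python
# Hypothetical vendor layouts: (field_name, start_pos, end_pos) per column.
# Offsets and names are illustrative, not the client's actual switch formats.
VENDOR_LAYOUTS = {
    "vendor_a": [("caller", 0, 10), ("recipient", 10, 20),
                 ("tower_id", 20, 26), ("duration_s", 26, 32)],
    "vendor_b": [("caller", 0, 12), ("recipient", 12, 24),
                 ("tower_id", 24, 32), ("duration_s", 32, 38)],
}

def parse_cdr(line: str, vendor: str) -> dict:
    """Slice one fixed-width CDR line into named fields using the vendor's layout."""
    layout = VENDOR_LAYOUTS[vendor]
    return {name: line[start:end].strip() for name, start, end in layout}
```

Because the parsing logic is driven entirely by the layout table, adding a new switch vendor means adding a new entry, not touching code.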

dbt handles all transformations from bronze to the star schema. We implemented SCD Type 2 on both dim_subscriber and dim_cell_tower using dbt snapshots. When a customer changes their plan or a cell tower gets a hardware upgrade, we preserve the historical record so that reports against old CDRs still join to the dimension values that were active at the time of the call. This is critical for regulatory reporting - you can't retroactively change what a customer's plan was 6 months ago.
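The effect of an SCD Type 2 snapshot can be illustrated with a small Python sketch: when a tracked attribute changes, the current row is closed out and a new current row is opened. This mirrors what a dbt snapshot produces (valid-from/valid-to ranges); the function and column names here are illustrative, not dbt's API.

```python
from datetime import date

def scd2_upsert(history: list[dict], key: str, attrs: dict, as_of: date) -> list[dict]:
    """Close the current row for `key` if its attributes changed, then open a
    new current row. Illustrates the SCD Type 2 pattern a dbt snapshot yields."""
    current = next((r for r in history
                    if r["key"] == key and r["valid_to"] is None), None)
    if current and all(current[k] == v for k, v in attrs.items()):
        return history  # nothing changed: keep the current row open
    if current:
        current["valid_to"] = as_of  # close out the old version
    history.append({"key": key, **attrs, "valid_from": as_of, "valid_to": None})
    return history
```

A CDR from March then joins to whichever dimension row was open in March, which is exactly the property the regulatory reports depend on.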

We didn't try to migrate all 300+ Teradata scripts. We identified the 47 most critical reports (the ones actually used in the last 90 days), rebuilt them against the new star schema, and validated every one against the Teradata output. Row counts had to match. Metric totals had to match within a 0.01% tolerance (accounting for floating-point differences between platforms). Once validated, we decommissioned Teradata in 3 phases over 4 weeks, moving user groups off one department at a time.
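The 0.01% tolerance check is simple to express; a sketch of the comparison we ran per metric total (the function name is ours, not from any library):

```python
import math

def totals_match(teradata_total: float, snowflake_total: float,
                 rel_tol: float = 1e-4) -> bool:
    """True if the two platform totals agree within 0.01% (rel_tol=1e-4),
    absorbing floating-point drift between Teradata and Snowflake."""
    return math.isclose(teradata_total, snowflake_total, rel_tol=rel_tol)
```

Row counts, by contrast, had to match exactly; the tolerance applied only to aggregated numeric metrics.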

Results

  • $3.2M → $480K/yr - Teradata replaced by Snowflake
  • 4 hrs → 90 sec - report execution time
  • 2.1B CDRs/mo - processed with headroom to grow
  • 47 reports - migrated and validated

Tech Stack

Snowflake · dbt · Talend · Teradata (decommissioned) · Power BI

What We Learned

  • Migrate the reports people use, not the reports that exist. Of 300+ BTEQ scripts, only 47 had been run in the last 90 days. Another 80 hadn't been run in over a year. We asked every department head to sign off on their critical reports, and anything unclaimed didn't get migrated. This cut the migration scope by 85% and nobody noticed the missing reports.
  • Fixed-width file parsing is a minefield. Three different switch vendors produced CDRs with different field layouts. One vendor used a 2-byte country code, another used 3 bytes. One encoded timestamps as epoch seconds, another as ISO 8601 strings. We built the Talend jobs with a vendor configuration layer so the parsing logic is driven by metadata, not hardcoded position offsets. This saved us twice during the project when a switch firmware update changed a field width.
  • SCD Type 2 on cell towers was non-negotiable for regulatory compliance. The regulator required the ability to report on network quality metrics tied to the tower configuration that was active at the time of the call. Without SCD Type 2, a tower upgrade would retroactively change historical reports. dbt snapshots made this manageable, but the initial load of historical tower changes from Teradata took 3 days to reconcile.
  • Snowflake's auto-scaling handled the CDR volume without tuning. 2.1 billion rows per month is about 70 million per day, or roughly 48,000 per minute. We sized the Snowflake warehouse at MEDIUM for ETL loads and SMALL for analytics queries. During peak hours, the multi-cluster auto-scaling spins up additional clusters. Total Snowflake cost: $40K/month, versus $267K/month for Teradata. That's an 85% reduction.
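The volume and cost figures above are easy to sanity-check; a back-of-envelope calculation (assuming a 30-day month):

```python
# Back-of-envelope check of the figures quoted above (30-day month assumed).
cdrs_per_month = 2_100_000_000
per_day = cdrs_per_month // 30        # 70 million CDRs per day
per_minute = per_day // (24 * 60)     # roughly 48,600 per minute

teradata_annual = 3_200_000           # $267K/month
snowflake_annual = 480_000            # $40K/month
reduction = 1 - snowflake_annual / teradata_annual  # 0.85, i.e. 85%
```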

Stuck on a Legacy Data Warehouse?

Teradata, Netezza, Oracle Exadata - we've migrated them all to Snowflake. Tell us what you're running and we'll tell you what the migration looks like.

Start a Conversation