How a State Agency Made 10 Years of Open Data Actually Searchable
14 databases, 10 years of traffic and accident data, FOIA requests that took weeks, and a state audit deficiency. Here's how we consolidated everything into Azure Synapse and built a public API that 200+ external users rely on.
The Challenge
This state transportation agency manages roads, bridges, and traffic infrastructure across 42 counties. Over the past decade, they'd accumulated traffic volume counts, accident reports, road maintenance logs, and bridge inspection records - roughly 380 million rows of data. The problem wasn't volume. The problem was that this data lived in 14 different databases spread across 4 technology stacks: Oracle 12c for the traffic management system, SQL Server 2016 for accident reporting, Microsoft Access (yes, really) for bridge inspection records, and a collection of flat CSV files sitting on an FTP server that someone had set up in 2014 and never migrated.
When a FOIA request came in - and they got about 15-20 per month - an analyst had to figure out which system held the relevant data, manually export it, cross-reference with other sources if needed, and package it for release. Average turnaround: 3 to 6 weeks. Some complex requests took longer. The agency's public-facing open data portal offered CSV downloads, but those files hadn't been updated in over 2 years because the manual export process was too time-consuming for the team to prioritize.
The final trigger: a state audit explicitly flagged data accessibility as a major deficiency. The auditors noted that the agency couldn't produce a consistent, authoritative dataset for any of its core functions. That finding went public, and the agency had 6 months to show measurable progress.
What We Built
We consolidated all 14 data sources into Azure Synapse Analytics over 13 weeks. The first 3 weeks were almost entirely data profiling - we needed to understand what was actually in each source, because documentation ranged from "sparse" to "nonexistent." The Access databases had 47 tables between them, and about 30% of those tables were duplicates with slightly different schemas that someone had created years ago and never cleaned up.
Azure Data Factory handled the ingestion. We built 14 separate pipelines, one per source. The Oracle and SQL Server connections were straightforward. The Access databases required a self-hosted integration runtime running on a VM in their data center because ADF can't natively connect to Access files over the network. The FTP flat files needed custom parsing logic because the CSV formats had changed 3 times over 10 years with no documentation of which format applied to which files - we identified the format based on column count and header patterns.
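The format-detection step described above can be sketched in a few lines. This is a minimal illustration, not the agency's actual parser: the three layout signatures (column counts and leading header names) are invented for the example, since the real schemas aren't documented here.

```python
import csv
import io

# Hypothetical signatures for the three historical CSV layouts.
# Real column counts and header names are assumptions for illustration.
FORMATS = {
    "v1_2014": {"columns": 6, "first_header": "station_id"},
    "v2_2017": {"columns": 8, "first_header": "STATION"},
    "v3_2021": {"columns": 11, "first_header": "site_ref"},
}

def detect_format(raw_text: str) -> str:
    """Guess which historical layout a file uses from its header row."""
    reader = csv.reader(io.StringIO(raw_text))
    header = next(reader)
    for name, sig in FORMATS.items():
        if (len(header) == sig["columns"]
                and header[0].strip().lower() == sig["first_header"].lower()):
            return name
    raise ValueError(
        f"Unrecognized CSV layout: {len(header)} columns, "
        f"header starts with {header[0]!r}"
    )
```

In practice this ran as a pre-ingestion step, so each file could be routed to the pipeline branch that knew how to parse its era's layout.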
The real investment was the data quality framework. We built 45 business rules that validate every record before it lands in the serving layer. Examples: accident records must have a valid county code, traffic counts can't exceed the physical capacity of the road segment, date fields can't be in the future. Records that fail validation land in a quarantine table for analyst review. In the first run, about 8% of historical records failed at least one rule - mostly bad county codes in the older data.
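The three example rules above can be expressed as a small validate-and-quarantine sketch. Field names (`county_code`, `traffic_count`, `record_date`, `segment_id`) are assumptions for illustration; the production framework had 45 such rules, not three.

```python
from datetime import date

# 42 counties in the state; assumed to be coded 1-42 for this sketch.
VALID_COUNTY_CODES = set(range(1, 43))

def check_record(rec: dict, segment_capacity: float) -> list[str]:
    """Return the names of every rule the record fails (empty = clean)."""
    failures = []
    if rec.get("county_code") not in VALID_COUNTY_CODES:
        failures.append("invalid_county_code")
    if rec.get("traffic_count", 0) > segment_capacity:
        failures.append("count_exceeds_capacity")
    if rec.get("record_date") and rec["record_date"] > date.today():
        failures.append("future_date")
    return failures

def route(records: list[dict], capacities: dict) -> tuple[list, list]:
    """Split records into the serving layer and a quarantine list."""
    clean, quarantined = [], []
    for rec in records:
        cap = capacities.get(rec.get("segment_id"), float("inf"))
        fails = check_record(rec, cap)
        (quarantined if fails else clean).append((rec, fails))
    return clean, quarantined
```

Keeping the failure reasons attached to each quarantined record is what makes analyst review fast: the table shows *why* a row was held back, not just that it was.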
For external access, we deployed Azure API Management on top of Synapse Serverless SQL pools. Researchers and developers can query current data through a REST API with proper authentication instead of downloading 2GB CSV files. We also replaced the public open data portal's static downloads with live API-backed endpoints. Internally, analysts got Power BI dashboards connected to Synapse, with role-based access so county-level staff only see their own county's data.
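From a researcher's side, a call against an API like this might look as follows. The base URL, dataset names, and query parameters here are hypothetical placeholders; the `Ocp-Apim-Subscription-Key` header is the standard way Azure API Management passes a subscription key, assuming key-based auth is configured.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real agency URL is not shown in this article.
BASE_URL = "https://api.transport.example.gov/v1"

def build_query_url(dataset: str, county: int, year: int, page_size: int = 500) -> str:
    """Construct a paginated query URL for one dataset, county, and year."""
    params = urlencode({"county": county, "year": year, "limit": page_size})
    return f"{BASE_URL}/{dataset}?{params}"

def auth_headers(subscription_key: str) -> dict:
    """APIM subscription-key header (assumes key-based auth is enabled)."""
    return {"Ocp-Apim-Subscription-Key": subscription_key}
```

A few hundred rows at a time over HTTPS replaces the old workflow of downloading a 2GB CSV and filtering it locally.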
FOIA requests now pull from the same curated dataset. An analyst searches the unified platform, applies the relevant filters, and exports. What used to take weeks takes hours.
Results
- FOIA turnaround dropped from 3-6 weeks to hours, with every request pulling from the same curated dataset.
- 200+ researchers, journalists, and app developers registered for the public API within 3 months of launch.
- All 14 sources were consolidated in 13 weeks, and 45 automated quality rules flagged roughly 8% of historical records for correction.
- The open data portal now serves live, API-backed data instead of 2-year-old static CSV downloads.
Tech Stack
- Azure Synapse Analytics (Serverless SQL pools for the serving layer)
- Azure Data Factory, with a self-hosted integration runtime for on-premises sources
- Azure API Management for the public REST API
- Power BI with role-based access for internal dashboards
- Source systems: Oracle 12c, SQL Server 2016, Microsoft Access, FTP flat files
What We Learned
- Data profiling is the real project. We budgeted 2 weeks for profiling and spent 3. In government work, data sources have been touched by dozens of people over decades. The Access databases alone had 47 tables, and understanding which ones were canonical versus abandoned copies was critical. Don't skip this step.
- A self-hosted integration runtime solves the "weird source" problem. ADF's cloud connectors cover most scenarios, but when you're pulling from Access databases on a shared network drive, you need a self-hosted IR. It took half a day to set up and saved us from building a completely separate ingestion path for those sources.
- Data quality rules surface problems you didn't know existed. That 8% failure rate on historical records wasn't a surprise to the senior analysts - they'd been manually correcting these issues for years during FOIA exports. Codifying the rules into an automated framework meant the corrections happened once, not every time someone needed the data.
- Government API adoption is slow but real. We launched the public API expecting maybe a handful of users in the first quarter. Within 3 months, 200+ researchers, journalists, and app developers had registered. Turns out, plenty of people wanted this data - they just couldn't deal with 2-year-old CSV downloads.
Sitting on Data Nobody Can Find?
Legacy databases, scattered files, manual exports - we've untangled data messes like this before. Tell us what you're working with.
Start a Conversation