From Reactive to Predictive: How a Steel Manufacturer Saved $3.1M in Unplanned Downtime

Steel Manufacturing · 3,200 Employees · 16 Weeks

The plant had 847 sensors across 12 production lines, generating roughly 2TB of data every day. All of it was being written to local historian servers - and none of it was being analyzed. The historians were glorified log files. When a bearing failed on the hot rolling mill, nobody knew until the line stopped. Each unplanned stoppage cost an average of $180K in lost production time, scrap material, and emergency repair labor.


The maintenance team's "predictive maintenance system" was a literal whiteboard in the break room labeled "Machines That Sound Funny." We're not exaggerating. A shift supervisor would walk the floor, listen to equipment, and jot down notes. Maintenance was 90% reactive: something breaks, someone fixes it. The plant manager knew there was gold in the sensor data, but the OT (operational technology) team didn't have the tools or the skills to build analytics on top of historian data, and the IT team didn't have access to the production network.

We deployed AWS IoT Core as the ingestion layer, connecting all 847 sensors via MQTT. Each sensor publishes readings every 5 seconds - vibration amplitude, temperature, pressure, current draw, RPM - and IoT Core routes messages through IoT Rules into Kinesis Data Streams. We partitioned Kinesis by production line, with each of the 12 lines getting its own dedicated set of shards to avoid hot-partition issues.
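The partitioning scheme above can be sketched as a pure function. This is an illustrative sketch, not the production code: the function names and the choice of hashing the sensor ID to spread load within a line are assumptions, but the idea - a partition key that keeps lines separable while avoiding a hot shard - matches the design described.

```python
import hashlib
import json


def partition_key(line_id: str, sensor_id: str) -> str:
    # Prefix with the line ID so a line's traffic stays identifiable,
    # then hash the sensor ID to spread that line's 5-second readings
    # across its shards instead of piling onto one hot partition.
    digest = hashlib.md5(sensor_id.encode("utf-8")).hexdigest()[:8]
    return f"{line_id}-{digest}"


def to_kinesis_record(line_id: str, sensor_id: str, reading: dict) -> dict:
    # Shape expected by a Kinesis PutRecord call (boto3 or an IoT Rule action).
    return {
        "PartitionKey": partition_key(line_id, sensor_id),
        "Data": json.dumps({"line_id": line_id, "sensor_id": sensor_id, **reading}),
    }
```

The key property is determinism: the same sensor always maps to the same shard, so per-sensor ordering is preserved while load stays spread.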


From Kinesis, a Kinesis Data Firehose delivery stream writes compressed Parquet files to S3, partitioned by line_id/date/hour. This works out to roughly 167GB per line per day (about 7GB per hourly partition) in a format that's immediately queryable. On top of S3, we built a Databricks lakehouse using Delta Lake for ACID transactions - critical when you're doing concurrent reads (dashboards) and writes (streaming ingestion) on the same tables.
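The line_id/date/hour layout is the standard Hive-style partition path, which is what lets query engines prune partitions. A minimal sketch of the key construction (the function name and UTC normalization are our assumptions; the actual prefixes are configured in Firehose):

```python
from datetime import datetime, timezone


def s3_partition_prefix(line_id: str, ts: datetime) -> str:
    """Build a Hive-style S3 prefix: line_id=.../date=.../hour=...

    Timestamps are normalized to UTC so an hour partition means the
    same thing regardless of where the event was produced.
    """
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    ts = ts.astimezone(timezone.utc)
    return f"line_id={line_id}/date={ts:%Y-%m-%d}/hour={ts:%H}/"
```

Engines like Athena or Spark can then skip every prefix outside a query's line and time range instead of scanning all ~2TB/day.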


The ML piece was where the real value came from. We trained anomaly detection models in Databricks ML on 18 months of historical sensor data, correlated with maintenance logs. The models flag three specific failure precursors: bearing vibration pattern shifts (detectable 24-72 hours before failure), temperature drift on hydraulic systems (4-12 hours), and pressure anomalies in pneumatic actuators (8-48 hours). Alerts push to a custom React dashboard displayed on 55-inch shop floor monitors and simultaneously to maintenance team pagers via SNS.
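The case study doesn't show the model internals, so here is a deliberately simple stand-in for the "vibration pattern shift" precursor: a rolling z-score detector that flags readings drifting beyond a threshold of the recent baseline. The class name, window size, and threshold are assumptions for illustration - the production models were trained in Databricks ML on 18 months of labeled history.

```python
import math
from collections import deque


class RollingZScore:
    """Flag a reading that sits more than `threshold` standard deviations
    from the mean of a sliding window - a toy precursor detector, not
    the trained models described above."""

    def __init__(self, window: int = 720, threshold: float = 3.0, min_baseline: int = 30):
        self.buf = deque(maxlen=window)   # e.g. 720 samples = 1h at 5s cadence
        self.threshold = threshold
        self.min_baseline = min_baseline

    def update(self, value: float) -> bool:
        anomalous = False
        if len(self.buf) >= self.min_baseline:
            mean = sum(self.buf) / len(self.buf)
            var = sum((x - mean) ** 2 for x in self.buf) / len(self.buf)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                anomalous = True
        self.buf.append(value)
        return anomalous
```

In the real system a flag like this would fan out via SNS to the dashboard and pagers; here it just returns True.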

$3.1M/yr Saved in unplanned downtime
4-72hr Advance failure warning
847 → 1 Sensors unified on one platform
90% → 35% Share of maintenance that is reactive
AWS IoT Core · Kinesis Data Streams · S3 · Databricks (Delta Lake + ML) · SNS · React

The OT/IT divide was the hardest problem we solved - and it wasn't a technology problem. The OT team controlled the production network and didn't trust IT (or us) to touch anything. We had to set up an isolated DMZ with a unidirectional data diode before they'd approve sensor data leaving the production network. Plan for 3-4 weeks of security review and network architecture work before you write a single line of code on a manufacturing project.


On the technical side, MQTT payloads from industrial sensors aren't standardized. We had 6 different sensor manufacturers, each with their own payload format and timestamp precision. Some sent epoch seconds, others sent ISO 8601 with timezone offsets, and two of them sent local time with no timezone at all. We built a normalization layer in IoT Core rules that transforms every payload into a common schema before it hits Kinesis. Don't skip this step - your downstream models will thank you.
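The messiest part of that normalization was the timestamps. A sketch of the transform, shown here as plain Python (in practice this logic would live in the IoT Rule or a Lambda it invokes; the function name and the `assume_tz` parameter are ours):

```python
from datetime import datetime, timezone


def normalize_timestamp(raw, assume_tz=timezone.utc) -> str:
    """Coerce the three timestamp styles we saw into UTC ISO 8601.

    - int/float          -> treated as epoch seconds
    - ISO 8601 w/ offset -> converted to UTC
    - naive local time   -> tagged with `assume_tz` (in production, the
      plant's timezone), then converted to UTC
    """
    if isinstance(raw, (int, float)):
        dt = datetime.fromtimestamp(raw, tz=timezone.utc)
    else:
        dt = datetime.fromisoformat(raw)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=assume_tz)
        dt = dt.astimezone(timezone.utc)
    return dt.isoformat()
```

Every record downstream of this carries one unambiguous UTC timestamp, which is what makes cross-sensor correlation (and model training) possible.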


The whiteboard is still there, by the way. But now it says "Days since last unplanned stoppage" instead of "Machines that sound funny." Last we checked, they were at 47 days.

Sitting on Sensor Data Nobody's Using?

If your historian servers are full of data and your maintenance team is still guessing, we should talk. We've done this across steel, automotive, and food manufacturing.

Start a Conversation