ClimateTech: Real-Time Carbon Data Pipeline on GCP Handling 40M Events/Day
A carbon accounting platform was ingesting IoT sensor data from 12,000 industrial facilities via batch CSV uploads processed nightly. Data latency meant enterprise customers could not act on emissions data until 24 hours after the fact. We redesigned the pipeline as a real-time streaming architecture on GCP, reducing data latency from 24 hours to under 90 seconds.
The Challenge
The nightly batch pipeline was a scheduled Python script that downloaded CSVs from S3, processed them in pandas, and loaded the results into BigQuery. With 12,000 facilities sending readings every 15 minutes, the job processed roughly 48M rows per night - manageable in batch, but impossible to make real-time without a proper streaming architecture. Three enterprise customers had specifically cited 'near-real-time carbon monitoring' as a requirement for contract renewal.
The Approach
We rebuilt the ingestion layer using Pub/Sub for message ingestion, Dataflow for stream processing, and BigQuery streaming inserts for storage. The existing batch pipeline ran in parallel during a 4-week transition period. Facility sensors needed no changes - only the destination endpoint changed.
The Implementation
Pub/Sub ingestion layer
We created a Pub/Sub topic per data type (energy, emissions, process-data) with 7-day message retention. Each facility's IoT gateway was reconfigured to POST sensor readings to a lightweight Cloud Run receiver that validated the payload and published to Pub/Sub. Backpressure and retries were handled by Pub/Sub - no message loss on downstream outages.
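A minimal sketch of the receiver's validation step. The field names (facility_id, sensor_type, value, ts) are an assumed schema, not the production one; the real Cloud Run service wraps this in an HTTP handler and publishes accepted payloads with the google-cloud-pubsub client, which is omitted here.

```python
# Validation step of the Cloud Run receiver (hypothetical schema).
# Accepted payloads would then be published to the per-type Pub/Sub topic.
import json
from datetime import datetime

REQUIRED_FIELDS = {"facility_id", "sensor_type", "value", "ts"}
SENSOR_TYPES = {"energy", "emissions", "process-data"}  # one topic per type

def validate_reading(raw: bytes):
    """Return (ok, payload_or_error) for one sensor reading."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid JSON"
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    if payload["sensor_type"] not in SENSOR_TYPES:
        return False, f"unknown sensor_type: {payload['sensor_type']}"
    if not isinstance(payload["value"], (int, float)):
        return False, "value must be numeric"
    try:
        datetime.fromisoformat(payload["ts"])
    except (TypeError, ValueError):
        return False, "ts must be ISO-8601"
    return True, payload
```

Rejecting malformed payloads at the edge keeps the Dataflow pipeline's schema validation cheap and ensures Pub/Sub only buffers messages worth processing.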
Dataflow streaming pipeline
We wrote a Dataflow pipeline using the Apache Beam Python SDK: ingest from Pub/Sub, validate the schema, apply emissions factor calculations (configurable per facility type), deduplicate within a 5-minute window, and write to BigQuery via streaming inserts. The pipeline handled 40M events/day on 4 n1-standard-4 workers, auto-scaling up to 12 during peak hours.
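The two transforms in the middle of that pipeline can be sketched as plain functions, so the logic is visible without the Beam runner. In production each would run as a DoFn between ReadFromPubSub and WriteToBigQuery; the emission factors below are illustrative values, not the configured production ones.

```python
# Core pipeline transforms as plain functions (in Dataflow these run as
# Beam DoFns). Emission factors and facility types are illustrative.
from collections import OrderedDict

EMISSION_FACTORS = {"cement": 0.82, "steel": 1.4, "chemical": 0.6}  # kg CO2e/unit (assumed)

def apply_emissions_factor(event: dict, factors=EMISSION_FACTORS) -> dict:
    """Attach a computed kg-CO2e figure, keyed by facility type."""
    factor = factors[event["facility_type"]]
    return {**event, "co2e_kg": event["value"] * factor}

class WindowDeduper:
    """Drop event IDs already seen within the last `window_s` seconds,
    mirroring the pipeline's 5-minute dedup window."""
    def __init__(self, window_s: int = 300):
        self.window_s = window_s
        self.seen = OrderedDict()  # event_id -> arrival time (epoch seconds)

    def is_duplicate(self, event_id: str, ts: float) -> bool:
        # Evict entries older than the window, then check membership.
        while self.seen and next(iter(self.seen.values())) < ts - self.window_s:
            self.seen.popitem(last=False)
        if event_id in self.seen:
            return True
        self.seen[event_id] = ts
        return False
```

The windowed dedup matters because IoT gateways retry on timeout, so the same reading can arrive twice; deduplicating before the BigQuery write keeps the aggregates honest.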
BigQuery streaming and materialised views
Streaming inserts to BigQuery provided sub-second write latency. We built materialised views for the most common dashboard queries (hourly aggregates, facility comparisons, trend analysis) so that real-time data was available to the dashboard without full-table scans. Dashboard query time dropped from 45 seconds to 1.2 seconds.
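As an illustration, the hourly-aggregates view might look like the DDL below. The dataset, table, and column names are hypothetical; the facility-comparison and trend-analysis views follow the same pattern, and executing the statement is a single google.cloud.bigquery.Client().query(...) call.

```python
# Hypothetical DDL for the hourly-aggregates materialised view.
# Names are illustrative, not the production schema.
HOURLY_AGG_VIEW = """
CREATE MATERIALIZED VIEW IF NOT EXISTS carbon.hourly_emissions
AS
SELECT
  facility_id,
  TIMESTAMP_TRUNC(event_ts, HOUR) AS hour,
  SUM(co2e_kg) AS total_co2e_kg,
  COUNT(*) AS reading_count
FROM carbon.sensor_events
GROUP BY facility_id, hour
"""
```

BigQuery incrementally maintains the view as streamed rows land, so dashboard queries hit the pre-aggregated result instead of scanning the raw events table.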
Data quality monitoring and alerting
We built a data quality layer in Dataflow that flagged anomalous readings (sensor dropout, implausible values, duplicate facility IDs) to a Pub/Sub dead-letter topic. A monitoring service consumed the dead-letter topic and sent Slack alerts with facility ID, sensor type, and last-seen timestamp. Previously, data quality issues were discovered days later during batch validation.
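A sketch of the check that routes events to the dead-letter topic. The plausible-value range and dropout threshold are assumed numbers; in the pipeline this runs as a Dataflow step whose flagged output is published to Pub/Sub for the monitoring service to turn into Slack alerts.

```python
# Data quality checks feeding the dead-letter topic (thresholds assumed).
PLAUSIBLE_RANGE = (0.0, 10_000.0)   # assumed sensor bounds
DROPOUT_S = 3600                    # flag facilities silent for an hour

def quality_check(event: dict, last_seen: dict, now_s: float):
    """Return a flag reason for a bad event, or None if it passes.

    `last_seen` maps facility_id -> epoch seconds of the previous reading,
    which is how sensor dropout is detected.
    """
    lo, hi = PLAUSIBLE_RANGE
    if not (lo <= event["value"] <= hi):
        return "implausible value"
    prev = last_seen.get(event["facility_id"])
    last_seen[event["facility_id"]] = now_s
    if prev is not None and now_s - prev > DROPOUT_S:
        return "sensor dropout"
    return None
```

Flagged events carry the facility ID and last-seen timestamp through the dead-letter topic, which is exactly what the Slack alert surfaces.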
Key Takeaways
- Pub/Sub as the ingestion buffer is the correct architecture for IoT data - it decouples ingestion from processing and handles bursty sensor data without dropping messages
- Dataflow auto-scaling means you pay for what you use - the same pipeline handles 10M events/day and 40M events/day without configuration changes
- BigQuery materialised views are the right optimisation for real-time dashboards - they trade storage for query latency with zero application changes
- Parallel running of old and new pipelines during transition is non-optional for data integrity - the 4-week overlap gave us confidence before switching customers
Facing Similar Challenges?
Book a free 30-minute audit and I will tell you what I see.