ClimateTech: Real-Time Carbon Data Pipeline on GCP Handling 40M Events/Day
A carbon accounting platform was ingesting IoT sensor data from 12,000 industrial facilities via batch CSV uploads processed nightly. Data latency meant enterprise customers could not act on emissions data until 24 hours after the fact. We redesigned the pipeline as a real-time streaming architecture on GCP, reducing data latency from 24 hours to under 90 seconds.
The Challenge
The nightly batch pipeline was a scheduled Python script that downloaded CSVs from S3, processed them in pandas, and loaded the results into BigQuery. With 12,000 facilities sending readings every 15 minutes, the job processed roughly 48M rows per night - manageable in batch, but impossible to make real-time without a proper streaming architecture. Three enterprise customers had specifically cited 'near-real-time carbon monitoring' as a requirement for contract renewal.
The Approach
We rebuilt the ingestion layer using Pub/Sub for message ingestion, Dataflow for stream processing, and BigQuery streaming inserts for storage. The existing batch pipeline ran in parallel during a 4-week transition period. Facility sensors needed no changes - only the destination endpoint changed.
The Implementation
Pub/Sub ingestion layer
We created a Pub/Sub topic per data type (energy, emissions, process-data) with 7-day message retention. Each facility's IoT gateway was reconfigured to POST sensor readings to a lightweight Cloud Run receiver that validated the payload and published to Pub/Sub. Backpressure and retries were handled by Pub/Sub - no message loss on downstream outages.
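A minimal sketch of the receiver's validation step. The field names (facility_id, sensor_type, value, ts) are an assumed schema, not the production one; the real Cloud Run service wraps this in an HTTP handler and publishes accepted payloads with the google-cloud-pubsub client, which is omitted here.

```python
# Validation step of the Cloud Run receiver (hypothetical schema).
# Accepted payloads would then be published to the per-type Pub/Sub topic.
import json
from datetime import datetime

REQUIRED_FIELDS = {"facility_id", "sensor_type", "value", "ts"}
SENSOR_TYPES = {"energy", "emissions", "process-data"}  # one topic per type

def validate_reading(raw: bytes):
    """Return (ok, payload_or_error) for one sensor reading."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid JSON"
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    if payload["sensor_type"] not in SENSOR_TYPES:
        return False, f"unknown sensor_type: {payload['sensor_type']}"
    if not isinstance(payload["value"], (int, float)):
        return False, "value must be numeric"
    try:
        datetime.fromisoformat(payload["ts"])
    except (TypeError, ValueError):
        return False, "ts must be ISO-8601"
    return True, payload
```

Rejecting malformed payloads at the edge keeps the Dataflow pipeline's schema validation cheap and ensures Pub/Sub only buffers messages worth processing.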
Dataflow streaming pipeline
We wrote a Dataflow pipeline using the Apache Beam Python SDK: ingest from Pub/Sub, validate the schema, apply emissions factor calculations (configurable per facility type), deduplicate within a 5-minute window, and write to BigQuery via streaming inserts. The pipeline handled 40M events/day on 4 n1-standard-4 workers, auto-scaling up to 12 during peak hours.
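The two transforms in the middle of that pipeline can be sketched as plain functions, so the logic is visible without the Beam runner. In production each would run as a DoFn between ReadFromPubSub and WriteToBigQuery; the emission factors below are illustrative values, not the configured production ones.

```python
# Core pipeline transforms as plain functions (in Dataflow these run as
# Beam DoFns). Emission factors and facility types are illustrative.
from collections import OrderedDict

EMISSION_FACTORS = {"cement": 0.82, "steel": 1.4, "chemical": 0.6}  # kg CO2e/unit (assumed)

def apply_emissions_factor(event: dict, factors=EMISSION_FACTORS) -> dict:
    """Attach a computed kg-CO2e figure, keyed by facility type."""
    factor = factors[event["facility_type"]]
    return {**event, "co2e_kg": event["value"] * factor}

class WindowDeduper:
    """Drop event IDs already seen within the last `window_s` seconds,
    mirroring the pipeline's 5-minute dedup window."""
    def __init__(self, window_s: int = 300):
        self.window_s = window_s
        self.seen = OrderedDict()  # event_id -> arrival time (epoch seconds)

    def is_duplicate(self, event_id: str, ts: float) -> bool:
        # Evict entries older than the window, then check membership.
        while self.seen and next(iter(self.seen.values())) < ts - self.window_s:
            self.seen.popitem(last=False)
        if event_id in self.seen:
            return True
        self.seen[event_id] = ts
        return False
```

The windowed dedup matters because IoT gateways retry on timeout, so the same reading can arrive twice; deduplicating before the BigQuery write keeps the aggregates honest.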
BigQuery streaming and materialised views
Streaming inserts to BigQuery provided sub-second write latency. We built materialised views for the most common dashboard queries (hourly aggregates, facility comparisons, trend analysis) so that real-time data was available to the dashboard without full-table scans. Dashboard query time dropped from 45 seconds to 1.2 seconds.
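As an illustration, the hourly-aggregates view might look like the DDL below. The dataset, table, and column names are hypothetical; the facility-comparison and trend-analysis views follow the same pattern, and executing the statement is a single google.cloud.bigquery.Client().query(...) call.

```python
# Hypothetical DDL for the hourly-aggregates materialised view.
# Names are illustrative, not the production schema.
HOURLY_AGG_VIEW = """
CREATE MATERIALIZED VIEW IF NOT EXISTS carbon.hourly_emissions
AS
SELECT
  facility_id,
  TIMESTAMP_TRUNC(event_ts, HOUR) AS hour,
  SUM(co2e_kg) AS total_co2e_kg,
  COUNT(*) AS reading_count
FROM carbon.sensor_events
GROUP BY facility_id, hour
"""
```

BigQuery incrementally maintains the view as streamed rows land, so dashboard queries hit the pre-aggregated result instead of scanning the raw events table.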
Data quality monitoring and alerting
We built a data quality layer in Dataflow that flagged anomalous readings (sensor dropout, implausible values, duplicate facility IDs) to a Pub/Sub dead-letter topic. A monitoring service consumed the dead-letter topic and sent Slack alerts with facility ID, sensor type, and last-seen timestamp. Previously, data quality issues were discovered days later during batch validation.
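A sketch of the check that routes events to the dead-letter topic. The plausible-value range and dropout threshold are assumed numbers; in the pipeline this runs as a Dataflow step whose flagged output is published to Pub/Sub for the monitoring service to turn into Slack alerts.

```python
# Data quality checks feeding the dead-letter topic (thresholds assumed).
PLAUSIBLE_RANGE = (0.0, 10_000.0)   # assumed sensor bounds
DROPOUT_S = 3600                    # flag facilities silent for an hour

def quality_check(event: dict, last_seen: dict, now_s: float):
    """Return a flag reason for a bad event, or None if it passes.

    `last_seen` maps facility_id -> epoch seconds of the previous reading,
    which is how sensor dropout is detected.
    """
    lo, hi = PLAUSIBLE_RANGE
    if not (lo <= event["value"] <= hi):
        return "implausible value"
    prev = last_seen.get(event["facility_id"])
    last_seen[event["facility_id"]] = now_s
    if prev is not None and now_s - prev > DROPOUT_S:
        return "sensor dropout"
    return None
```

Flagged events carry the facility ID and last-seen timestamp through the dead-letter topic, which is exactly what the Slack alert surfaces.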
Key Takeaways
- Pub/Sub as the ingestion buffer is the correct architecture for IoT data - it decouples ingestion from processing and handles bursty sensor data without dropping messages
- Dataflow auto-scaling means you pay for what you use - the same pipeline handles 10M events/day and 40M events/day without configuration changes
- BigQuery materialised views are the right optimisation for real-time dashboards - they trade storage for query latency with zero application changes
- Parallel running of old and new pipelines during transition is non-optional for data integrity - the 4-week overlap gave us confidence before switching customers
Facing Similar Challenges?
Book a free 30-minute audit and I will tell you what I see.