EdTech Platform: Zero Visibility to Full Observability in 3 Weeks
A 400K-user online learning platform had no application monitoring beyond uptime checks. They discovered a critical performance regression when student complaints spiked - three days after it started. We built a full observability stack in three weeks.
The Challenge
The engineering team was flying blind. No structured logs, no distributed tracing, no SLOs. When the platform slowed down during live sessions, the team had no way to identify whether the problem was the video streaming layer, the Rails API, the PostgreSQL database, or the CDN. Incidents were diagnosed by guessing and checking - average MTTR was over 3 days. The team had grown to 20 engineers but still operated without the tooling needed to own production responsibly.
The Approach
We ran two parallel tracks: instrumenting the application with structured logging and tracing, and building the dashboards and alerts that turn raw signals into actionable information. We deliberately scoped out full APM - the goal was a stack that was opinionated, fast to ship, and right-sized for the team's maturity.
The Implementation
Structured logging with OpenTelemetry
We instrumented the Rails API with the OpenTelemetry Ruby SDK, replacing ad-hoc Rails.logger calls with structured JSON logs carrying request_id, user_id, duration_ms, and error context. Logs shipped to Loki via the OpenTelemetry Collector.
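To make the log shape concrete, here is a minimal stand-in written against the Ruby standard library. The field names (request_id, user_id, duration_ms) come from the case study; the StructuredLogger class and its request helper are illustrative, not the production code, which went through the OpenTelemetry Ruby SDK and Collector.

```ruby
require "json"
require "logger"
require "time"
require "securerandom"

# Minimal structured logger illustrating the one-JSON-object-per-line
# log shape described above. Field names match the case study; the
# class itself is a hypothetical sketch, not the real instrumentation.
class StructuredLogger
  def initialize(io = $stdout)
    @logger = Logger.new(io)
    # Emit each record as a single JSON line, ready for Loki ingestion.
    @logger.formatter = ->(severity, time, _progname, msg) {
      JSON.generate({ level: severity, ts: time.utc.iso8601 }.merge(msg)) + "\n"
    }
  end

  def request(user_id:, duration_ms:, error: nil)
    payload = {
      request_id: SecureRandom.uuid, # correlates all lines for one request
      user_id: user_id,
      duration_ms: duration_ms
    }
    payload[:error] = error if error
    @logger.info(payload)
  end
end

# Usage: one structured line per request instead of free-form log text.
StructuredLogger.new.request(user_id: 42, duration_ms: 187)
```

Because every line is parseable JSON with stable keys, queries like "P95 duration_ms for user X" become trivial, which is exactly what free-form Rails.logger strings cannot support.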
Distributed tracing across services
We propagated trace context from the Rails API to the Sidekiq background job workers and the video transcoding service, using Tempo as the trace backend. P95 and P99 latency became visible per endpoint for the first time.
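The mechanism underneath this is the W3C traceparent header: as long as a Sidekiq job carries the parent request's trace_id, Tempo can stitch API request and background work into one trace. In production this is handled by the OpenTelemetry instrumentation gems; the helpers below are only a sketch of the header format, assuming the standard version-traceid-spanid-flags layout.

```ruby
require "securerandom"

# Sketch of W3C traceparent propagation - the mechanism OpenTelemetry
# uses to tie a Sidekiq job's spans back to the originating API request.
# These helpers are illustrative; the opentelemetry-instrumentation gems
# do this automatically in a real deployment.
def new_traceparent
  trace_id = SecureRandom.hex(16) # 32 hex chars, shared by the whole trace
  span_id  = SecureRandom.hex(8)  # 16 hex chars, unique per span
  "00-#{trace_id}-#{span_id}-01"  # version-traceid-spanid-flags
end

def child_traceparent(parent)
  _version, trace_id, _parent_span, flags = parent.split("-")
  # A child span keeps the trace_id but gets a fresh span_id, so the
  # backend can link the async job to the request that enqueued it.
  "00-#{trace_id}-#{SecureRandom.hex(8)}-#{flags}"
end

# Usage: the API stores the header in the job payload; the worker
# continues the same trace when it picks the job up.
job_args = { "video_id" => 123, "traceparent" => new_traceparent }
worker_header = child_traceparent(job_args["traceparent"])
```

Without this propagation step, the API request and the Sidekiq job show up as two unrelated traces, which is why async bugs were previously undiagnosable.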
SLO dashboards and alerting
We defined three SLOs: API availability (99.5%), session start latency (P95 < 2s), and video load time (P95 < 4s). Grafana dashboards surfaced error budgets. PagerDuty alerts fired when the error budget burn rate exceeded 5×.
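The burn-rate arithmetic behind that paging rule is simple: divide the observed error rate by the error rate the SLO permits. The numbers in the example below are illustrative, not the platform's real traffic.

```ruby
# Error-budget burn rate: observed error rate divided by the rate the
# SLO allows. At a sustained 5x burn, a 99.5% availability SLO's
# 30-day budget is exhausted in roughly six days.
def burn_rate(total_requests, failed_requests, slo_target)
  allowed_error_rate  = 1.0 - slo_target # e.g. 0.005 for a 99.5% SLO
  observed_error_rate = failed_requests.to_f / total_requests
  observed_error_rate / allowed_error_rate
end

# Hypothetical window: 100k requests with 2,500 failures is a 2.5%
# error rate - five times the 0.5% the SLO allows, i.e. the paging
# threshold described above.
burn_rate(100_000, 2_500, 0.995).round(2) # => 5.0
```

Alerting on burn rate rather than raw error count is what cut the noise: a brief blip that barely touches the budget never pages anyone, while a fast burn does.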
Runbook library
For the five most common alert types, we wrote structured runbooks linked directly from Grafana panel titles. Each runbook covered what the alert means, likely causes, diagnostic queries, and resolution steps. First responders stopped guessing.
Key Takeaways
- Structured logging is the highest-leverage first step - unstructured logs do not scale past about five engineers
- Trace context across async workers (Sidekiq) was the single change that made the hardest bugs diagnosable
- SLO-based alerting eliminates noise - the team went from 40+ alerts/week to 6 actionable alerts/week
- Runbooks linked from alerts cut time-to-resolution more than any tooling change
Facing Similar Challenges?
Book a free 30-minute audit and I will tell you what I see.