EdTech Platform: Zero Visibility to Full Observability in 3 Weeks
A 400K-user online learning platform had no application monitoring beyond uptime checks. They discovered a critical performance regression when student complaints spiked - three days after it started. We built a full observability stack in three weeks.
The Challenge
The engineering team was flying blind. No structured logs, no distributed tracing, no SLOs. When the platform slowed down during live sessions, the team had no way to identify whether the problem was the video streaming layer, the Rails API, the PostgreSQL database, or the CDN. Incidents were diagnosed by guessing and checking - average MTTR was over 3 days. The team had grown to 20 engineers but still operated without the tooling needed to own production responsibly.
The Approach
We ran two parallel tracks: instrumenting the application with structured logging and tracing, and building the dashboards and alerts that turn raw signals into actionable information. We deliberately scoped out full APM - the goal was a stack that was opinionated, fast to ship, and right-sized for the team's maturity.
The Implementation
Structured logging with OpenTelemetry
We instrumented the Rails API with the OpenTelemetry Ruby SDK, replacing ad-hoc Rails.logger calls with structured JSON logs carrying request_id, user_id, duration_ms, and error context. Logs shipped to Loki via the OpenTelemetry Collector.
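To make the log shape concrete, here is a minimal stand-in written against the Ruby standard library. The field names (request_id, user_id, duration_ms) come from the case study; the StructuredLogger class and its request helper are illustrative, not the production code, which went through the OpenTelemetry Ruby SDK and Collector.

```ruby
require "json"
require "logger"
require "time"
require "securerandom"

# Minimal structured logger illustrating the one-JSON-object-per-line
# log shape described above. Field names match the case study; the
# class itself is a hypothetical sketch, not the real instrumentation.
class StructuredLogger
  def initialize(io = $stdout)
    @logger = Logger.new(io)
    # Emit each record as a single JSON line, ready for Loki ingestion.
    @logger.formatter = ->(severity, time, _progname, msg) {
      JSON.generate({ level: severity, ts: time.utc.iso8601 }.merge(msg)) + "\n"
    }
  end

  def request(user_id:, duration_ms:, error: nil)
    payload = {
      request_id: SecureRandom.uuid, # correlates all lines for one request
      user_id: user_id,
      duration_ms: duration_ms
    }
    payload[:error] = error if error
    @logger.info(payload)
  end
end

# Usage: one structured line per request instead of free-form log text.
StructuredLogger.new.request(user_id: 42, duration_ms: 187)
```

Because every line is parseable JSON with stable keys, queries like "P95 duration_ms for user X" become trivial, which is exactly what free-form Rails.logger strings cannot support.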
Distributed tracing across services
We propagated trace context from the Rails API to the Sidekiq background job workers and the video transcoding service, using Tempo as the trace backend. P95 and P99 latency became visible per endpoint for the first time.
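The mechanism underneath this is the W3C traceparent header: as long as a Sidekiq job carries the parent request's trace_id, Tempo can stitch API request and background work into one trace. In production this is handled by the OpenTelemetry instrumentation gems; the helpers below are only a sketch of the header format, assuming the standard version-traceid-spanid-flags layout.

```ruby
require "securerandom"

# Sketch of W3C traceparent propagation - the mechanism OpenTelemetry
# uses to tie a Sidekiq job's spans back to the originating API request.
# These helpers are illustrative; the opentelemetry-instrumentation gems
# do this automatically in a real deployment.
def new_traceparent
  trace_id = SecureRandom.hex(16) # 32 hex chars, shared by the whole trace
  span_id  = SecureRandom.hex(8)  # 16 hex chars, unique per span
  "00-#{trace_id}-#{span_id}-01"  # version-traceid-spanid-flags
end

def child_traceparent(parent)
  _version, trace_id, _parent_span, flags = parent.split("-")
  # A child span keeps the trace_id but gets a fresh span_id, so the
  # backend can link the async job to the request that enqueued it.
  "00-#{trace_id}-#{SecureRandom.hex(8)}-#{flags}"
end

# Usage: the API stores the header in the job payload; the worker
# continues the same trace when it picks the job up.
job_args = { "video_id" => 123, "traceparent" => new_traceparent }
worker_header = child_traceparent(job_args["traceparent"])
```

Without this propagation step, the API request and the Sidekiq job show up as two unrelated traces, which is why async bugs were previously undiagnosable.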
SLO dashboards and alerting
We defined three SLOs: API availability (99.5%), session start latency (P95 < 2s), and video load time (P95 < 4s). Grafana dashboards surfaced error budgets. PagerDuty alerts fired when the error budget burn rate exceeded 5×.
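The burn-rate arithmetic behind that paging rule is simple: divide the observed error rate by the error rate the SLO permits. The numbers in the example below are illustrative, not the platform's real traffic.

```ruby
# Error-budget burn rate: observed error rate divided by the rate the
# SLO allows. At a sustained 5x burn, a 99.5% availability SLO's
# 30-day budget is exhausted in roughly six days.
def burn_rate(total_requests, failed_requests, slo_target)
  allowed_error_rate  = 1.0 - slo_target # e.g. 0.005 for a 99.5% SLO
  observed_error_rate = failed_requests.to_f / total_requests
  observed_error_rate / allowed_error_rate
end

# Hypothetical window: 100k requests with 2,500 failures is a 2.5%
# error rate - five times the 0.5% the SLO allows, i.e. the paging
# threshold described above.
burn_rate(100_000, 2_500, 0.995).round(2) # => 5.0
```

Alerting on burn rate rather than raw error count is what cut the noise: a brief blip that barely touches the budget never pages anyone, while a fast burn does.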
Runbook library
For the five most common alert types, we wrote structured runbooks linked directly from Grafana panel titles. Each runbook covered what the alert means, likely causes, diagnostic queries, and resolution steps. First responders stopped guessing.
Key Takeaways
- Structured logging is the highest-leverage first step - unstructured logs do not scale past about five engineers
- Trace context across async workers (Sidekiq) was the single change that made the hardest bugs diagnosable
- SLO-based alerting eliminates noise - the team went from 40+ alerts/week to 6 actionable alerts/week
- Runbooks linked from alerts cut time-to-resolution more than any tooling change
Facing Similar Challenges?
Book a free 30-minute audit and I will tell you what I see.