Web3 · 2025-12

Web3 Protocol: 99.95% Uptime Infrastructure for a DeFi API

A DeFi data aggregator serving 400 trading firms through a REST and WebSocket API had experienced three outages in four months, each lasting 40–90 minutes. Their infrastructure was two EC2 instances behind a load balancer with no auto-recovery. We rebuilt it on EKS with multi-AZ deployment, circuit breakers, and automated failover.

  • Deploy time: manual (~45 min, risky) → blue-green (6 minutes, instant rollback)
  • Deploy frequency: weekly (fear of deploying) → daily
  • Incidents: 3 outages in 4 months, avg 62-min MTTR → zero outages in 6 months post-migration
  • Cost impact: 3 enterprise clients retained (would have churned)

The Challenge

Each outage had a different root cause: a memory leak in the Node.js WebSocket server, an EC2 instance in a bad AZ, and a manual deployment gone wrong. The common thread was no self-healing: when something went wrong, an engineer had to notice, SSH in, and manually restart. At 2am on a Sunday, the on-call was a 25-year-old founder. Three of their largest clients had issued 30-day termination notices after the third outage.

The Approach

The goal was infrastructure that recovers itself. We chose EKS with multi-AZ node groups, Kubernetes liveness probes for automatic pod restart, HPA for load-based scaling, and a blue-green deployment strategy to eliminate deployment-caused outages. The WebSocket memory leak was an application-level fix we scoped separately - the infrastructure needed to work around it until the code was fixed.

The Implementation

EKS multi-AZ cluster with managed node groups

We provisioned an EKS cluster with managed node groups across three availability zones using Terraform. Node groups ran m6i.large On-Demand instances with a minimum of two nodes per AZ - the previous two-instance setup is what turned a single bad AZ into a full outage. Managed node groups replace unhealthy nodes automatically.
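A minimal Terraform sketch of that layout. Cluster name, subnet references, and sizes are illustrative placeholders, not the client's actual configuration:

```hcl
# Illustrative managed node group spanning three AZs.
resource "aws_eks_node_group" "api" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "api-workers"
  node_role_arn   = aws_iam_role.node.arn

  # One private subnet per AZ; EKS balances nodes across them.
  subnet_ids = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
    aws_subnet.private_c.id,
  ]

  instance_types = ["m6i.large"]
  capacity_type  = "ON_DEMAND"

  scaling_config {
    min_size     = 6   # ~2 per AZ across the three subnets
    desired_size = 6
    max_size     = 12
  }

  update_config {
    max_unavailable = 1   # roll nodes one at a time during upgrades
  }
}
```

Note that `min_size` is cluster-wide, not per-AZ; EKS spreads the group across the listed subnets, so six nodes across three subnets approximates two per AZ.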

AWS EKS · Terraform · AWS Managed Node Groups

Liveness probes and automatic pod restart

We added a /live endpoint to the API that checks memory usage (fails if >85% of container limit) and WebSocket connection count. Kubernetes liveness probes check it every 10 seconds - if it fails three consecutive times, the pod is restarted. The memory leak now causes a clean restart instead of a frozen server.
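A hypothetical sketch of the /live handler's memory check (the env var name, limit default, and handler names are illustrative; the WebSocket-connection check is omitted for brevity):

```typescript
import { IncomingMessage, ServerResponse } from "node:http";

// The matching Kubernetes probe polls this endpoint every 10s and restarts
// the pod after three consecutive failures (periodSeconds: 10, failureThreshold: 3).
const MEMORY_LIMIT_BYTES = Number(process.env.MEMORY_LIMIT_BYTES ?? 512 * 1024 * 1024);
const MEMORY_FAIL_RATIO = 0.85; // fail liveness above 85% of the container limit

// Pure predicate so the threshold logic is unit-testable on its own.
export function isLive(rssBytes: number, limitBytes: number): boolean {
  return rssBytes / limitBytes <= MEMORY_FAIL_RATIO;
}

// Wired into the API's HTTP server as the /live route.
export function liveHandler(_req: IncomingMessage, res: ServerResponse): void {
  const healthy = isLive(process.memoryUsage().rss, MEMORY_LIMIT_BYTES);
  res.writeHead(healthy ? 200 : 503, { "Content-Type": "text/plain" });
  res.end(healthy ? "ok" : "memory pressure");
}
```

The point of failing the probe from inside the app is that a leaking process returns 503 while it can still respond at all, so Kubernetes restarts it before it freezes entirely.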

Kubernetes · Node.js · Prometheus

Blue-green deployment with instant rollback

All deployments now go through a blue-green process: new version deployed to green, smoke tests run against green via an internal endpoint, then the Kubernetes Service selector switches traffic in under one second. The previous version remains running for 10 minutes as a rollback target. Deployment-caused outages dropped to zero.
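The traffic switch itself is just a label change on the Service selector. A sketch, with illustrative names and labels:

```yaml
# Illustrative Service; flipping `version` from blue to green moves
# all traffic in one step, with no pod restarts involved.
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: defi-api
    version: green   # was "blue"; patched by the deploy pipeline
  ports:
    - port: 80
      targetPort: 8080
```

After smoke tests pass, the pipeline patches the selector (e.g. `kubectl patch service api -p '{"spec":{"selector":{"version":"green"}}}'`); the blue Deployment stays up as the rollback target, so rolling back is the same one-field patch in reverse.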

Kubernetes · GitHub Actions · AWS ALB

Alerting with 2-minute MTTR target

We set up Prometheus scraping the API metrics endpoint, with Alertmanager routing critical alerts (error rate >2%, pod crash loop) directly to PagerDuty with a 5-minute escalation. We wrote runbooks for the four most common alerts. MTTR in the first month averaged 4 minutes - down from 62 minutes.
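A sketch of one of the critical alert rules; the metric names, label values, and runbook path are assumptions, not the client's actual config:

```yaml
# Illustrative Prometheus rule for the error-rate alert.
groups:
  - name: api-critical
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.02
        for: 2m
        labels:
          severity: critical   # Alertmanager routes severity=critical to PagerDuty
        annotations:
          summary: "API 5xx error rate above 2%"
          runbook: "runbooks/high-error-rate.md"
```

Linking the runbook in the alert annotation is what makes the short escalation window workable: the on-call opens the page with the diagnosis steps already attached.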

Prometheus · Alertmanager · PagerDuty

Key Takeaways

  • Two instances behind a load balancer is not high availability - losing one AZ takes you to 50% capacity with no automatic recovery
  • Liveness probes that test application-specific health (not just HTTP 200) are the key to self-healing infrastructure
  • Blue-green deployment eliminates the deployment risk that causes the majority of outages at this infrastructure maturity level
  • A 5-minute PagerDuty escalation with a written runbook produces faster incident response than any hiring decision

Facing Similar Challenges?

Book a free 30-minute audit and I will tell you what I see.

Book Free Audit