"Monitoring" at most startups means: someone sets up Datadog, the default alerts fire constantly, the team mutes them, and the first sign of a real incident is a user complaint on Intercom.
Effective monitoring answers three questions at all times:
- Is my service healthy right now?
- When did it stop being healthy?
- What caused it?
Prometheus (metrics), Loki (logs), and Grafana (dashboards and alerts) answer all three. Here is the complete setup.
## The Stack
- **Prometheus** - time-series metrics database. Scrapes `/metrics` endpoints from your services and from Kubernetes itself. Stores data locally.
- **Alertmanager** - receives alerts from Prometheus rules, routes them to Slack/PagerDuty/email, and handles deduplication and grouping.
- **Grafana** - dashboards and alert visualisation. Queries Prometheus and Loki.
- **Loki** - log aggregation. Promtail (or the Grafana Alloy agent) ships logs from all pods to Loki. Loki indexes only metadata (labels), not the log content, which keeps storage costs low.
- **kube-state-metrics + node-exporter** - expose cluster-level metrics (pod status, node resources) that Prometheus scrapes.
## Installation via kube-prometheus-stack
The fastest way to get everything running is the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values monitoring-values.yaml
```
`monitoring-values.yaml`:
```yaml
# Grafana
grafana:
  enabled: true
  ingress:
    enabled: true
    hosts:
      - grafana.yourdomain.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.yourdomain.com
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: ""  # use a secret, not a plaintext value
  grafana.ini:
    auth.anonymous:
      enabled: false

# Prometheus
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi
    # Scrape all ServiceMonitors in the cluster
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

# Alertmanager
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: [alertname, cluster, namespace]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: slack
      routes:
        - match:
            severity: critical
          receiver: pagerduty
    receivers:
      - name: slack
        slack_configs:
          - api_url: "$SLACK_WEBHOOK_URL"
            channel: "#alerts"
            send_resolved: true
            title: '{{ .Status | toUpper }} | {{ .CommonLabels.alertname }}'
            text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
      - name: pagerduty
        pagerduty_configs:
          - routing_key: "$PAGERDUTY_KEY"
            description: '{{ .CommonLabels.alertname }}'
```
## Installing Loki
Install Loki separately with Promtail for log shipping:
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi \
  --set promtail.enabled=true \
  --set grafana.enabled=false  # already installed above
```
Add Loki as a data source in Grafana - inside the cluster it is reachable at http://loki.monitoring.svc:3100.
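Instead of clicking through the Grafana UI, you can declare the data source in the Helm values so it survives reinstalls. A minimal sketch using Grafana's data source provisioning format (the `additionalDataSources` key is the kube-prometheus-stack convention; adjust to your chart version):

```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki.monitoring.svc:3100
      isDefault: false
```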
## Instrumenting Your Application
Your application needs to expose a /metrics endpoint for Prometheus to scrape. In Node.js:
```javascript
const express = require('express');
const promClient = require('prom-client');

const app = express();

// Enable default metrics (event loop lag, GC, memory, etc.)
promClient.collectDefaultMetrics();

// Custom business metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

const activeConnections = new promClient.Gauge({
  name: 'http_active_connections',
  help: 'Number of active HTTP connections',
});

// Middleware: time every request and track in-flight connections
app.use((req, res, next) => {
  activeConnections.inc();
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    activeConnections.dec();
    end({
      method: req.method,
      route: req.route?.path || req.path,
      status_code: res.statusCode,
    });
  });
  next();
});

// Endpoint Prometheus scrapes
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});
```
Tell Prometheus to scrape it via a ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  namespace: production
  labels:
    release: kube-prometheus-stack  # must match the Helm release label
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
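The ServiceMonitor selects a Service by label and refers to a named port, so both must exist on the Service that fronts your pods. A sketch of a matching Service, assuming the API container listens on 3000 (adjust names and ports to your deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
  labels:
    app: api            # matched by the ServiceMonitor's selector
spec:
  selector:
    app: api
  ports:
    - name: http        # the ServiceMonitor's "port: http" refers to this name
      port: 80
      targetPort: 3000
```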
## The Alerts That Actually Matter
Most default Prometheus alerts are noise. These are the ones that matter for a startup:
```yaml
groups:
  - name: startup.rules
    rules:
      # High error rate - pages immediately
      - alert: HighErrorRate
        expr: |
          sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
            /
          sum(rate(http_request_duration_seconds_count[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."

      # High latency - p99 above 2 seconds
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2s on {{ $labels.route }}"

      # Pod crash-looping
      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"

      # Deployment stuck
      - alert: DeploymentRolloutStuck
        expr: |
          kube_deployment_status_observed_generation != kube_deployment_metadata_generation
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} rollout is stuck"

      # Node disk pressure
      - alert: NodeDiskPressure
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} disk usage above 85%"

      # High memory
      - alert: ContainerHighMemory
        expr: |
          container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }} at 90% of memory limit"
```
## Querying Logs in Grafana
Once Loki is connected, use LogQL to search logs. Useful queries:
```logql
# All errors from the API pod in the last hour
{namespace="production", app="api"} |= "ERROR"

# Slow requests (over 1 second) - assumes JSON-formatted logs with a duration field
{namespace="production", app="api"} | json | duration > 1s

# Recent exceptions, showing only the message field
{namespace="production"} |= "Exception" | line_format "{{.message}}"
```
## Building Useful Dashboards
Import these community dashboards by ID in Grafana (Dashboards → Import):
- **Node Exporter Full** (ID 1860) - CPU, memory, disk, network per node
- **Kubernetes Cluster Monitoring** (ID 315) - cluster-wide resource usage
- **Kubernetes Deployment Statefulset Daemonset** (ID 8588) - workload status
- **NGINX Ingress Controller** (ID 9614) - request rate, latency, errors at the ingress
Add a custom dashboard for your application with:
- Request rate (rps) over time
- Error rate %
- p50, p90, p99 latency
- Active pod count
- Last deployment time (from a gauge backed by a custom metric)
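These panels map directly onto the metrics defined earlier. A sketch of the backing queries, assuming the `http_request_duration_seconds` histogram from the instrumentation section and a `production` namespace:

```promql
# Request rate (rps)
sum(rate(http_request_duration_seconds_count[5m]))

# Error rate %
100 * sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
    / sum(rate(http_request_duration_seconds_count[5m]))

# p99 latency (swap 0.99 for 0.5 / 0.9 for the other quantiles)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Active pod count
count(kube_pod_status_phase{namespace="production", phase="Running"})
```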
## The SLA Dashboard
For any service with an SLA:
```promql
# 30-day availability: average ratio of successful (non-5xx) requests,
# evaluated as a subquery sampled every 5 minutes
avg_over_time(
  (
    sum(rate(http_request_duration_seconds_count{status_code!~"5.."}[5m]))
      /
    sum(rate(http_request_duration_seconds_count[5m]))
  )[30d:5m]
)
```
Put this on the main engineering dashboard. When leadership asks "what is our uptime?", you have a number backed by real data.
Running Kubernetes without production-grade monitoring? Book a free audit - we will review your current observability setup and identify what you are flying blind on.