DevOps · March 3, 2026 · 6 min read

Prometheus + Grafana + Loki: Production Monitoring for Kubernetes in 2026

Most startup Kubernetes clusters have monitoring that only surfaces problems after users report them. Here's how to set up a full observability stack - metrics, logs, and alerts - that catches issues before they reach users.

"Monitoring" at most startups means: someone sets up Datadog, the default alerts fire constantly, the team mutes them, and the first sign of a real incident is a user complaint on Intercom.

Effective monitoring answers three questions at all times:

  1. Is my service healthy right now?
  2. When did it stop being healthy?
  3. What caused it?

Prometheus (metrics), Loki (logs), and Grafana (dashboards and alerts) answer all three. Here is the complete setup.

The Stack

Prometheus - time-series metrics database. Scrapes /metrics endpoints from your services and Kubernetes itself. Stores data locally.
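For reference, a scraped /metrics endpoint is just plain text in the Prometheus exposition format - something like this (metric name and labels are illustrative):

```text
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status_code="200"} 1027
```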

Alertmanager - receives alerts from Prometheus rules, routes them to Slack/PagerDuty/email, handles deduplication and grouping.

Grafana - dashboards and alert visualisation. Queries Prometheus and Loki.

Loki - log aggregation. Promtail (or the Grafana Alloy agent) ships logs from all pods to Loki. Loki indexes only metadata (labels), not the log content - this keeps costs low.

Kube-state-metrics + node-exporter - expose cluster-level metrics (pod status, node resources) that Prometheus scrapes.
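As a sketch of what those two exporters buy you, here are PromQL queries over metrics they actually expose (adjust label filters to your cluster):

```promql
# Pods stuck in Pending, per namespace (kube-state-metrics)
sum by (namespace) (kube_pod_status_phase{phase="Pending"})

# Free bytes on the root filesystem, per node (node-exporter)
node_filesystem_avail_bytes{mountpoint="/"}
```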

Installation via kube-prometheus-stack

The fastest way to get everything running is the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values monitoring-values.yaml
```

monitoring-values.yaml:

```yaml
# Grafana
grafana:
  enabled: true
  ingress:
    enabled: true
    hosts:
      - grafana.yourdomain.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.yourdomain.com
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: ""  # use a secret, not a plaintext value
  grafana.ini:
    auth.anonymous:
      enabled: false

# Prometheus
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi
    # Scrape all ServiceMonitors in the cluster
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

# Alertmanager
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: [alertname, cluster, namespace]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: slack
      routes:
        - match:
            severity: critical
          receiver: pagerduty
    receivers:
      - name: slack
        slack_configs:
          - api_url: "$SLACK_WEBHOOK_URL"
            channel: "#alerts"
            send_resolved: true
            title: '{{ .Status | toUpper }} | {{ .CommonLabels.alertname }}'
            text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
      - name: pagerduty
        pagerduty_configs:
          - routing_key: "$PAGERDUTY_KEY"
            description: '{{ .CommonLabels.alertname }}'
```

Installing Loki

Install Loki separately with Promtail for log shipping:

```bash
helm repo add grafana https://grafana.github.io/helm-charts

helm upgrade --install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi \
  --set promtail.enabled=true \
  --set grafana.enabled=false  # already installed above
```

Add Loki as a data source in Grafana - inside the cluster it is reachable at http://loki.monitoring.svc:3100.
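Instead of clicking through the Grafana UI, you can provision the data source through the kube-prometheus-stack values - the bundled Grafana chart supports an additionalDataSources list (URL assumed to match the loki-stack install above):

```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki.monitoring.svc:3100
```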

Instrumenting Your Application

Your application needs to expose a /metrics endpoint for Prometheus to scrape. In Node.js:

```javascript
const promClient = require('prom-client');

// Enable default metrics (event loop lag, GC, memory, etc.)
promClient.collectDefaultMetrics();

// Custom business metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

const activeConnections = new promClient.Gauge({
  name: 'http_active_connections',
  help: 'Number of active HTTP connections',
});

// Middleware
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({
      method: req.method,
      route: req.route?.path || req.path,
      status_code: res.statusCode,
    });
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});
```

Tell Prometheus to scrape it via a ServiceMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  namespace: production
  labels:
    release: kube-prometheus-stack  # must match the helm release label
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
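Note that a ServiceMonitor selects a Service, not pods directly - so there must be a Service carrying the app: api label with a port named http. A minimal sketch (the port number is an assumption, match it to your container):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
  labels:
    app: api
spec:
  selector:
    app: api
  ports:
    - name: http        # must match the ServiceMonitor endpoint port name
      port: 3000
      targetPort: 3000
```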

The Alerts That Actually Matter

Most default Prometheus alerts are noise. These are the ones that matter for a startup:

```yaml
groups:
  - name: startup.rules
    rules:
      # High error rate - pages immediately
      - alert: HighErrorRate
        expr: |
          sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
            /
          sum(rate(http_request_duration_seconds_count[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."

      # High latency - p99 above 2 seconds
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2s on {{ $labels.route }}"

      # Pod crash-looping
      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"

      # Deployment stuck
      - alert: DeploymentRolloutStuck
        expr: |
          kube_deployment_status_observed_generation
            != kube_deployment_metadata_generation
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} rollout is stuck"

      # Node disk pressure
      - alert: NodeDiskPressure
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} disk usage above 85%"

      # High memory
      - alert: ContainerHighMemory
        expr: |
          container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }} at 90% memory limit"
```
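With kube-prometheus-stack, rule groups like these are deployed as a PrometheusRule resource; the release label must match the Helm release or the operator ignores it. A sketch of the wrapper:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: startup-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # must match the helm release label
spec:
  groups:
    - name: startup.rules
      rules:
        # the rule definitions shown above go here, under spec.groups
        - alert: HighErrorRate
          expr: |
            sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
              /
            sum(rate(http_request_duration_seconds_count[5m])) > 0.05
          for: 2m
          labels:
            severity: critical
```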

Querying Logs in Grafana

Once Loki is connected, use LogQL to search logs. Useful queries:

```logql
# All errors from the API pod in the last hour
{namespace="production", app="api"} |= "ERROR"

# Slow requests (over 1 second)
{namespace="production", app="api"} | json | duration > 1s

# Recent exceptions
{namespace="production"} |= "Exception" | line_format "{{.message}}"
```

Building Useful Dashboards

Import these community dashboards by ID in Grafana (Dashboards → Import):

  • Node Exporter Full: 1860 - CPU, memory, disk, network per node
  • Kubernetes Cluster Monitoring: 315 - cluster-wide resource usage
  • Kubernetes Deployment Statefulset Daemonset: 8588 - workload status
  • NGINX Ingress Controller: 9614 - request rate, latency, errors at the ingress

Add a custom dashboard for your application with:

  • Request rate (rps) over time
  • Error rate %
  • p50, p90, p99 latency
  • Active pod count
  • Last deployment time (from a gauge backed by a custom metric)
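Assuming the http_request_duration_seconds histogram from the instrumentation section, the panel queries might look like this (sketches - adjust label names to your setup):

```promql
# Request rate (rps)
sum(rate(http_request_duration_seconds_count[5m]))

# Error rate %
100 * sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
    / sum(rate(http_request_duration_seconds_count[5m]))

# p99 latency
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Active (ready) pod count, from kube-state-metrics
sum(kube_pod_status_ready{namespace="production", condition="true"})
```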

The SLA Dashboard

For any service with an SLA:

```promql
# 30-day availability: the average non-5xx request ratio,
# computed from 5-minute windows over the last 30 days
avg_over_time(
  (
    sum(rate(http_request_duration_seconds_count{status_code!~"5.."}[5m]))
      /
    sum(rate(http_request_duration_seconds_count[5m]))
  )[30d:5m]
)
```
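The same data gives you an error budget. For a 99.9% monthly target, the fraction of budget already burned makes a useful companion panel (a sketch: 1 minus the 30-day availability, divided by the 0.1% budget):

```promql
# Fraction of a 99.9% error budget consumed over 30 days (1.0 = budget exhausted)
(
  1 - avg_over_time(
    (
      sum(rate(http_request_duration_seconds_count{status_code!~"5.."}[5m]))
        /
      sum(rate(http_request_duration_seconds_count[5m]))
    )[30d:5m]
  )
) / 0.001
```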

Put this on the main engineering dashboard. When leadership asks "what is our uptime?", you have a number backed by real data.


Running Kubernetes without production-grade monitoring? Book a free audit - we will review your current observability setup and identify what you are flying blind on.

RK
RKSSH LLP
DevOps Engineer · rkssh.com

I help funded startups fix their CI/CD pipelines and Kubernetes infrastructure. If this post was useful and you want to talk through your specific situation, book a free 30-minute audit.
