"Monitoring" at most startups means: someone sets up Datadog, the default alerts fire constantly, the team mutes them, and the first sign of a real incident is a user complaint on Intercom.
Effective monitoring answers three questions at all times:
- Is my service healthy right now?
- When did it stop being healthy?
- What caused it?
Prometheus (metrics), Loki (logs), and Grafana (dashboards and alerts) answer all three. Here is the complete setup.
## The Stack
- **Prometheus** - time-series metrics database. Scrapes `/metrics` endpoints from your services and from Kubernetes itself. Stores data locally.
- **Alertmanager** - receives alerts from Prometheus rules, routes them to Slack/PagerDuty/email, and handles deduplication and grouping.
- **Grafana** - dashboards and alert visualisation. Queries Prometheus and Loki.
- **Loki** - log aggregation. Promtail (or the Grafana Alloy agent) ships logs from all pods to Loki. Loki indexes only metadata (labels), not the log content, which keeps storage costs low.
- **kube-state-metrics + node-exporter** - expose cluster-level metrics (pod status, node resources) that Prometheus scrapes.
## Installation via kube-prometheus-stack
The fastest way to get everything running is the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values monitoring-values.yaml
```
`monitoring-values.yaml`:
```yaml
# Grafana
grafana:
  enabled: true
  ingress:
    enabled: true
    hosts:
      - grafana.yourdomain.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.yourdomain.com
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: ""  # use a secret, not a plaintext value
  grafana.ini:
    auth.anonymous:
      enabled: false

# Prometheus
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi
    # Scrape all ServiceMonitors in the cluster
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

# Alertmanager
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: [alertname, cluster, namespace]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: slack
      routes:
        - match:
            severity: critical
          receiver: pagerduty
    receivers:
      - name: slack
        slack_configs:
          - api_url: "$SLACK_WEBHOOK_URL"
            channel: "#alerts"
            send_resolved: true
            title: '{{ .Status | toUpper }} | {{ .CommonLabels.alertname }}'
            text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
      - name: pagerduty
        pagerduty_configs:
          - routing_key: "$PAGERDUTY_KEY"
            description: '{{ .CommonLabels.alertname }}'
```
## Installing Loki
Install Loki separately with Promtail for log shipping:
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi \
  --set promtail.enabled=true \
  --set grafana.enabled=false  # already installed above
```
Add Loki as a data source in Grafana - inside the cluster it is reachable at http://loki.monitoring.svc:3100.
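Instead of clicking through the Grafana UI, you can declare the data source in the Helm values so it survives reinstalls. A minimal sketch using Grafana's data source provisioning format (the `additionalDataSources` key is the kube-prometheus-stack convention; adjust to your chart version):

```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki.monitoring.svc:3100
      isDefault: false
```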
## Instrumenting Your Application
Your application needs to expose a /metrics endpoint for Prometheus to scrape. In Node.js:
```javascript
const express = require('express');
const promClient = require('prom-client');

const app = express();

// Enable default metrics (event loop lag, GC, memory, etc.)
promClient.collectDefaultMetrics();

// Custom business metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

const activeConnections = new promClient.Gauge({
  name: 'http_active_connections',
  help: 'Number of active HTTP connections',
});

// Middleware: time every request and track in-flight connections
app.use((req, res, next) => {
  activeConnections.inc();
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    activeConnections.dec();
    end({
      method: req.method,
      route: req.route?.path || req.path,
      status_code: res.statusCode,
    });
  });
  next();
});

// Endpoint Prometheus scrapes
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});
```
Tell Prometheus to scrape it via a ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  namespace: production
  labels:
    release: kube-prometheus-stack  # must match the Helm release label
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
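The ServiceMonitor selects a Service by label and refers to a named port, so both must exist on the Service that fronts your pods. A sketch of a matching Service, assuming the API container listens on 3000 (adjust names and ports to your deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
  labels:
    app: api            # matched by the ServiceMonitor's selector
spec:
  selector:
    app: api
  ports:
    - name: http        # the ServiceMonitor's "port: http" refers to this name
      port: 80
      targetPort: 3000
```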
## The Alerts That Actually Matter
Most default Prometheus alerts are noise. These are the ones that matter for a startup:
```yaml
groups:
  - name: startup.rules
    rules:
      # High error rate - pages immediately
      - alert: HighErrorRate
        expr: |
          sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
            /
          sum(rate(http_request_duration_seconds_count[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."

      # High latency - p99 above 2 seconds
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2s on {{ $labels.route }}"

      # Pod crash-looping
      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"

      # Deployment stuck
      - alert: DeploymentRolloutStuck
        expr: |
          kube_deployment_status_observed_generation != kube_deployment_metadata_generation
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} rollout is stuck"

      # Node disk pressure
      - alert: NodeDiskPressure
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} disk usage above 85%"

      # High memory
      - alert: ContainerHighMemory
        expr: |
          container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }} at 90% of memory limit"
```
## Querying Logs in Grafana
Once Loki is connected, use LogQL to search logs. Useful queries:
```logql
# All errors from the API pod in the last hour
{namespace="production", app="api"} |= "ERROR"

# Slow requests (over 1 second) - assumes JSON-formatted logs with a duration field
{namespace="production", app="api"} | json | duration > 1s

# Recent exceptions, showing only the message field
{namespace="production"} |= "Exception" | line_format "{{.message}}"
```
## Building Useful Dashboards
Import these community dashboards by ID in Grafana (Dashboards → Import):
- **Node Exporter Full** (ID 1860) - CPU, memory, disk, network per node
- **Kubernetes Cluster Monitoring** (ID 315) - cluster-wide resource usage
- **Kubernetes Deployment Statefulset Daemonset** (ID 8588) - workload status
- **NGINX Ingress Controller** (ID 9614) - request rate, latency, errors at the ingress
Add a custom dashboard for your application with:
- Request rate (rps) over time
- Error rate %
- p50, p90, p99 latency
- Active pod count
- Last deployment time (from a gauge backed by a custom metric)
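These panels map directly onto the metrics defined earlier. A sketch of the backing queries, assuming the `http_request_duration_seconds` histogram from the instrumentation section and a `production` namespace:

```promql
# Request rate (rps)
sum(rate(http_request_duration_seconds_count[5m]))

# Error rate %
100 * sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
    / sum(rate(http_request_duration_seconds_count[5m]))

# p99 latency (swap 0.99 for 0.5 / 0.9 for the other quantiles)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Active pod count
count(kube_pod_status_phase{namespace="production", phase="Running"})
```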
## The SLA Dashboard
For any service with an SLA:
```promql
# 30-day availability: average ratio of successful (non-5xx) requests,
# evaluated as a subquery sampled every 5 minutes
avg_over_time(
  (
    sum(rate(http_request_duration_seconds_count{status_code!~"5.."}[5m]))
      /
    sum(rate(http_request_duration_seconds_count[5m]))
  )[30d:5m]
)
```
Put this on the main engineering dashboard. When leadership asks "what is our uptime?", you have a number backed by real data.
Running Kubernetes without production-grade monitoring? Book a free audit - we will review your current observability setup and identify what you are flying blind on.