How to monitor Docker containers in production?
Monitoring Docker in production means knowing four things at all times: is the container alive, is it healthy, is it using too many resources, is it restarting too often. The tools for this are well-established; the work is wiring them together.
Theory
TL;DR
- Three layers to monitor: the host (Linux metrics), the Docker daemon, the containers themselves.
- Standard stack: cAdvisor (per-container metrics) + Prometheus (TSDB) + Grafana (dashboards) + Alertmanager (paging).
- Logs are separate: Loki / ELK / fluentd / Datadog Logs.
- What to alert on: unexpected restarts, healthcheck failures, OOM kills, CPU/memory saturation.
- Critical insight: container restarts are signals; docker stats will not catch a flapping container that runs for 30s and dies.
What to measure
Host level:
- CPU/memory/disk/network at host scale
- Docker daemon uptime
Container level:
- CPU usage (% of limit)
- Memory usage (vs cgroup limit)
- Network I/O
- Block I/O
- Restart count
- Health status (healthy/unhealthy)
- Uptime
Application level (inside the container):
- HTTP latency / errors
- Request rate
- Custom business metrics

Docker-specific monitoring covers layers 1-2. App-level metrics are exported by the app itself (a /metrics endpoint that Prometheus scrapes).
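A quick sanity check for that third layer, assuming an app that already exposes a Prometheus endpoint on port 3000 (the port and metric names are assumptions; substitute your own):

curl -s http://localhost:3000/metrics | head
# Illustrative output; names depend on your instrumentation library:
# # HELP http_requests_total Total HTTP requests
# # TYPE http_requests_total counter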
The standard stack
cAdvisor — per-container metrics
Google's cAdvisor reads cgroup data and exports it as Prometheus metrics:
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    container_name: cadvisor
    privileged: true
    devices:
      - /dev/kmsg
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    restart: unless-stopped

Visit http://localhost:8080 for a quick UI; metrics at /metrics.
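A quick way to confirm cAdvisor is exporting per-container series (assumes the port mapping above):

curl -s http://localhost:8080/metrics | grep container_memory_working_set_bytes | head -3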
Key cAdvisor metrics:
- container_cpu_usage_seconds_total — CPU consumed
- container_memory_working_set_bytes — actual memory in use
- container_network_receive_bytes_total / container_network_transmit_bytes_total — network I/O
- container_fs_usage_bytes — disk usage
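Two PromQL queries built from these metrics, as a sketch of typical dashboard panels:

# Per-container CPU usage averaged over 5 minutes (in cores)
rate(container_cpu_usage_seconds_total{name!=""}[5m])

# Memory as a fraction of the cgroup limit
# (only meaningful for containers that have a memory limit configured)
container_memory_working_set_bytes{name!=""}
  / container_spec_memory_limit_bytes{name!=""}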
Prometheus — store and query
prometheus:
  image: prom/prometheus:v2.55.0
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    - promdata:/prometheus
  ports:
    - "9090:9090"
  command:
    - --config.file=/etc/prometheus/prometheus.yml
    - --storage.tsdb.retention.time=30d

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'docker-daemon'
    static_configs:
      # On Linux, host.docker.internal needs extra_hosts: ["host.docker.internal:host-gateway"]
      - targets: ['host.docker.internal:9323']

  - job_name: 'app'
    static_configs:
      - targets: ['api:3000']
    metrics_path: /metrics

Prometheus scrapes every 15 seconds and stores 30 days of metrics.
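Before (re)starting, it is worth validating the file. One hedged way to do it: promtool ships inside the prom/prometheus image, so it can be run without installing anything:

docker run --rm --entrypoint promtool \
  -v "$(pwd)/prometheus.yml:/prometheus.yml:ro" \
  prom/prometheus:v2.55.0 check config /prometheus.yml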
Grafana — dashboards
grafana:
  image: grafana/grafana:11.3.0
  ports:
    - "3000:3000"
  environment:
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
  volumes:
    - grafana:/var/lib/grafana

Import pre-built dashboards: "Docker and System Monitoring" (ID: 893), "Docker Container & Host Metrics" (ID: 10619). Five clicks and you have full container observability.
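Instead of clicking through the UI, the Prometheus datasource can be provisioned from a file. A minimal sketch; the host-side filename is your choice, mounted into /etc/grafana/provisioning/datasources/:

# ./grafana-datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true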
Alertmanager — paging
alertmanager:
  image: prom/alertmanager:v0.27.0
  ports:
    - "9093:9093"
  volumes:
    - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro

Define alerts in Prometheus, route through Alertmanager to Slack/PagerDuty/email.
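Two pieces of wiring are easy to forget: Prometheus must load the rule file and know where Alertmanager lives, and Alertmanager needs at least one receiver. Minimal sketches (the Slack webhook URL is a placeholder):

# prometheus.yml additions
rule_files:
  - /etc/prometheus/rules.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# alertmanager.yml
route:
  receiver: slack
receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME  # placeholder
        channel: '#alerts'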
Essential alerts
# rules.yml
groups:
  - name: docker
    rules:
      - alert: ContainerDown
        expr: time() - container_last_seen{name!=""} > 300
        annotations:
          summary: "Container {{ $labels.name }} not seen for 5 minutes"

      - alert: ContainerHighMemory
        expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.9
        for: 10m
        annotations:
          summary: "Container {{ $labels.name }} above 90% memory"

      - alert: ContainerOOMKilled
        expr: increase(container_oom_events_total[5m]) > 0
        annotations:
          summary: "Container {{ $labels.name }} OOM-killed in last 5 minutes"

      - alert: ContainerRestarting
        # changes(), not increase(): container_start_time_seconds is a gauge (a timestamp)
        expr: changes(container_start_time_seconds{name!=""}[15m]) > 2
        annotations:
          summary: "Container {{ $labels.name }} restarted >2 times in 15 minutes"

      - alert: ContainerUnhealthy
        # 0 = unhealthy; requires an exporter that reports Docker health state (cAdvisor does not)
        expr: container_health_status == 0
        for: 2m
        annotations:
          summary: "Container {{ $labels.name }} unhealthy for 2 minutes"

These five cover most production failure modes.
Logs
Metrics tell you something is wrong; logs tell you why. Standard production stacks:
- Loki + Promtail + Grafana — Prometheus-flavored, log labels match metric labels.
- ELK (Elasticsearch + Logstash + Kibana) — heavyweight but powerful search.
- Fluentd / Fluent Bit — log collector, ships to anywhere.
- Vector — modern alternative to fluentd, lower overhead.
At the Docker level, configure log drivers:
services:
api:
image: myapp
logging:
driver: json-file
options:
max-size: "10m"
max-file: "3"
# Also: tag, labels for centralized routingWithout max-size, default json-file logs grow forever and fill the disk.
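To see how much disk the json-file logs already occupy on a host (run as root):

sudo du -h /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h | tail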
Health-driven monitoring
If your containers have healthcheck: defined, the health state can be exported as container_health_status and scraped by Prometheus (note: cAdvisor does not report health natively; you need an exporter that reads Docker's container state). Alert on it. The healthcheck logic itself is your liveness probe.
services:
  api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      retries: 3

No healthcheck = monitoring blind spot.
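One caveat: the test command must exist inside the image (curl often does not in slim images). To see what Docker currently thinks of a container's health ('api' assumes container_name: api; Compose may generate a name like project-api-1):

docker inspect --format '{{.State.Health.Status}}' api   # healthy | unhealthy | starting
docker inspect --format '{{json .State.Health}}' api     # recent probe results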
Common mistakes
No alerts at all
A dashboard nobody watches is not monitoring. Set up Alertmanager and paging from day one. "We will check the dashboard" never works.
Logs without retention or rotation
services:
  api:
    image: myapp
    # default json-file logs grow without bound

After 6 months, /var/lib/docker/containers/<id>/*-json.log can be hundreds of GB. Always set max-size and max-file.
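Rotation can also be set host-wide in daemon.json, so services need no per-service config (note: this only applies to containers created after the daemon restart):

# /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}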
Monitoring host metrics but not container metrics
A host with 50% CPU could be one container at 100% CPU. Per-container metrics let you find the noisy neighbor.
Forgetting to scrape the daemon itself
The Docker daemon exposes Prometheus metrics if enabled (/etc/docker/daemon.json: { "metrics-addr": "0.0.0.0:9323", "experimental": true }). Daemon-level metrics show daemon health, image push/pull rates, etc.
Ignoring restart count
A container with --restart=unless-stopped that flaps every 30 seconds looks "up" most of the time but is broken. Counting changes of container_start_time_seconds (the ContainerRestarting rule above) catches this.
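Two ways to see it: ad hoc via the CLI, or continuously via PromQL.

# Ad hoc ('api' is a placeholder container name):
docker inspect --format '{{.RestartCount}}' api

# Continuously (same expression as the ContainerRestarting rule):
changes(container_start_time_seconds{name!=""}[15m])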
Real-world usage
- Small/medium teams: the Compose-based cAdvisor + Prometheus + Grafana + Loki stack. ~30 minutes to set up, covers 90% of needs.
- Cloud providers: AWS CloudWatch Container Insights, GCP Cloud Monitoring, Azure Container Insights. Managed, no setup, billed per metric.
- Datadog / New Relic / Honeycomb: SaaS APM with Docker integration. Pay for convenience.
- Kubernetes-style: Prometheus Operator + kube-state-metrics + node-exporter. Same idea, K8s-native.
Follow-up questions
Q: Can I use docker stats for production monitoring?
A: No. It is a live snapshot, not a TSDB. Useful for ad-hoc inspection ("why is this slow right now?"), useless for trends or alerts.
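For that ad-hoc case, assuming the Docker CLI on the host:

docker stats --no-stream   # one snapshot instead of a live stream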
Q: Why use cAdvisor instead of just docker stats?
A: cAdvisor exposes a Prometheus endpoint, so metrics get scraped, stored, and queryable historically. docker stats does not.
Q: Does Docker daemon itself expose metrics?
A: Yes, if you enable it: /etc/docker/daemon.json with "metrics-addr": "0.0.0.0:9323". Then scrape host:9323/metrics.
Q: What is the difference between metrics and logs?
A: Metrics are aggregates over time (CPU 75% at t=12:00). Logs are events ("GET /api/users 200 in 12ms at t=12:00:01"). Both needed; metrics for alerts and trends, logs for debugging the cause.
Q: (Senior) How do you correlate metrics, logs, and traces in production Docker?
A: Add labels everywhere: container labels (com.docker.stack=myapp), match those in cAdvisor metrics, propagate via Loki labels for logs, and add OpenTelemetry trace IDs to log lines. Grafana's "Explore" view lets you click a metric spike, jump to logs at that timestamp with the same labels, then jump to a trace by trace ID. The infrastructure: cAdvisor + Prometheus + Loki + Tempo behind one Grafana. The hard part: instrumenting the app well, not the deployment.
Examples
Compose-based monitoring stack
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    privileged: true
    devices: ["/dev/kmsg"]
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    networks: [monitor]

  prometheus:
    image: prom/prometheus:v2.55.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules.yml:/etc/prometheus/rules.yml:ro
      - promdata:/prometheus
    ports: ["9090:9090"]
    networks: [monitor]

  grafana:
    image: grafana/grafana:11.3.0
    ports: ["3001:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
    volumes:
      - grafana:/var/lib/grafana
    networks: [monitor]

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports: ["9093:9093"]
    networks: [monitor]

  loki:
    image: grafana/loki:3.2.0
    ports: ["3100:3100"]
    networks: [monitor]

  promtail:
    image: grafana/promtail:3.2.0
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail.yml:/etc/promtail/config.yml:ro
    networks: [monitor]

volumes:
  promdata:
  grafana:

networks:
  monitor:

Four pillars (metrics from cAdvisor, store in Prometheus, alert via Alertmanager, dashboards in Grafana) plus log shipping (Promtail → Loki). One docker compose up.
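The stack mounts ./promtail.yml but the file itself is not shown above. A minimal sketch that tails Docker's json-file logs into Loki:

# promtail.yml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log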
Setting log rotation everywhere
# In your app stack's compose.yaml
x-logging: &default-logging
  driver: json-file
  options:
    max-size: "10m"
    max-file: "3"
    # 'tag' is only valid on drivers like fluentd/gelf/syslog, not json-file

services:
  api:
    image: myapp
    logging: *default-logging
  worker:
    image: myworker
    logging: *default-logging

YAML anchors apply the same logging config across services. Disk usage stays bounded.
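To verify the anchor actually applied (container name depends on your Compose project; 'api' here assumes container_name: api):

docker inspect --format '{{json .HostConfig.LogConfig}}' api
# illustrative output: {"Type":"json-file","Config":{"max-file":"3","max-size":"10m"}}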
Daemon-level metrics endpoint
# /etc/docker/daemon.json
{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}

sudo systemctl restart docker
curl http://localhost:9323/metrics | head
# # HELP engine_daemon_engine_info ...
# # TYPE engine_daemon_engine_info gauge
# ...

Now Prometheus can scrape the daemon at host:9323 for engine-level metrics.