How to monitor Docker containers in production?

Monitoring Docker in production means knowing four things at all times: whether the container is alive, whether it is healthy, whether it is using too many resources, and whether it is restarting too often. The tools for this are well-established; the work is wiring them together.

Theory

TL;DR

  • Three layers to monitor: the host (Linux metrics), the Docker daemon, the containers themselves.
  • Standard stack: cAdvisor (per-container metrics) + Prometheus (TSDB) + Grafana (dashboards) + Alertmanager (paging).
  • Logs are separate: Loki / ELK / fluentd / Datadog Logs.
  • What to alert on: unexpected restarts, healthcheck failures, OOM kills, CPU/memory saturation.
  • Critical insight: container restarts are signals — docker stats will not catch a flapping container that runs for 30s and dies.

What to measure

Host level:
  • CPU/memory/disk/network at host scale
  • Docker daemon uptime

Container level:
  • CPU usage (% of limit)
  • Memory usage (vs cgroup limit)
  • Network I/O
  • Block I/O
  • Restart count
  • Health status (healthy/unhealthy)
  • Uptime

Application level (inside the container):
  • HTTP latency / errors
  • Request rate
  • Custom business metrics

Docker-specific monitoring covers the first two levels (host and container). Application-level metrics are exported by the app itself, typically a /metrics endpoint that Prometheus scrapes.
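The "% of limit" ratios above take a little PromQL arithmetic on top of the raw cAdvisor series. A sketch as Prometheus recording rules, assuming the standard cAdvisor metric names (the CPU quota/period metrics only exist for containers that actually have a CPU limit, and an unset memory limit makes the ratio meaningless):

yaml
# recording-rules.yml (sketch): precompute per-container usage-vs-limit ratios (0-1)
groups:
  - name: container-ratios
    rules:
      - record: container:cpu_usage:ratio_of_limit
        expr: >
          sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))
          / sum by (name) (container_spec_cpu_quota{name!=""} / container_spec_cpu_period{name!=""})
      - record: container:memory_working_set:ratio_of_limit
        # Returns +Inf for containers without a memory limit (limit reported as 0)
        expr: >
          sum by (name) (container_memory_working_set_bytes{name!=""})
          / sum by (name) (container_spec_memory_limit_bytes{name!=""})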

The standard stack

cAdvisor — per-container metrics

Google's cAdvisor reads cgroup data and exports it as Prometheus metrics:

yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    container_name: cadvisor
    privileged: true
    devices:
      - /dev/kmsg
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    restart: unless-stopped

Visit http://localhost:8080 for a quick UI; metrics at /metrics.
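A quick sanity check that the endpoint is exporting data before pointing Prometheus at it (the metric name is from the list below):

bash
# Confirm cAdvisor exposes per-container metrics
curl -s http://localhost:8080/metrics | grep -m 3 container_memory_working_set_bytes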

Key cAdvisor metrics:

  • container_cpu_usage_seconds_total — CPU consumed
  • container_memory_working_set_bytes — actual memory in use
  • container_network_receive_bytes_total / container_network_transmit_bytes_total — network I/O
  • container_fs_usage_bytes — disk usage

Prometheus — store and query

yaml
prometheus:
  image: prom/prometheus:v2.55.0
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    - promdata:/prometheus
  ports:
    - "9090:9090"
  command:
    - --config.file=/etc/prometheus/prometheus.yml
    - --storage.tsdb.retention.time=30d
yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: 'docker-daemon'
    static_configs:
      - targets: ['host.docker.internal:9323']
  - job_name: 'app'
    static_configs:
      - targets: ['api:3000']
    metrics_path: /metrics

Prometheus scrapes every 15 seconds and stores 30 days of metrics.
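The config above only covers scraping. For the alert rules and Alertmanager shown later to take effect, prometheus.yml also has to load the rules file and know where Alertmanager lives; a sketch, assuming the service names used in this stack:

yaml
# Additions to prometheus.yml: load alert rules and route firing alerts to Alertmanager
rule_files:
  - /etc/prometheus/rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']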

Grafana — dashboards

yaml
grafana:
  image: grafana/grafana:11.3.0
  ports:
    - "3000:3000"
  environment:
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
  volumes:
    - grafana:/var/lib/grafana

Import pre-built dashboards: "Docker and System Monitoring" (ID: 893), "Docker Container & Host Metrics" (ID: 10619). Five clicks and you have full container observability.
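To skip even those clicks, Grafana can be provisioned with its data sources from a file; a sketch assuming the Prometheus and Loki service names used in this stack and Grafana's standard provisioning directory:

yaml
# ./grafana/provisioning/datasources/datasources.yml
# Mount into the grafana container at /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100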

Alertmanager — paging

yaml
alertmanager:
  image: prom/alertmanager:v0.27.0
  ports:
    - "9093:9093"
  volumes:
    - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro

Define alerts in Prometheus, route through Alertmanager to Slack/PagerDuty/email.
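The alertmanager.yml mounted above is not shown here; a minimal sketch that routes everything to one Slack channel (the webhook URL and channel name are placeholders):

yaml
# alertmanager.yml: send all alerts to a single Slack channel
route:
  receiver: slack-oncall
  group_by: ['alertname', 'name']
  repeat_interval: 4h

receivers:
  - name: slack-oncall
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
        channel: '#alerts'
        send_resolved: true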

Essential alerts

yaml
# rules.yml
groups:
  - name: docker
    rules:
      - alert: ContainerDown
        expr: time() - container_last_seen{name!=""} > 300
        for: 5m
        annotations:
          summary: "Container {{ $labels.name }} not seen for 5 minutes"
      - alert: ContainerHighMemory
        expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.9
        for: 10m
        annotations:
          summary: "Container {{ $labels.name }} above 90% memory"
      - alert: ContainerOOMKilled
        expr: increase(container_oom_events_total[5m]) > 0
        annotations:
          summary: "Container {{ $labels.name }} OOM-killed in last 5 minutes"
      - alert: ContainerRestarting
        # changes() counts how often the start time moved (i.e. restarts);
        # increase() is meant for counters, not this gauge
        expr: changes(container_start_time_seconds[15m]) > 2
        annotations:
          summary: "Container {{ $labels.name }} restarted >2 times in 15 minutes"
      - alert: ContainerUnhealthy
        expr: container_health_status == 0  # 0 = unhealthy
        for: 2m
        annotations:
          summary: "Container {{ $labels.name }} unhealthy for 2 minutes"

These five cover most production failure modes.
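A syntax error in rules.yml will keep Prometheus from loading it, so validate the file before deploying. The prom/prometheus image ships promtool, so no extra install is needed:

bash
# Validate the rules file with the same Prometheus version you run
docker run --rm -v "$PWD:/work" --entrypoint promtool \
  prom/prometheus:v2.55.0 check rules /work/rules.yml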

Logs

Metrics tell you something is wrong; logs tell you why. Standard production stacks:

  • Loki + Promtail + Grafana — Prometheus-flavored, log labels match metric labels.
  • ELK (Elasticsearch + Logstash + Kibana) — heavyweight but powerful search.
  • Fluentd / Fluent Bit — log collector, ships to anywhere.
  • Vector — modern alternative to fluentd, lower overhead.

At the Docker level, configure log drivers:

yaml
services:
  api:
    image: myapp
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
        # Also: tag, labels for centralized routing

Without max-size, default json-file logs grow forever and fill the disk.

Health-driven monitoring

If your containers have healthcheck: defined, Prometheus / cAdvisor expose container_health_status. Alert on it. The healthcheck logic itself is your liveness probe.

yaml
services:
  api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      retries: 3

No healthcheck = monitoring blind spot.

Common mistakes

No alerts at all

A dashboard nobody watches is not monitoring. Set Alertmanager + paging from day one. "We will check the dashboard" never works.

Logs without retention or rotation

yaml
services:
  api:
    image: myapp
    # default json-file logs grow without bound

After six months, /var/lib/docker/containers/<id>/*-json.log can run to hundreds of gigabytes. Always set max-size and max-file.
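Per-service logging: blocks work, but it is easy to miss one. The daemon can also set the default for every newly created container (existing containers keep whatever they were started with); a sketch:

json
# /etc/docker/daemon.json (default log rotation for newly created containers)
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}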

Monitoring host metrics but not container metrics

A host with 50% CPU could be one container at 100% CPU. Per-container metrics let you find the noisy neighbor.
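With cAdvisor data in Prometheus, finding the noisy neighbor is one query; for example, using the metric names listed earlier:

promql
# Top 5 containers by CPU over the last 5 minutes
topk(5, sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])))

# Top 5 containers by memory working set
topk(5, container_memory_working_set_bytes{name!=""})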

Forgetting to scrape the daemon itself

The Docker daemon exposes Prometheus metrics if enabled (/etc/docker/daemon.json: { "metrics-addr": "0.0.0.0:9323", "experimental": true }). Daemon-level metrics show daemon health, image push/pull rates, etc.

Ignoring restart count

A container with --restart=unless-stopped that flaps every 30 seconds looks "up" most of the time but is broken. Counting changes in container_start_time_seconds (PromQL changes(), as in the ContainerRestarting rule above) catches this.

Real-world usage

  • Small/medium teams: the Compose-based cAdvisor + Prometheus + Grafana + Loki stack. ~30 minutes to set up, covers 90% of needs.
  • Cloud providers: AWS CloudWatch Container Insights, GCP Cloud Monitoring, Azure Container Insights. Managed, no setup, billed per metric.
  • Datadog / New Relic / Honeycomb: SaaS APM with Docker integration. Pay for convenience.
  • Kubernetes-style: Prometheus Operator + kube-state-metrics + node-exporter. Same idea, K8s-native.

Follow-up questions

Q: Can I use docker stats for production monitoring?


A: No. It is a live snapshot, not a TSDB. Useful for ad-hoc inspection ("why is this slow right now?"), useless for trends or alerts.

Q: Why use cAdvisor instead of just docker stats?


A: cAdvisor exposes a Prometheus endpoint, so metrics get scraped, stored, and queryable historically. docker stats does not.

Q: Does Docker daemon itself expose metrics?


A: Yes, if you enable it: /etc/docker/daemon.json with "metrics-addr": "0.0.0.0:9323". Then scrape host:9323/metrics.

Q: What is the difference between metrics and logs?


A: Metrics are aggregates over time (CPU 75% at t=12:00). Logs are events ("GET /api/users 200 in 12ms at t=12:00:01"). Both needed; metrics for alerts and trends, logs for debugging the cause.

Q: (Senior) How do you correlate metrics, logs, and traces in production Docker?


A: Add labels everywhere: container labels (com.docker.stack=myapp), match those in cAdvisor metrics, propagate via Loki labels for logs, and add OpenTelemetry trace IDs to log lines. Grafana's "Explore" view lets you click a metric spike, jump to logs at that timestamp with the same labels, then jump to a trace by trace ID. The infrastructure: cAdvisor + Prometheus + Loki + Tempo behind one Grafana. The hard part: instrumenting the app well, not the deployment.

Examples

Compose-based monitoring stack

yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    privileged: true
    devices: ["/dev/kmsg"]
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    networks: [monitor]

  prometheus:
    image: prom/prometheus:v2.55.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules.yml:/etc/prometheus/rules.yml:ro
      - promdata:/prometheus
    ports: ["9090:9090"]
    networks: [monitor]

  grafana:
    image: grafana/grafana:11.3.0
    ports: ["3001:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
    volumes:
      - grafana:/var/lib/grafana
    networks: [monitor]

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports: ["9093:9093"]
    networks: [monitor]

  loki:
    image: grafana/loki:3.2.0
    ports: ["3100:3100"]
    networks: [monitor]

  promtail:
    image: grafana/promtail:3.2.0
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail.yml:/etc/promtail/config.yml:ro
    networks: [monitor]

volumes:
  promdata:
  grafana:

networks:
  monitor:

Four pillars (metrics from cAdvisor, store in Prometheus, alert via Alertmanager, dashboards in Grafana) plus log shipping (Promtail → Loki). One docker compose up.
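The promtail.yml mounted above is not shown; a minimal sketch that tails the json-file logs already mounted into the container and pushes them to Loki (labels are deliberately coarse here, just a job label and the stream):

yaml
# promtail.yml: tail Docker's json-file logs and push them to Loki
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker-json-logs
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      # Docker's json-file driver wraps each line as {"log": ..., "stream": ..., "time": ...}
      - json:
          expressions:
            output: log
            stream: stream
      - labels:
          stream:
      - output:
          source: output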

Setting log rotation everywhere

yaml
# In your app stack's compose.yaml
x-logging: &default-logging
  driver: json-file
  options:
    max-size: "10m"
    max-file: "3"
    tag: "{{.Name}}"

services:
  api:
    image: myapp
    logging: *default-logging
  worker:
    image: myworker
    logging: *default-logging

YAML anchors apply the same logging config across services. Disk usage stays bounded.

Daemon-level metrics endpoint

json
# /etc/docker/daemon.json
{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}
bash
sudo systemctl restart docker
curl http://localhost:9323/metrics | head
# # HELP engine_daemon_engine_info ...
# # TYPE engine_daemon_engine_info gauge
# ...

Now Prometheus can scrape the daemon at host:9323 for engine-level metrics.
