How to monitor Docker containers in production?
Monitoring Docker in production means knowing four things at all times: is the container alive, is it healthy, is it using too many resources, is it restarting too often. The tools for this are well-established; the work is wiring them together.
Theory
TL;DR
- Three layers to monitor: the host (Linux metrics), the Docker daemon, the containers themselves.
- Standard stack: cAdvisor (per-container metrics) + Prometheus (TSDB) + Grafana (dashboards) + Alertmanager (paging).
- Logs are separate: Loki / ELK / fluentd / Datadog Logs.
- What to alert on: unexpected restarts, healthcheck failures, OOM kills, CPU/memory saturation.
- Critical insight: container restarts are signals; docker stats will not catch a flapping container that runs for 30s and dies.
What to measure
Host level:
- CPU/memory/disk/network at host scale
- Docker daemon uptime
Container level:
- CPU usage (% of limit)
- Memory usage (vs cgroup limit)
- Network I/O
- Block I/O
- Restart count
- Health status (healthy/unhealthy)
- Uptime
Application level (inside the container):
- HTTP latency / errors
- Request rate
- Custom business metrics

Docker-specific monitoring covers layers 1-2. App-level metrics are exported by the app itself (a /metrics endpoint that Prometheus scrapes).
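A quick sanity check for that third layer, assuming an app that already exposes a Prometheus endpoint on port 3000 (the port and metric names are assumptions; substitute your own):

curl -s http://localhost:3000/metrics | head
# Illustrative output; names depend on your instrumentation library:
# # HELP http_requests_total Total HTTP requests
# # TYPE http_requests_total counter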
The standard stack
cAdvisor — per-container metrics
Google's cAdvisor reads cgroup data and exports it as Prometheus metrics:
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    container_name: cadvisor
    privileged: true
    devices:
      - /dev/kmsg
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    restart: unless-stopped

Visit http://localhost:8080 for a quick UI; metrics at /metrics.
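A quick way to confirm cAdvisor is exporting per-container series (assumes the port mapping above):

curl -s http://localhost:8080/metrics | grep container_memory_working_set_bytes | head -3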
Key cAdvisor metrics:
- container_cpu_usage_seconds_total — CPU consumed
- container_memory_working_set_bytes — actual memory in use
- container_network_receive_bytes_total / container_network_transmit_bytes_total — network I/O
- container_fs_usage_bytes — disk usage
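Two PromQL queries built from these metrics, as a sketch of typical dashboard panels:

# Per-container CPU usage averaged over 5 minutes (in cores)
rate(container_cpu_usage_seconds_total{name!=""}[5m])

# Memory as a fraction of the cgroup limit
# (only meaningful for containers that have a memory limit configured)
container_memory_working_set_bytes{name!=""}
  / container_spec_memory_limit_bytes{name!=""}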
Prometheus — store and query
prometheus:
  image: prom/prometheus:v2.55.0
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    - promdata:/prometheus
  ports:
    - "9090:9090"
  command:
    - --config.file=/etc/prometheus/prometheus.yml
    - --storage.tsdb.retention.time=30d

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'docker-daemon'
    static_configs:
      # On Linux, host.docker.internal needs extra_hosts: ["host.docker.internal:host-gateway"]
      - targets: ['host.docker.internal:9323']

  - job_name: 'app'
    static_configs:
      - targets: ['api:3000']
    metrics_path: /metrics

Prometheus scrapes every 15 seconds and stores 30 days of metrics.
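Before (re)starting, it is worth validating the file. One hedged way to do it: promtool ships inside the prom/prometheus image, so it can be run without installing anything:

docker run --rm --entrypoint promtool \
  -v "$(pwd)/prometheus.yml:/prometheus.yml:ro" \
  prom/prometheus:v2.55.0 check config /prometheus.yml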
Grafana — dashboards
grafana:
  image: grafana/grafana:11.3.0
  ports:
    - "3000:3000"
  environment:
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
  volumes:
    - grafana:/var/lib/grafana

Import pre-built dashboards: "Docker and System Monitoring" (ID: 893), "Docker Container & Host Metrics" (ID: 10619). Five clicks and you have full container observability.
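Instead of clicking through the UI, the Prometheus datasource can be provisioned from a file. A minimal sketch; the host-side filename is your choice, mounted into /etc/grafana/provisioning/datasources/:

# ./grafana-datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true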
Alertmanager — paging
alertmanager:
  image: prom/alertmanager:v0.27.0
  ports:
    - "9093:9093"
  volumes:
    - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro

Define alerts in Prometheus, route through Alertmanager to Slack/PagerDuty/email.
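Two pieces of wiring are easy to forget: Prometheus must load the rule file and know where Alertmanager lives, and Alertmanager needs at least one receiver. Minimal sketches (the Slack webhook URL is a placeholder):

# prometheus.yml additions
rule_files:
  - /etc/prometheus/rules.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# alertmanager.yml
route:
  receiver: slack
receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME  # placeholder
        channel: '#alerts'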
Essential alerts
# rules.yml
groups:
  - name: docker
    rules:
      - alert: ContainerDown
        expr: time() - container_last_seen{name!=""} > 300
        annotations:
          summary: "Container {{ $labels.name }} not seen for 5 minutes"

      - alert: ContainerHighMemory
        expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.9
        for: 10m
        annotations:
          summary: "Container {{ $labels.name }} above 90% memory"

      - alert: ContainerOOMKilled
        expr: increase(container_oom_events_total[5m]) > 0
        annotations:
          summary: "Container {{ $labels.name }} OOM-killed in last 5 minutes"

      - alert: ContainerRestarting
        # changes(), not increase(): container_start_time_seconds is a gauge (a timestamp)
        expr: changes(container_start_time_seconds{name!=""}[15m]) > 2
        annotations:
          summary: "Container {{ $labels.name }} restarted >2 times in 15 minutes"

      - alert: ContainerUnhealthy
        # 0 = unhealthy; requires an exporter that reports Docker health state (cAdvisor does not)
        expr: container_health_status == 0
        for: 2m
        annotations:
          summary: "Container {{ $labels.name }} unhealthy for 2 minutes"

These five cover most production failure modes.
Logs
Metrics tell you something is wrong; logs tell you why. Standard production stacks:
- Loki + Promtail + Grafana — Prometheus-flavored, log labels match metric labels.
- ELK (Elasticsearch + Logstash + Kibana) — heavyweight but powerful search.
- Fluentd / Fluent Bit — log collector, ships to anywhere.
- Vector — modern alternative to fluentd, lower overhead.
At the Docker level, configure log drivers:
services:
api:
image: myapp
logging:
driver: json-file
options:
max-size: "10m"
max-file: "3"
# Also: tag, labels for centralized routingWithout max-size, default json-file logs grow forever and fill the disk.
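To see how much disk the json-file logs already occupy on a host (run as root):

sudo du -h /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h | tail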
Health-driven monitoring
If your containers have healthcheck: defined, the health state can be exported as container_health_status and scraped by Prometheus (note: cAdvisor does not report health natively; you need an exporter that reads Docker's container state). Alert on it. The healthcheck logic itself is your liveness probe.
services:
  api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      retries: 3

No healthcheck = monitoring blind spot.
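One caveat: the test command must exist inside the image (curl often does not in slim images). To see what Docker currently thinks of a container's health ('api' assumes container_name: api; Compose may generate a name like project-api-1):

docker inspect --format '{{.State.Health.Status}}' api   # healthy | unhealthy | starting
docker inspect --format '{{json .State.Health}}' api     # recent probe results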
Common mistakes
No alerts at all
A dashboard nobody watches is not monitoring. Set up Alertmanager and paging from day one. "We will check the dashboard" never works.
Logs without retention or rotation
services:
  api:
    image: myapp
    # default json-file logs grow without bound

After 6 months, /var/lib/docker/containers/<id>/*-json.log can be hundreds of GB. Always set max-size and max-file.
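Rotation can also be set host-wide in daemon.json, so services need no per-service config (note: this only applies to containers created after the daemon restart):

# /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}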
Monitoring host metrics but not container metrics
A host with 50% CPU could be one container at 100% CPU. Per-container metrics let you find the noisy neighbor.
Forgetting to scrape the daemon itself
The Docker daemon exposes Prometheus metrics if enabled (/etc/docker/daemon.json: { "metrics-addr": "0.0.0.0:9323", "experimental": true }). Daemon-level metrics show daemon health, image push/pull rates, etc.
Ignoring restart count
A container with --restart=unless-stopped that flaps every 30 seconds looks "up" most of the time but is broken. Counting changes of container_start_time_seconds (the ContainerRestarting rule above) catches this.
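Two ways to see it: ad hoc via the CLI, or continuously via PromQL.

# Ad hoc ('api' is a placeholder container name):
docker inspect --format '{{.RestartCount}}' api

# Continuously (same expression as the ContainerRestarting rule):
changes(container_start_time_seconds{name!=""}[15m])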
Real-world usage
- Small/medium teams: the Compose-based cAdvisor + Prometheus + Grafana + Loki stack. ~30 minutes to set up, covers 90% of needs.
- Cloud providers: AWS CloudWatch Container Insights, GCP Cloud Monitoring, Azure Container Insights. Managed, no setup, billed per metric.
- Datadog / New Relic / Honeycomb: SaaS APM with Docker integration. Pay for convenience.
- Kubernetes-style: Prometheus Operator + kube-state-metrics + node-exporter. Same idea, K8s-native.
Follow-up questions
Q: Can I use docker stats for production monitoring?
A: No. It is a live snapshot, not a TSDB. Useful for ad-hoc inspection ("why is this slow right now?"), useless for trends or alerts.
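For that ad-hoc case, assuming the Docker CLI on the host:

docker stats --no-stream   # one snapshot instead of a live stream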
Q: Why use cAdvisor instead of just docker stats?
A: cAdvisor exposes a Prometheus endpoint, so metrics get scraped, stored, and queryable historically. docker stats does not.
Q: Does Docker daemon itself expose metrics?
A: Yes, if you enable it: /etc/docker/daemon.json with "metrics-addr": "0.0.0.0:9323". Then scrape host:9323/metrics.
Q: What is the difference between metrics and logs?
A: Metrics are aggregates over time (CPU 75% at t=12:00). Logs are events ("GET /api/users 200 in 12ms at t=12:00:01"). Both needed; metrics for alerts and trends, logs for debugging the cause.
Q: (Senior) How do you correlate metrics, logs, and traces in production Docker?
A: Add labels everywhere: container labels (com.docker.stack=myapp), match those in cAdvisor metrics, propagate via Loki labels for logs, and add OpenTelemetry trace IDs to log lines. Grafana's "Explore" view lets you click a metric spike, jump to logs at that timestamp with the same labels, then jump to a trace by trace ID. The infrastructure: cAdvisor + Prometheus + Loki + Tempo behind one Grafana. The hard part: instrumenting the app well, not the deployment.
Examples
Compose-based monitoring stack
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    privileged: true
    devices: ["/dev/kmsg"]
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    networks: [monitor]

  prometheus:
    image: prom/prometheus:v2.55.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules.yml:/etc/prometheus/rules.yml:ro
      - promdata:/prometheus
    ports: ["9090:9090"]
    networks: [monitor]

  grafana:
    image: grafana/grafana:11.3.0
    ports: ["3001:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
    volumes:
      - grafana:/var/lib/grafana
    networks: [monitor]

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports: ["9093:9093"]
    networks: [monitor]

  loki:
    image: grafana/loki:3.2.0
    ports: ["3100:3100"]
    networks: [monitor]

  promtail:
    image: grafana/promtail:3.2.0
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail.yml:/etc/promtail/config.yml:ro
    networks: [monitor]

volumes:
  promdata:
  grafana:

networks:
  monitor:

Four pillars (metrics from cAdvisor, store in Prometheus, alert via Alertmanager, dashboards in Grafana) plus log shipping (Promtail → Loki). One docker compose up.
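The stack mounts ./promtail.yml but the file itself is not shown above. A minimal sketch that tails Docker's json-file logs into Loki:

# promtail.yml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log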
Setting log rotation everywhere
# In your app stack's compose.yaml
x-logging: &default-logging
  driver: json-file
  options:
    max-size: "10m"
    max-file: "3"
    # 'tag' is only valid on drivers like fluentd/gelf/syslog, not json-file

services:
  api:
    image: myapp
    logging: *default-logging
  worker:
    image: myworker
    logging: *default-logging

YAML anchors apply the same logging config across services. Disk usage stays bounded.
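To verify the anchor actually applied (container name depends on your Compose project; 'api' here assumes container_name: api):

docker inspect --format '{{json .HostConfig.LogConfig}}' api
# illustrative output: {"Type":"json-file","Config":{"max-file":"3","max-size":"10m"}}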
Daemon-level metrics endpoint
# /etc/docker/daemon.json
{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}

sudo systemctl restart docker
curl http://localhost:9323/metrics | head
# # HELP engine_daemon_engine_info ...
# # TYPE engine_daemon_engine_info gauge
# ...

Now Prometheus can scrape the daemon at host:9323 for engine-level metrics.