# How to monitor Docker containers in production?

## Short answer

**Production Docker monitoring is a stack**: cAdvisor scrapes container metrics → Prometheus stores them → Grafana visualizes → Alertmanager pages. Plus log aggregation (Loki, ELK, fluentd) and uptime checks.

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
  prometheus:
    image: prom/prometheus  # scrapes cadvisor:8080
  grafana:
    image: grafana/grafana
```

**Key:** four signals matter — CPU, memory, restart count, healthcheck failures. Alert on all four. Logs are a separate pipeline.

## Answer

**Monitoring Docker in production** means knowing four things at all times: is the container alive, is it healthy, is it using too many resources, is it restarting too often. The tools for this are well-established; the work is wiring them together.

## Theory

### TL;DR

- **Three layers to monitor:** the host (Linux metrics), the Docker daemon, the containers themselves.
- **Standard stack:** cAdvisor (per-container metrics) + Prometheus (TSDB) + Grafana (dashboards) + Alertmanager (paging).
- **Logs are separate:** Loki / ELK / fluentd / Datadog Logs.
- **What to alert on:** unexpected restarts, healthcheck failures, OOM kills, CPU/memory saturation.
- **Critical insight:** container restarts are signals — `docker stats` will not catch a flapping container that runs for 30s and dies.

### What to measure

```
Host level:
- CPU/memory/disk/network at host scale
- Docker daemon uptime

Container level:
- CPU usage (% of limit)
- Memory usage (vs cgroup limit)
- Network I/O
- Block I/O
- Restart count
- Health status (healthy/unhealthy)
- Uptime

Application level (inside the container):
- HTTP latency / errors
- Request rate
- Custom business metrics
```

Docker-specific monitoring covers the first two levels. App-level metrics are exported by the app itself (`/metrics` endpoint, Prometheus scraping).

### The standard stack

#### cAdvisor — per-container metrics

Google's cAdvisor reads cgroup data and exports it as Prometheus metrics:

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    container_name: cadvisor
    privileged: true
    devices:
      - /dev/kmsg
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    restart: unless-stopped
```

Visit `http://localhost:8080` for a quick UI; metrics are at `/metrics`.
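To confirm the exporter is actually emitting data, you can query the endpoint directly. A quick spot-check, assuming the cAdvisor service above is up and published on port 8080:

```bash
# sample a few per-container CPU series from the cAdvisor endpoint
curl -s http://localhost:8080/metrics \
  | grep '^container_cpu_usage_seconds_total' \
  | head -3
```

If this prints nothing, Prometheus will have nothing to scrape either; fix this first.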
Key cAdvisor metrics:

- `container_cpu_usage_seconds_total` — CPU consumed
- `container_memory_working_set_bytes` — actual memory in use
- `container_network_receive_bytes_total` / `container_network_transmit_bytes_total` — network I/O
- `container_fs_usage_bytes` — disk usage

#### Prometheus — store and query

```yaml
prometheus:
  image: prom/prometheus:v2.55.0
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    - ./rules.yml:/etc/prometheus/rules.yml:ro
    - promdata:/prometheus
  ports:
    - "9090:9090"
  command:
    - --config.file=/etc/prometheus/prometheus.yml
    - --storage.tsdb.retention.time=30d
```

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/rules.yml   # alert rules, defined below

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: 'docker-daemon'
    static_configs:
      - targets: ['host.docker.internal:9323']
  - job_name: 'app'
    static_configs:
      - targets: ['api:3000']
    metrics_path: /metrics
```

Prometheus scrapes every 15 seconds and stores 30 days of metrics.

#### Grafana — dashboards

```yaml
grafana:
  image: grafana/grafana:11.3.0
  ports:
    - "3000:3000"
  environment:
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
  volumes:
    - grafana:/var/lib/grafana
```

Import pre-built dashboards: "Docker and System Monitoring" (ID: 893), "Docker Container & Host Metrics" (ID: 10619). Five clicks and you have full container observability.

#### Alertmanager — paging

```yaml
alertmanager:
  image: prom/alertmanager:v0.27.0
  ports:
    - "9093:9093"
  volumes:
    - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
```

Define alerts in Prometheus, route through Alertmanager to Slack/PagerDuty/email. (A minimal `alertmanager.yml` routing sketch follows the health-check subsection below.)

### Essential alerts

```yaml
# rules.yml
groups:
  - name: docker
    rules:
      - alert: ContainerDown
        expr: time() - container_last_seen{name!=""} > 300
        for: 5m
        annotations:
          summary: "Container {{ $labels.name }} not seen for 5 minutes"

      - alert: ContainerHighMemory
        expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.9
        for: 10m
        annotations:
          summary: "Container {{ $labels.name }} above 90% memory"

      - alert: ContainerOOMKilled
        expr: increase(container_oom_events_total[5m]) > 0
        annotations:
          summary: "Container {{ $labels.name }} OOM-killed in last 5 minutes"

      - alert: ContainerRestarting
        # start time is a gauge (a unix timestamp), so count value changes
        # with changes(), not increase()
        expr: changes(container_start_time_seconds[15m]) > 2
        annotations:
          summary: "Container {{ $labels.name }} restarted >2 times in 15 minutes"

      - alert: ContainerUnhealthy
        expr: container_health_status == 0  # 0 = unhealthy
        for: 2m
        annotations:
          summary: "Container {{ $labels.name }} unhealthy for 2 minutes"
```

These five cover most production failure modes.

### Logs

Metrics tell you something is wrong; logs tell you why. Standard production stacks:

- **Loki + Promtail + Grafana** — Prometheus-flavored, log labels match metric labels.
- **ELK** (Elasticsearch + Logstash + Kibana) — heavyweight but powerful search.
- **Fluentd / Fluent Bit** — log collectors that ship to almost any backend.
- **Vector** — modern alternative to fluentd, lower overhead.

At the Docker level, configure log drivers:

```yaml
services:
  api:
    image: myapp
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
        # Also: tag, labels for centralized routing
```

Without `max-size`, the default `json-file` logs grow forever and fill the disk.

### Health-driven monitoring

If your containers have `healthcheck:` defined, alert on the health status. Note that cAdvisor does not export health itself; a per-container metric like `container_health_status` comes from a separate Docker engine exporter. The healthcheck logic itself is your liveness probe.

```yaml
services:
  api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      retries: 3
```

No healthcheck = monitoring blind spot.
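The `alertmanager.yml` mounted earlier is the one file of the stack not shown yet. A minimal routing sketch, assuming a Slack incoming webhook; the URL and channel are placeholders:

```yaml
# alertmanager.yml: minimal sketch. Swap slack_configs for
# pagerduty_configs or email_configs as needed.
route:
  receiver: slack
  group_by: [alertname, name]   # one notification per alert+container
  repeat_interval: 4h

receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder
        channel: '#alerts'
        send_resolved: true
```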
### Common mistakes

**No alerts at all**

A dashboard nobody watches is not monitoring. Set up Alertmanager and paging from day one. "We will check the dashboard" never works.

**Logs without retention or rotation**

```yaml
services:
  api:
    image: myapp
    # default json-file logs grow without bound
```

After 6 months, `/var/lib/docker/containers/<id>/*-json.log` can be hundreds of gigabytes. Always set `max-size` and `max-file`.

**Monitoring host metrics but not container metrics**

A host at 50% CPU could be one container at 100% CPU. Per-container metrics let you find the noisy neighbor.

**Forgetting to scrape the daemon itself**

The Docker daemon exposes Prometheus metrics if enabled (`/etc/docker/daemon.json: { "metrics-addr": "0.0.0.0:9323", "experimental": true }`). Daemon-level metrics show daemon health, image push/pull rates, etc.

**Ignoring restart count**

A container with `--restart=unless-stopped` that flaps every 30 seconds looks "up" most of the time but is broken. The change rate of `container_start_time_seconds` (the `changes()` rule above) catches this.

### Real-world usage

- **Small/medium teams:** the Compose-based cAdvisor + Prometheus + Grafana + Loki stack. ~30 minutes to set up, covers 90% of needs.
- **Cloud providers:** AWS CloudWatch Container Insights, GCP Cloud Monitoring, Azure Container Insights. Managed, no setup, billed per metric.
- **Datadog / New Relic / Honeycomb:** SaaS APM with Docker integration. Pay for convenience.
- **Kubernetes-style:** Prometheus Operator + kube-state-metrics + node-exporter. Same idea, K8s-native.

### Follow-up questions

**Q:** Can I use `docker stats` for production monitoring?
**A:** No. It is a live snapshot, not a TSDB. Useful for ad-hoc inspection ("why is this slow right now?"), useless for trends or alerts.

**Q:** Why use cAdvisor instead of just `docker stats`?
**A:** cAdvisor exposes a Prometheus endpoint, so metrics get scraped, stored, and become queryable historically. `docker stats` does not.

**Q:** Does the Docker daemon itself expose metrics?
**A:** Yes, if you enable it: `/etc/docker/daemon.json` with `"metrics-addr": "0.0.0.0:9323"`. Then scrape `host:9323/metrics`.

**Q:** What is the difference between metrics and logs?
**A:** Metrics are aggregates over time (CPU 75% at t=12:00). Logs are events ("GET /api/users 200 in 12ms at t=12:00:01"). Both are needed: metrics for alerts and trends, logs for debugging the cause.

**Q:** (Senior) How do you correlate metrics, logs, and traces in production Docker?
**A:** Add labels everywhere: container labels (`com.docker.stack=myapp`), match those in cAdvisor metrics, propagate them as Loki labels for logs, and add OpenTelemetry trace IDs to log lines. Grafana's "Explore" view lets you click a metric spike, jump to logs at that timestamp with the same labels, then jump to a trace by trace ID. The infrastructure: cAdvisor + Prometheus + Loki + Tempo behind one Grafana. The hard part is instrumenting the app well, not the deployment.
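To ground that last answer, here is what a correlatable log line can look like. The field names are illustrative, not a required schema; the point is that `trace_id` carries the OpenTelemetry trace ID, so Grafana can link a Loki log line to a Tempo trace:

```json
{
  "ts": "2025-01-15T12:00:01Z",
  "level": "info",
  "msg": "GET /api/users 200 in 12ms",
  "container": "api",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
```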
## Examples

### Compose-based monitoring stack

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    privileged: true
    devices: ["/dev/kmsg"]
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    networks: [monitor]

  prometheus:
    image: prom/prometheus:v2.55.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules.yml:/etc/prometheus/rules.yml:ro
      - promdata:/prometheus
    ports: ["9090:9090"]
    networks: [monitor]

  grafana:
    image: grafana/grafana:11.3.0
    ports: ["3001:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
    volumes:
      - grafana:/var/lib/grafana
    networks: [monitor]

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports: ["9093:9093"]
    networks: [monitor]

  loki:
    image: grafana/loki:3.2.0
    ports: ["3100:3100"]
    networks: [monitor]

  promtail:
    image: grafana/promtail:3.2.0
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail.yml:/etc/promtail/config.yml:ro
    networks: [monitor]

volumes:
  promdata:
  grafana:

networks:
  monitor:
```

Four pillars (metrics from cAdvisor, storage in Prometheus, alerting via Alertmanager, dashboards in Grafana) plus log shipping (Promtail → Loki). One `docker compose up`.

### Setting log rotation everywhere

```yaml
# In your app stack's compose.yaml
x-logging: &default-logging
  driver: json-file
  options:
    max-size: "10m"
    max-file: "3"
    tag: "{{.Name}}"

services:
  api:
    image: myapp
    logging: *default-logging
  worker:
    image: myworker
    logging: *default-logging
```

YAML anchors apply the same logging config across services. Disk usage stays bounded.

### Daemon-level metrics endpoint

In `/etc/docker/daemon.json`:

```json
{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}
```

```bash
sudo systemctl restart docker
curl http://localhost:9323/metrics | head
# # HELP engine_daemon_engine_info ...
# # TYPE engine_daemon_engine_info gauge
# ...
```

Now Prometheus can scrape the daemon at `host:9323` for engine-level metrics.
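### Promtail config for the stack above

The compose stack mounts a `./promtail.yml` that is not shown anywhere in this answer. A minimal sketch, assuming the default `json-file` log driver and the standard container log path:

```yaml
# promtail.yml: minimal sketch for shipping docker json-file logs to Loki
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml   # where promtail remembers read offsets

clients:
  - url: http://loki:3100/loki/api/v1/push   # matches the loki service above

scrape_configs:
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}   # parse the json-file wrapper (log, stream, time)
```

For richer labels (container name, compose service), Promtail's `docker_sd_configs` service discovery is the usual next step; the static file glob here is the simplest thing that works.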