# How to monitor Docker containers in production?

## Short answer

**Production Docker monitoring is a stack**: cAdvisor scrapes container metrics → Prometheus stores them → Grafana visualizes → Alertmanager pages. Plus log aggregation (Loki, ELK, fluentd) and uptime checks.

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
  prometheus:
    image: prom/prometheus  # scrapes cadvisor:8080
  grafana:
    image: grafana/grafana
```

**Key:** four signals matter — CPU, memory, restart count, healthcheck failures. Alert on all four. Logs are a separate pipeline.

## Answer

**Monitoring Docker in production** means knowing four things at all times: is the container alive, is it healthy, is it using too many resources, is it restarting too often. The tools for this are well-established; the work is wiring them together.

## Theory

### TL;DR

- **Three layers to monitor:** the host (Linux metrics), the Docker daemon, the containers themselves.
- **Standard stack:** cAdvisor (per-container metrics) + Prometheus (TSDB) + Grafana (dashboards) + Alertmanager (paging).
- **Logs are separate:** Loki / ELK / fluentd / Datadog Logs.
- **What to alert on:** unexpected restarts, healthcheck failures, OOM kills, CPU/memory saturation.
- **Critical insight:** container restarts are signals — `docker stats` will not catch a flapping container that runs for 30s and dies.

### What to measure

```
Host level:
- CPU/memory/disk/network at host scale
- Docker daemon uptime

Container level:
- CPU usage (% of limit)
- Memory usage (vs cgroup limit)
- Network I/O
- Block I/O
- Restart count
- Health status (healthy/unhealthy)
- Uptime

Application level (inside the container):
- HTTP latency / errors
- Request rate
- Custom business metrics
```

Docker-specific monitoring covers the first two levels. App-level metrics are exported by the app itself (`/metrics` endpoint, Prometheus scraping).

### The standard stack

#### cAdvisor — per-container metrics

Google's cAdvisor reads cgroup data and exports it as Prometheus metrics:

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    container_name: cadvisor
    privileged: true
    devices:
      - /dev/kmsg
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    restart: unless-stopped
```

Visit `http://localhost:8080` for a quick UI; metrics are at `/metrics`.
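To confirm the exporter is actually emitting data, you can query the endpoint directly. A quick spot-check, assuming the cAdvisor service above is up and published on port 8080:

```bash
# sample a few per-container CPU series from the cAdvisor endpoint
curl -s http://localhost:8080/metrics \
  | grep '^container_cpu_usage_seconds_total' \
  | head -3
```

If this prints nothing, Prometheus will have nothing to scrape either; fix this first.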
Key cAdvisor metrics:

- `container_cpu_usage_seconds_total` — CPU consumed
- `container_memory_working_set_bytes` — actual memory in use
- `container_network_receive_bytes_total` / `container_network_transmit_bytes_total` — network I/O
- `container_fs_usage_bytes` — disk usage

#### Prometheus — store and query

```yaml
prometheus:
  image: prom/prometheus:v2.55.0
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    - ./rules.yml:/etc/prometheus/rules.yml:ro
    - promdata:/prometheus
  ports:
    - "9090:9090"
  command:
    - --config.file=/etc/prometheus/prometheus.yml
    - --storage.tsdb.retention.time=30d
```

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/rules.yml   # alert rules, defined below

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: 'docker-daemon'
    static_configs:
      - targets: ['host.docker.internal:9323']
  - job_name: 'app'
    static_configs:
      - targets: ['api:3000']
    metrics_path: /metrics
```

Prometheus scrapes every 15 seconds and stores 30 days of metrics.

#### Grafana — dashboards

```yaml
grafana:
  image: grafana/grafana:11.3.0
  ports:
    - "3000:3000"
  environment:
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
  volumes:
    - grafana:/var/lib/grafana
```

Import pre-built dashboards: "Docker and System Monitoring" (ID: 893), "Docker Container & Host Metrics" (ID: 10619). Five clicks and you have full container observability.

#### Alertmanager — paging

```yaml
alertmanager:
  image: prom/alertmanager:v0.27.0
  ports:
    - "9093:9093"
  volumes:
    - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
```

Define alerts in Prometheus, route through Alertmanager to Slack/PagerDuty/email. (A minimal `alertmanager.yml` routing sketch follows the health-check subsection below.)

### Essential alerts

```yaml
# rules.yml
groups:
  - name: docker
    rules:
      - alert: ContainerDown
        expr: time() - container_last_seen{name!=""} > 300
        for: 5m
        annotations:
          summary: "Container {{ $labels.name }} not seen for 5 minutes"

      - alert: ContainerHighMemory
        expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.9
        for: 10m
        annotations:
          summary: "Container {{ $labels.name }} above 90% memory"

      - alert: ContainerOOMKilled
        expr: increase(container_oom_events_total[5m]) > 0
        annotations:
          summary: "Container {{ $labels.name }} OOM-killed in last 5 minutes"

      - alert: ContainerRestarting
        # start time is a gauge (a unix timestamp), so count value changes
        # with changes(), not increase()
        expr: changes(container_start_time_seconds[15m]) > 2
        annotations:
          summary: "Container {{ $labels.name }} restarted >2 times in 15 minutes"

      - alert: ContainerUnhealthy
        expr: container_health_status == 0  # 0 = unhealthy
        for: 2m
        annotations:
          summary: "Container {{ $labels.name }} unhealthy for 2 minutes"
```

These five cover most production failure modes.

### Logs

Metrics tell you something is wrong; logs tell you why. Standard production stacks:

- **Loki + Promtail + Grafana** — Prometheus-flavored, log labels match metric labels.
- **ELK** (Elasticsearch + Logstash + Kibana) — heavyweight but powerful search.
- **Fluentd / Fluent Bit** — log collectors that ship to almost any backend.
- **Vector** — modern alternative to fluentd, lower overhead.

At the Docker level, configure log drivers:

```yaml
services:
  api:
    image: myapp
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
        # Also: tag, labels for centralized routing
```

Without `max-size`, the default `json-file` logs grow forever and fill the disk.

### Health-driven monitoring

If your containers have `healthcheck:` defined, alert on the health status. Note that cAdvisor does not export health itself; a per-container metric like `container_health_status` comes from a separate Docker engine exporter. The healthcheck logic itself is your liveness probe.

```yaml
services:
  api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      retries: 3
```

No healthcheck = monitoring blind spot.
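The `alertmanager.yml` mounted earlier is the one file of the stack not shown yet. A minimal routing sketch, assuming a Slack incoming webhook; the URL and channel are placeholders:

```yaml
# alertmanager.yml: minimal sketch. Swap slack_configs for
# pagerduty_configs or email_configs as needed.
route:
  receiver: slack
  group_by: [alertname, name]   # one notification per alert+container
  repeat_interval: 4h

receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder
        channel: '#alerts'
        send_resolved: true
```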
### Common mistakes

**No alerts at all**

A dashboard nobody watches is not monitoring. Set up Alertmanager and paging from day one. "We will check the dashboard" never works.

**Logs without retention or rotation**

```yaml
services:
  api:
    image: myapp
    # default json-file logs grow without bound
```

After 6 months, `/var/lib/docker/containers/<id>/*-json.log` can be hundreds of gigabytes. Always set `max-size` and `max-file`.

**Monitoring host metrics but not container metrics**

A host at 50% CPU could be one container at 100% CPU. Per-container metrics let you find the noisy neighbor.

**Forgetting to scrape the daemon itself**

The Docker daemon exposes Prometheus metrics if enabled (`/etc/docker/daemon.json: { "metrics-addr": "0.0.0.0:9323", "experimental": true }`). Daemon-level metrics show daemon health, image push/pull rates, etc.

**Ignoring restart count**

A container with `--restart=unless-stopped` that flaps every 30 seconds looks "up" most of the time but is broken. The change rate of `container_start_time_seconds` (the `changes()` rule above) catches this.

### Real-world usage

- **Small/medium teams:** the Compose-based cAdvisor + Prometheus + Grafana + Loki stack. ~30 minutes to set up, covers 90% of needs.
- **Cloud providers:** AWS CloudWatch Container Insights, GCP Cloud Monitoring, Azure Container Insights. Managed, no setup, billed per metric.
- **Datadog / New Relic / Honeycomb:** SaaS APM with Docker integration. Pay for convenience.
- **Kubernetes-style:** Prometheus Operator + kube-state-metrics + node-exporter. Same idea, K8s-native.

### Follow-up questions

**Q:** Can I use `docker stats` for production monitoring?
**A:** No. It is a live snapshot, not a TSDB. Useful for ad-hoc inspection ("why is this slow right now?"), useless for trends or alerts.

**Q:** Why use cAdvisor instead of just `docker stats`?
**A:** cAdvisor exposes a Prometheus endpoint, so metrics get scraped, stored, and become queryable historically. `docker stats` does not.

**Q:** Does the Docker daemon itself expose metrics?
**A:** Yes, if you enable it: `/etc/docker/daemon.json` with `"metrics-addr": "0.0.0.0:9323"`. Then scrape `host:9323/metrics`.

**Q:** What is the difference between metrics and logs?
**A:** Metrics are aggregates over time (CPU 75% at t=12:00). Logs are events ("GET /api/users 200 in 12ms at t=12:00:01"). Both are needed: metrics for alerts and trends, logs for debugging the cause.

**Q:** (Senior) How do you correlate metrics, logs, and traces in production Docker?
**A:** Add labels everywhere: container labels (`com.docker.stack=myapp`), match those in cAdvisor metrics, propagate them as Loki labels for logs, and add OpenTelemetry trace IDs to log lines. Grafana's "Explore" view lets you click a metric spike, jump to logs at that timestamp with the same labels, then jump to a trace by trace ID. The infrastructure: cAdvisor + Prometheus + Loki + Tempo behind one Grafana. The hard part is instrumenting the app well, not the deployment.
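To ground that last answer, here is what a correlatable log line can look like. The field names are illustrative, not a required schema; the point is that `trace_id` carries the OpenTelemetry trace ID, so Grafana can link a Loki log line to a Tempo trace:

```json
{
  "ts": "2025-01-15T12:00:01Z",
  "level": "info",
  "msg": "GET /api/users 200 in 12ms",
  "container": "api",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
```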
## Examples

### Compose-based monitoring stack

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    privileged: true
    devices: ["/dev/kmsg"]
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    networks: [monitor]

  prometheus:
    image: prom/prometheus:v2.55.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules.yml:/etc/prometheus/rules.yml:ro
      - promdata:/prometheus
    ports: ["9090:9090"]
    networks: [monitor]

  grafana:
    image: grafana/grafana:11.3.0
    ports: ["3001:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
    volumes:
      - grafana:/var/lib/grafana
    networks: [monitor]

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports: ["9093:9093"]
    networks: [monitor]

  loki:
    image: grafana/loki:3.2.0
    ports: ["3100:3100"]
    networks: [monitor]

  promtail:
    image: grafana/promtail:3.2.0
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail.yml:/etc/promtail/config.yml:ro
    networks: [monitor]

volumes:
  promdata:
  grafana:

networks:
  monitor:
```

Four pillars (metrics from cAdvisor, storage in Prometheus, alerting via Alertmanager, dashboards in Grafana) plus log shipping (Promtail → Loki). One `docker compose up`.

### Setting log rotation everywhere

```yaml
# In your app stack's compose.yaml
x-logging: &default-logging
  driver: json-file
  options:
    max-size: "10m"
    max-file: "3"
    tag: "{{.Name}}"

services:
  api:
    image: myapp
    logging: *default-logging
  worker:
    image: myworker
    logging: *default-logging
```

YAML anchors apply the same logging config across services. Disk usage stays bounded.

### Daemon-level metrics endpoint

In `/etc/docker/daemon.json`:

```json
{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}
```

```bash
sudo systemctl restart docker
curl http://localhost:9323/metrics | head
# # HELP engine_daemon_engine_info ...
# # TYPE engine_daemon_engine_info gauge
# ...
```

Now Prometheus can scrape the daemon at `host:9323` for engine-level metrics.
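### Promtail config for the stack above

The compose stack mounts a `./promtail.yml` that is not shown anywhere in this answer. A minimal sketch, assuming the default `json-file` log driver and the standard container log path:

```yaml
# promtail.yml: minimal sketch for shipping docker json-file logs to Loki
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml   # where promtail remembers read offsets

clients:
  - url: http://loki:3100/loki/api/v1/push   # matches the loki service above

scrape_configs:
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}   # parse the json-file wrapper (log, stream, time)
```

For richer labels (container name, compose service), Promtail's `docker_sd_configs` service discovery is the usual next step; the static file glob here is the simplest thing that works.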