How to set up a health check for a Docker container?


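Container health checks are how Docker (and Compose and Swarm on top of it) tell the difference between "the process is up" and "the app is actually working". Without a healthcheck, the only signal is "is PID 1 alive?", which misses most interesting failure modes, such as a deadlocked process that is still running. (Kubernetes does not use Docker healthchecks; it has its own probes.)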

Theory

TL;DR

A healthcheck is a command Docker runs inside the container periodically. Exit 0 = check passed; exit 1 = check failed (exit code 2 is reserved, so don't use it).
Three states: starting (no check has passed yet; failures during start_period don't count), healthy, unhealthy (the check failed retries times in a row).
Set via HEALTHCHECK in Dockerfile, --health-cmd on docker run, or healthcheck: in Compose.
Used by docker ps, by Compose depends_on: service_healthy, by Swarm to decide replica replacement.
Common command: curl -f http://localhost:<port>/health. The catch: that command must exist inside the container image.
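To make the three states concrete, here is a simplified Python model of the transition rules above, fed one check result at a time. It is an illustration, not Docker's implementation: it treats a timed-out check as an ordinary failure and ignores exit-code details, and HealthState and record are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    """Simplified model of Docker's health-status transitions (illustrative only)."""

    retries: int = 3
    start_period: float = 0.0
    status: str = "starting"
    failing_streak: int = 0

    def record(self, passed: bool, elapsed: float) -> str:
        """Record one check result, taken `elapsed` seconds after container start."""
        if passed:
            # Any passing check resets the streak and marks the container healthy.
            self.failing_streak = 0
            self.status = "healthy"
        elif self.status == "starting" and elapsed < self.start_period:
            # Failures inside the start period are ignored, but only until a check has passed.
            pass
        else:
            self.failing_streak += 1
            if self.failing_streak >= self.retries:
                self.status = "unhealthy"
        return self.status


# Example: 10s start period, a check every 5s, and an app that never comes up.
state = HealthState(retries=3, start_period=10)
for t in (5, 10, 15, 20):
    print(t, state.record(passed=False, elapsed=t))
# prints: 5 starting / 10 starting / 15 starting / 20 unhealthy
```

Note that once a check has passed, failures inside the start period count normally, which matches Docker's documented start-period rule.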

Quick example

dockerfile

FROM node:22-alpine
WORKDIR /app
COPY . .
RUN npm ci --omit=dev
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD wget --quiet --tries=1 --spider http://localhost:3000/health || exit 1
CMD ["node", "server.js"]

bash

$ docker run -d --name api myapp
$ docker ps
CONTAINER ID   IMAGE   STATUS                              NAMES
a3f9d2b8c1e4   myapp   Up 30 seconds (healthy)             api
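# STATUS shows (healthy), (unhealthy), or (health: starting)

By default the first check runs roughly one --interval (30s here) after the container starts; until then docker ps reports (health: starting). If that check succeeds, the status flips to (healthy).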

The four flags that matter

--interval=DURATION    # how often to run the check (default 30s)
--timeout=DURATION     # max time the check has to return (default 30s)
--retries=N            # how many failures before unhealthy (default 3)
--start-period=DURATION # grace period at startup; failures here do not count (default 0s)

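For a typical web service, --interval=30s --timeout=3s --retries=3 --start-period=10s is a sensible starting point. With those values, an app that hangs (so every check runs into the timeout) is flagged unhealthy within at most retries × (interval + timeout) = 3 × 33s = 99s. Apps that take 30+ seconds to warm up (JVMs, big Python ML services) need a longer --start-period (60-120s).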
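The arithmetic behind that bound, as a small Python helper. It assumes the documented scheduling (each check starts one interval after the previous check completes) and that the container was already healthy when the app broke; detection_window is a hypothetical name.

```python
from typing import Optional, Tuple


def detection_window(interval: float, timeout: float, retries: int,
                     check_duration: Optional[float] = None) -> Tuple[float, float]:
    """Best/worst-case seconds from 'app breaks' to 'status is unhealthy'.

    Every failing check is assumed to take `check_duration` seconds
    (default: the full timeout, i.e. a hung app).
    """
    d = timeout if check_duration is None else check_duration
    # App breaks just before a check starts: `retries` failed checks,
    # separated by (retries - 1) waits of `interval`.
    best = retries * d + (retries - 1) * interval
    # App breaks just after a check ended: one extra full interval first.
    worst = best + interval
    return best, worst


print(detection_window(30, 3, 3))                    # (69, 99): hung app
print(detection_window(30, 3, 3, check_duration=0))  # (60, 90): instant "connection refused"
```

The gap between the two cases is always one interval, so shortening --interval is the main lever for faster detection; a shorter --timeout only helps when the app hangs rather than refusing connections.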

Compose syntax

yaml

services:
  api:
    image: myapp
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 3s
      retries: 3
      start_period: 10s
  web:
    image: nginx
    depends_on:
      api:
        condition: service_healthy   # wait until api is healthy before starting

depends_on with condition: service_healthy is the killer feature: Compose holds back dependent services until the dependency's healthcheck passes. The short list form (depends_on: [api]) only waits for the container to start, not for the app inside it to be ready.
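Outside Compose (for example in a CI script), you can reproduce the same gate by polling the container's health status. A minimal Python sketch, assuming the docker CLI is on PATH; docker_health and wait_until_healthy are hypothetical helpers, and the status source is injectable so the polling logic can be tested without Docker.

```python
import subprocess
import time
from typing import Callable


def docker_health(container: str) -> str:
    """Health status via the docker CLI: 'starting', 'healthy', 'unhealthy', or '' if none."""
    result = subprocess.run(
        ["docker", "inspect", "--format",
         "{{if .State.Health}}{{.State.Health.Status}}{{end}}", container],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_until_healthy(get_status: Callable[[], str], timeout: float = 60.0,
                       poll: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until healthy (True), or until unhealthy or timed out (False)."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "healthy":
            return True
        if status == "unhealthy":
            return False  # give up, as Compose does when a dependency turns unhealthy
        if status == "":
            raise ValueError("container has no healthcheck to wait for")
        if time.monotonic() >= deadline:
            return False
        sleep(poll)


# Usage (needs Docker): wait_until_healthy(lambda: docker_health("db"), timeout=120)
```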

Three forms of the test command

yaml

# Form 1: CMD (preferred — no shell)
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]

# Form 2: CMD-SHELL (runs via /bin/sh -c, so &&, ||, and env var expansion work)
test: ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]

# Form 3: NONE (disable inherited healthcheck from base image)
test: ["NONE"]

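CMD form runs the binary directly: no extra shell process, and it works in images that have no shell at all (e.g. distroless). CMD-SHELL is needed for env var expansion or shell logic. In Compose, a plain string (test: curl -f http://localhost:3000/health) is treated as CMD-SHELL.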
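The difference is the same one any process launcher exposes. This Python analogy, assuming a POSIX system with /bin/sh and an echo binary on PATH, runs one command both ways; HEALTH_PORT is a made-up variable used only for the demo.

```python
import os
import subprocess

os.environ["HEALTH_PORT"] = "3000"  # hypothetical variable, inherited by the child processes

# Exec form (like CMD): the argv list goes straight to the program, so nothing expands $HEALTH_PORT.
exec_form = subprocess.run(["echo", "port=$HEALTH_PORT"], capture_output=True, text=True)

# Shell form (like CMD-SHELL): the string is handed to /bin/sh -c, which expands the variable.
shell_form = subprocess.run(["/bin/sh", "-c", "echo port=$HEALTH_PORT"],
                            capture_output=True, text=True)

print(exec_form.stdout.strip())   # port=$HEALTH_PORT
print(shell_form.stdout.strip())  # port=3000
```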

Common failure: command not found in container

The most common healthcheck bug:

dockerfile

HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1

but the image is based on Alpine, which doesn't ship curl, so the healthcheck always fails.

Solutions:

Install it: RUN apk add --no-cache curl. On Alpine you can often skip that, because BusyBox wget is already there: wget -q --spider URL
For Node apps, use the runtime already in the image: node -e "require('http').get('http://localhost:3000/health', r => process.exit(r.statusCode===200?0:1))" (no extra package needed)
Distroless images have no shell, curl, or wget: copy a small static /healthcheck binary that exits 0 or 1 into the image and call it with the exec form (HEALTHCHECK CMD ["/healthcheck"])
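The Node one-liner has a direct equivalent for images that already ship a Python interpreter but no curl (for example a slim Python base). A minimal sketch; the script name healthcheck.py, port 8000, and the /health path are assumptions. Like curl -f, it treats any non-2xx final response as a failure, and it never returns the reserved exit code 2.

```python
import http.client
import sys
import urllib.request


def check(url: str, timeout: float = 2.0) -> int:
    """Exit code for Docker: 0 if `url` answers with 2xx, else 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if 200 <= resp.status < 300 else 1
    except (OSError, http.client.HTTPException):
        # HTTPError (4xx/5xx), URLError, refused connections and timeouts are all OSError subclasses.
        return 1


# Run as: python healthcheck.py http://localhost:8000/health
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(check(sys.argv[1]))
```

Wire it up with the exec form, e.g. HEALTHCHECK --timeout=3s CMD ["python", "/app/healthcheck.py", "http://localhost:8000/health"] (the /app path is an assumption). Keeping the script's own timeout below --timeout lets it exit cleanly with 1 instead of being killed.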

Inspecting health

bash

# Current status in ps
docker ps --format 'table {{.Names}}\t{{.Status}}'

# Full health details
docker inspect api --format '{{json .State.Health}}' | jq
# {
#   "Status": "healthy",
#   "FailingStreak": 0,
#   "Log": [
#     { "Start": "...", "End": "...", "ExitCode": 0, "Output": "..." },
#     ...
#   ]
# }

# Watch live
watch 'docker ps --format "table {{.Names}}\t{{.Status}}"'

The Log array keeps the last 5 healthcheck results — invaluable for debugging "why is this unhealthy?".
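If you'd rather not depend on jq, or want to feed the result into a monitoring script, the same JSON is easy to parse. A small Python sketch, assuming the input is exactly what docker inspect --format '{{json .State.Health}}' prints (the literal null when the container has no healthcheck); summarize_health is a hypothetical helper.

```python
import json


def summarize_health(health_json: str) -> str:
    """One status line plus one line per entry in the Log array."""
    health = json.loads(health_json)
    if health is None:  # no healthcheck configured: the template prints null
        return "no healthcheck"
    lines = [f"{health['Status']} (failing streak {health['FailingStreak']})"]
    for entry in health.get("Log") or []:
        output = (entry.get("Output") or "").strip() or "<no output>"
        lines.append(f"  exit={entry['ExitCode']} {output}")
    return "\n".join(lines)
```

Usage: pipe the inspect output into a script that calls summarize_health(sys.stdin.read()).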

Common mistakes

No start_period, healthchecks fail during slow startup

yaml

# WRONG: app needs 60s to warm up; the first 3 failed checks already make it unhealthy
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 5s
  retries: 3
# After ~15s (3 failed checks, 5s apart) the container is unhealthy; under Swarm it gets replaced

# RIGHT: start_period gives a grace window
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 30s
  start_period: 60s    # failures during this window do not count

Healthcheck that depends on a dependency

yaml

# WRONG: api healthcheck pings db; if db is down briefly, api goes unhealthy
test: ["CMD", "sh", "-c", "curl -f http://localhost:3000/health && pg_isready -h db"]

Mix dependencies into the healthcheck and a brief db blip marks your api unhealthy too (and, under Swarm, gets it replaced) even though the api itself is fine. Better: have the healthcheck test only the container's own process, and expose a separate /readiness endpoint if your app needs to gate traffic on dependency health.
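A minimal sketch of that split, using Python's standard-library HTTP server as a stand-in for your app. The /health and /readiness paths follow the text above; make_handler and dependency_ok are hypothetical names, and dependency_ok is a callback you would wire to your real db check. The Docker healthcheck would target /health only.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable


def make_handler(dependency_ok: Callable[[], bool]):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/health":
                # Liveness: only proves this process can still serve a request.
                code = 200
            elif self.path == "/readiness":
                # Readiness: also consults dependencies such as the db.
                code = 200 if dependency_ok() else 503
            else:
                code = 404
            body = b"ok" if code == 200 else b"not ok"
            self.send_response(code)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):  # keep the demo quiet
            pass

    return Handler


# Usage: ThreadingHTTPServer(("0.0.0.0", 8000), make_handler(lambda: True)).serve_forever()
```

With the db down, /health keeps returning 200 (so Docker leaves the container alone) while /readiness returns 503 (so whatever routes traffic can skip it).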

Hitting external URL in healthcheck

dockerfile

HEALTHCHECK CMD curl -f https://api.example.com/health || exit 1

Now your container's health depends on someone else's uptime. Don't.

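Inheriting a healthcheck without realizing it

A HEALTHCHECK in a base image applies to every image built FROM it. Official images such as postgres ship without one, but vendor and in-house base images often include one.

dockerfile

FROM my-org/base:1.0   # hypothetical base image that defines its own HEALTHCHECK
# The inherited check still runs. If it probes something your image no longer serves, you get unhealthy.

# Either set your own:
HEALTHCHECK CMD wget -q --spider http://localhost:8080/health || exit 1
# Or disable the inherited one:
HEALTHCHECK NONE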

Real-world usage

Compose with depends_on: service_healthy: wait for db to be ready before starting api. The biggest practical use.
Swarm orchestration: unhealthy replicas are killed and replaced. Healthcheck is the signal.
Reverse proxy integration: Traefik and nginx-proxy can inspect Docker healthcheck state to route only to healthy containers.
Monitoring dashboards: scrape docker inspect output for health status, alert on unhealthy.

Follow-up questions

Q: What is the difference between Docker healthcheck and Kubernetes liveness/readiness probes?

A: Same idea, different scope. K8s splits liveness (am I alive?) and readiness (am I ready for traffic?), with separate behaviors (liveness restarts; readiness removes from service). Docker has just one combined healthcheck. K8s does not use Docker's healthcheck — it has its own.

Q: What signal does an unhealthy container get?

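A: None. Docker sends no signal and does not restart the container on its own; it only updates the status and emits a health_status event (visible in docker events). Restart policies such as --restart=on-failure don't help either, because the process never exited. Under Swarm, the orchestrator replaces the unhealthy task. With plain docker run or docker compose, you (or your monitor) have to act.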
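For the plain docker run case, "your monitor" can be as small as a periodic watchdog. A minimal Python sketch, assuming the docker CLI is on PATH and using docker ps's health=unhealthy filter; restart_unhealthy and run_docker are hypothetical helpers, and the command runner is injectable so the logic can be tested without Docker.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def run_docker(args: List[str]) -> str:
    """Run a docker CLI command and return its stdout."""
    return subprocess.run(["docker", *args], capture_output=True,
                          text=True, check=True).stdout


def restart_unhealthy(run: Runner = run_docker) -> List[str]:
    """Restart every running container whose health status is 'unhealthy'."""
    names = run(["ps", "--filter", "health=unhealthy", "--format", "{{.Names}}"]).split()
    for name in names:
        run(["restart", name])
    return names


# Usage (needs Docker), e.g. from cron every minute: restart_unhealthy()
```

Restarting is the bluntest possible reaction; a real monitor would usually alert as well, so that a restart loop does not hide a persistent failure.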

Q: Can I have multiple healthchecks?

A: Only one per container. If a Dockerfile contains several HEALTHCHECK instructions, only the last one takes effect, so combine the logic inside one CMD-SHELL command if needed.

Q: How do I disable the inherited healthcheck from a base image?

A: HEALTHCHECK NONE in your Dockerfile, or test: ["NONE"] in Compose.

Q: (Senior) When should the healthcheck do more than curl /health?

A: Add a smarter /health endpoint inside the app that checks real readiness signals (DB connection pool not exhausted, queue not backed up beyond a threshold). Keep the healthcheck command itself simple: the intelligence belongs in the endpoint, not in shell. For services with a long warmup, expose /livez (always returns OK once the process is up) and /readyz (true app readiness) separately. Docker and Swarm run only one healthcheck per container, so point it at the endpoint whose failure should get the task replaced (usually /livez), and let your proxy or load balancer poll /readyz to get a K8s-style readiness gate.
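The "intelligence in the endpoint" idea can be as small as a pure function the /health handler calls. A Python sketch; AppStats, evaluate_health, and the default limit of 1000 queued items are all hypothetical, to be replaced by your app's real metrics and thresholds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AppStats:
    pool_in_use: int  # DB connections currently checked out
    pool_size: int    # maximum connections in the pool
    queue_depth: int  # jobs waiting to be processed


def evaluate_health(stats: AppStats, max_queue_depth: int = 1000) -> Tuple[int, List[str]]:
    """Return (http_status, problems) for the /health endpoint to send back."""
    problems: List[str] = []
    if stats.pool_in_use >= stats.pool_size:
        problems.append(f"db pool exhausted ({stats.pool_in_use}/{stats.pool_size})")
    if stats.queue_depth > max_queue_depth:
        problems.append(f"queue backed up ({stats.queue_depth} > {max_queue_depth})")
    return (503 if problems else 200), problems
```

Keeping the decision in a pure function makes the thresholds unit-testable, and the container's healthcheck stays a plain curl -f, which fails on the 503.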

Examples

Compose stack with healthcheck-gated startup

yaml

services:
  api:
    build: .
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:3000/health || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 15s

  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: dev
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 5
      start_period: 5s

bash

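$ docker compose up -d --wait
[+] Running 2/2
 ✔ Container db   Healthy   1.4s
 ✔ Container api  Healthy   12.3s

Compose starts api only after db reports healthy, so api doesn't race a database that is still initializing and hit "connection refused" on first run. The --wait flag additionally makes the command block until every service, api included, is healthy.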

Node app with built-in healthcheck (no extra packages)

dockerfile

FROM node:22-alpine
WORKDIR /app
COPY . .
RUN npm ci --omit=dev
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD node -e "require('http').get('http://localhost:3000/health', r => process.exit(r.statusCode === 200 ? 0 : 1))"
CMD ["node", "server.js"]

No curl needed — uses the Node runtime that is already there. Smaller image, no extra package.

Watching health logs

bash

$ docker inspect api --format '{{range .State.Health.Log}}{{.End}}: exit={{.ExitCode}} out={{.Output}}{{println}}{{end}}'
2026-04-30T10:00:00Z: exit=0 out=ok
2026-04-30T10:00:30Z: exit=0 out=ok
2026-04-30T10:01:00Z: exit=1 out=connection refused
2026-04-30T10:01:30Z: exit=1 out=connection refused
2026-04-30T10:02:00Z: exit=0 out=ok

The last 5 results, with exit codes and whatever the check wrote to stdout/stderr. Often answers "why is/was this unhealthy?" without further investigation.
