# Zero-downtime deployment approaches with Docker

**Short answer (EN)**

Three families:

1. **Rolling update** — replace tasks one (or N) at a time. Default in Swarm/K8s. Cheap.
2. **Blue-green** — run two full environments; flip traffic atomically. Fast rollback, 2x resources.
3. **Canary** — route a small % to the new version; ramp up if metrics are good. Catches slow-burn issues.

**Cross-cutting requirements:**

- **Healthchecks** — the load balancer must know which container is ready.
- **Graceful shutdown** — handle SIGTERM, drain in-flight requests, then exit.
- **Backward-compatible DB migrations** — expand-then-contract; never breaking changes within a deploy window.
- **Connection draining** — give clients time to finish or reconnect.

```bash
# Swarm rolling update
docker service update \
  --image myorg/api:2.0 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  api
```

**Answer (EN)**

**Zero-downtime deployment** means upgrading a running service without dropping requests, breaking sessions, or returning errors during the rollout. With Docker, three deploy strategies cover most cases: rolling updates, blue-green, and canary. The strategy is half the picture — the other half is healthchecks, graceful shutdown, and DB migration discipline.

## Theory

### TL;DR

- **Rolling update**: gradually replace replicas. Cheapest, slowest cutover, partial state during rollout.
- **Blue-green**: run two full environments; flip traffic at once. Atomic, instant rollback, 2x cost.
- **Canary**: shift a small fraction first, ramp up if healthy. Catches subtle regressions.
- **Required ingredients**:
  - Healthchecks (the LB must know who is ready)
  - Graceful shutdown (handle SIGTERM, finish in-flight, exit)
  - Expand-then-contract DB migrations
  - Connection draining

### Strategy comparison

| Strategy | Atomicity | Rollback speed | Resource cost | Best for |
|---|---|---|---|---|
| Rolling update | Gradual (N at a time) | Slow (re-deploy old) | 1.0-1.2x | Default for stateless services |
| Blue-green | Atomic flip | Instant (flip back) | 2x during deploy | High-confidence releases |
| Canary | Gradual (% traffic) | Stop ramp + drain | 1.05-1.5x | Risky changes, want to catch slow regressions |

### What goes wrong without proper plumbing

- **No healthchecks**: load balancer routes to a container that has not finished startup, returns 502.
- **No graceful shutdown**: in-flight requests get dropped when the old container is killed.
- **Breaking DB migration**: new code expects the new column; old code crashes when the column gets dropped during the cutover.
- **No connection draining**: long-lived connections (WebSockets, HTTP/2) get severed.
- **Wrong restart policy**: replicas crash and never come back.

Fix each piece before strategy choice matters.

### Healthchecks

A healthcheck tells the orchestrator (or load balancer) when a replica is ready to receive traffic.

**Two types:**

1. **Liveness**: "Is the process alive?" If not, restart.
2. **Readiness**: "Is the process ready for traffic?" If not, remove from LB rotation.

Readiness is the one that enables zero-downtime. During startup, readiness should return false until DB connections, caches, and warmup are done.

```dockerfile
HEALTHCHECK --interval=10s --timeout=3s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
```

The app implements `/health` to return 200 only when ready.

### Graceful shutdown

When the orchestrator stops a container, it:

1. Sends SIGTERM.
2. Waits up to `stop_grace_period` (default 10s).
3. Sends SIGKILL.
During the SIGTERM-to-SIGKILL window, the app should:

1. Stop accepting new connections (close the listening socket).
2. Finish in-flight requests.
3. Cleanly shut down DB connections, flush logs, exit.

Go example:

```go
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGTERM)
<-sigs // block until the orchestrator asks us to stop

ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
defer cancel()
server.Shutdown(ctx) // stops listening, waits for in-flight requests
```

Node.js (exit only after the server has finished closing, otherwise in-flight requests still get cut):

```js
process.on('SIGTERM', () => {
  server.close(async () => { // stops accepting new connections
    await pool.end()         // close DB pool
    process.exit(0)          // exit once in-flight requests are done
  })
})
```

Set `stop_grace_period: 30s` if your shutdown can take that long.

### DB migration discipline

**Bad: breaking migration during deploy.** Deploy app v2 and run `ALTER TABLE users DROP COLUMN old_field` simultaneously. App v1 still queries `old_field` and errors during the cutover window. The whole deploy looks broken.

**Good: expand-then-contract over multiple deploys.**

1. **Expand** (deploy 1): add the new structure (new column, new table). v1 still works because the old structure is intact.
2. **Migrate code** (deploy 2): app v2 reads/writes both old and new. v1 and v2 coexist during the cutover.
3. **Contract** (deploy 3): once all v1 is gone, drop the old structure.

Three deploys for one logical change, but each is safe.

## Examples

### Strategy 1: Rolling update (Swarm)

```bash
docker service create \
  --name api \
  --replicas 4 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  --update-monitor 30s \
  --update-max-failure-ratio 0.0 \
  --health-cmd 'curl -f http://localhost:8080/health' \
  --health-interval 10s \
  --health-start-period 30s \
  -p 8080:8080 \
  myorg/api:1.0

# Update to v2
docker service update --image myorg/api:2.0 api
```

**What happens:**

- Swarm stops 1 replica (sends SIGTERM, waits, kills).
- Starts 1 new replica with v2.
- Waits for it to pass the healthcheck.
- Waits the 30s monitor period.
- If healthy: repeats for the next replica.
- If unhealthy: stops and rolls back.
Use `--update-parallelism 2` to update 2 replicas at once (faster, slightly more risk).

### Strategy 2: Blue-green (Compose + reverse proxy)

```yaml
# compose.yaml — blue active
services:
  traefik:
    image: traefik:v3
    command:
      - --providers.docker
      - --entrypoints.web.address=:80
    ports: ["80:80"]
    volumes: ["/var/run/docker.sock:/var/run/docker.sock:ro"]
  api-blue:
    image: myorg/api:1.0
    labels:
      - traefik.enable=true
      - 'traefik.http.routers.api.rule=Host(`api.example.com`)'
      - traefik.http.services.api.loadbalancer.server.port=8080
```

Deploy v2:

```bash
# Bring up green WITHOUT traffic
docker run -d --name api-green --network=trafnet myorg/api:2.0

# Smoke-test green directly (the image's entrypoint is already curl)
docker run --rm --network=trafnet curlimages/curl -f http://api-green:8080/health

# Cutover: switch labels to green
# (most easily done with docker compose and a new file, or via Swarm services)
# Traefik picks up the change in seconds

# Drain in-flight on blue
sleep 30

# Stop blue
docker stop api-blue && docker rm api-blue
```

Rollback:

```bash
# Revert labels back to blue (which is still around)
# Or, if blue was removed:
docker run -d --name api-blue --network=trafnet myorg/api:1.0
# Cut traffic back
```

### Strategy 3: Canary (Traefik weighted routing)

```yaml
# Two services with a weighted load balancer
http:
  services:
    api:
      weighted:
        services:
          - name: api-stable
            weight: 90
          - name: api-canary
            weight: 10
```

```bash
# Deploy canary
docker run -d --name api-canary --network=trafnet \
  --label "traefik.http.routers.canary.rule=Host(\`api.example.com\`)" \
  myorg/api:2.0

# Watch metrics for 30 min
# If healthy, ramp to 50/50, then 0/100
# If problems, set canary weight to 0 and remove
```

Kubernetes / Argo Rollouts / Flagger automate this with metric-driven analysis ("if error rate > 1% over 5 min, roll back").
### Connection draining (Swarm/Compose)

```yaml
services:
  api:
    image: myorg/api:1.0
    stop_grace_period: 30s  # how long to wait between SIGTERM and SIGKILL
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 3s
      start_period: 30s
```

Combined with a graceful-shutdown handler in the app, in-flight requests complete during the 30s window.

### Long-lived connections (WebSockets, gRPC streams)

These do not drain gracefully in 30 seconds — clients hold them indefinitely. Options:

- **Implement reconnect in the client.** Server-side: `Connection: close` for HTTP/1.1, `GOAWAY` for HTTP/2/gRPC, a server-side close for WebSockets. The client reconnects and lands on the new replica.
- **Long grace period**: set `stop_grace_period: 10m` so connections drain naturally over 10 minutes.
- **Sticky pool of "old" replicas** that are out of rotation but keep serving existing connections, while new connections go to new replicas. Trickier to orchestrate.

### Database migrations in production

```sql
-- Step 1 (deploy 1): expand
ALTER TABLE users ADD COLUMN email_canonical VARCHAR(255);
-- Old code: ignores it. New code: writes to both old and new.

-- Backfill (between deploys)
UPDATE users SET email_canonical = LOWER(email) WHERE email_canonical IS NULL;

-- Step 2 (deploy 2): code migrates fully to the new column
-- App reads from email_canonical, writes to both for safety

-- Step 3 (deploy 3): contract
ALTER TABLE users DROP COLUMN email;
```

Three releases. Each safe to deploy. Each safe to roll back.

### Combining strategies

Real teams mix:

- **Rolling update** for most releases (cheap, simple).
- **Blue-green** for high-confidence ones (atomic, easy rollback).
- **Canary** for risky ones (catch slow regressions before all users see them).

## Real-world usage

- **Default microservice deploy**: rolling update with 2-4 replicas, 1 at a time, healthcheck-gated.
- **Quarterly major release**: blue-green for a clean rollback story.
- **Risky feature**: canary at 5% for 24 hours, ramp if metrics are OK.
- **Public-facing API with WebSockets**: long grace period + client reconnect logic + rolling update.
- **Database-heavy services**: expand-then-contract migrations, always.

### Common mistakes

**No healthcheck, or the wrong healthcheck.** A healthcheck that just hits a TCP port is not enough — the app might be listening but not ready. Implement `/health` so it verifies DB, downstream services, and config.

**App ignores SIGTERM.** Many frameworks need explicit signal handlers. A default Node.js process exits immediately on SIGTERM. Add a handler.

**Sticky sessions broken across a deploy.** If sessions live in memory tied to one replica, redeploys log users out. Externalize sessions (Redis, JWT).

**No rollback plan.** "Just push the old image" sounds simple until the schema migration is partially applied. Have a rehearsed rollback before the deploy.

**Confusing deploy strategy with downtime.** A rolling update with no graceful shutdown still has downtime per replica. Strategy + plumbing together = zero downtime.

### Follow-up questions

**Q:** What is `stop_grace_period`?
**A:** The time between SIGTERM and SIGKILL when stopping a container. Set it high enough for graceful shutdown to finish (default 10s; for HTTP services with slow requests, 30-60s).

**Q:** Do healthchecks need to be public?
**A:** No, the orchestrator and LB hit them internally. In fact, a public health endpoint can leak useful information to attackers. Bind it to localhost or a private interface, or require an auth token.

**Q:** How do I know if my deploy was zero-downtime?
**A:** Run a synthetic load test (k6, ab, vegeta) during the deploy and watch the error rate. If it stays at 0% during the rollout, the deploy was zero-downtime.

**Q:** (Senior) How do you handle a deploy that needs a long-running migration?
**A:** Decouple the migration from the deploy. Run the migration job as a one-shot container before deploying the new app version.
The app version that needs the new schema deploys only after the migration completes. Tools like Flyway, Liquibase, and golang-migrate let you script this. Combine with feature flags so the new code paths stay dark until the migration is verified.

**Q:** (Senior) How does observability change with these strategies?
**A:** You need to identify which version is serving any given request. Add the image tag/digest as a metric label and log field. During a canary, you can compare metrics between the stable and canary cohorts (error rate, latency, saturation) and trigger automatic rollback if the canary diverges. This is the core idea behind progressive delivery tools (Flagger, Argo Rollouts): the strategy is automated by metric SLOs, not by humans watching dashboards.