Zero-downtime deployment approaches with Docker
Zero-downtime deployment means upgrading a running service without dropping requests, breaking sessions, or returning errors during the rollout. With Docker, three deploy strategies cover most cases: rolling updates, blue-green, and canary. The strategy is half the picture — the other half is healthchecks, graceful shutdown, and DB migration discipline.
Theory
TL;DR
- Rolling update: gradually replace replicas. Cheapest; slowest cutover; old and new versions serve traffic side by side during the rollout.
- Blue-green: run two full environments; flip traffic at once. Atomic, instant rollback, 2x cost.
- Canary: shift a small fraction first, ramp up if healthy. Catches subtle regressions.
- Required ingredients:
  - Healthchecks (the LB must know who is ready)
  - Graceful shutdown (handle SIGTERM, finish in-flight, exit)
  - Expand-then-contract DB migrations
  - Connection draining
Strategy comparison
| Strategy | Atomicity | Rollback speed | Resource cost | Best for |
|---|---|---|---|---|
| Rolling update | Gradual (N at a time) | Slow (re-deploy old) | 1.0-1.2x | Default for stateless services |
| Blue-green | Atomic flip | Instant (flip back) | 2x during deploy | High-confidence releases |
| Canary | Gradual (% traffic) | Stop ramp + drain | 1.05-1.5x | Risky changes, want to catch slow regressions |
What goes wrong without proper plumbing
- No healthchecks: load balancer routes to a container that has not finished startup, returns 502.
- No graceful shutdown: in-flight requests get dropped when the old container is killed.
- Breaking DB migration: new code expects the new column; old code crashes when the column gets dropped during the cutover.
- No connection draining: long-lived connections (WebSockets, HTTP/2) get severed.
- Wrong restart policy: replicas crash and never come back.
Fix each piece before strategy choice matters.
Healthchecks
A healthcheck tells the orchestrator (or load balancer) when a replica is ready to receive traffic.
Two types:
- Liveness: "Is the process alive?" If not, restart.
- Readiness: "Is the process ready for traffic?" If not, remove from LB rotation.
Readiness is the one that enables zero-downtime. During startup, readiness should return false until DB connections, caches, and warmup are done.
```dockerfile
HEALTHCHECK \
  CMD curl -f http://localhost:8080/health || exit 1
```

The app implements /health to return 200 only when ready.
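On the app side, readiness can be as simple as a flag the /health handler consults. A minimal sketch in Go (names and the startup steps are illustrative, not from the original):

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// ready flips to true once startup work (DB pool, cache warmup) is done.
var ready atomic.Bool

// healthStatus is what /health reports: 503 keeps the replica out of the
// LB rotation, 200 lets traffic in.
func healthStatus() int {
	if ready.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(healthStatus())
}

func main() {
	http.HandleFunc("/health", healthHandler)
	go func() {
		// ... open DB connections, warm caches, then:
		ready.Store(true)
	}()
	http.ListenAndServe(":8080", nil)
}
```

The point is that /health flips to 200 only after the startup goroutine finishes, so the LB never routes to a half-initialized replica.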
Graceful shutdown
When the orchestrator stops a container:
- Sends SIGTERM.
- Waits up to `stop_grace_period` (default 10s).
- Sends SIGKILL.
During SIGTERM-to-SIGKILL window, the app should:
- Stop accepting new connections (close listening socket).
- Finish in-flight requests.
- Cleanly shut down DB connections, flush logs, exit.
Go example:

```go
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGTERM)
<-sigs
ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
defer cancel()
server.Shutdown(ctx) // waits for in-flight requests
```

Node.js:
```js
process.on('SIGTERM', () => {
  server.close(async () => { // stops accepting new; fires once in-flight done
    await pool.end()         // close DB pool
    process.exit(0)
  })
})
```

Set `stop_grace_period: 30s` if your shutdown can take that long.
DB migration discipline
Bad: breaking migration during deploy
Deploy app v2 and run `ALTER TABLE users DROP COLUMN old_field` simultaneously. App v1 still queries old_field and errors during the cutover window, so the whole deploy looks broken.
Good: expand-then-contract over multiple deploys
- Expand (deploy 1): add new structure (new column, new table). v1 still works because old structure is intact.
- Migrate code (deploy 2): app v2 reads/writes both old and new. v1 and v2 coexist during the cutover.
- Contract (deploy 3): once all v1 is gone, drop the old structure.
Three deploys for one logical change, but each is safe.
Examples
Strategy 1: Rolling update (Swarm)
```sh
docker service create \
  --name api \
  --replicas 4 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  --update-monitor 30s \
  --update-max-failure-ratio 0.0 \
  --health-cmd 'curl -f http://localhost:8080/health' \
  --health-interval 10s \
  --health-start-period 30s \
  -p 8080:8080 \
  myorg/api:1.0

# Update to v2
docker service update --image myorg/api:2.0 api
```

What happens:
- Swarm stops 1 replica (sends SIGTERM, waits, kills).
- Starts 1 new replica with v2.
- Waits for it to pass healthcheck.
- Waits 30s monitor period.
- If healthy: repeat for next replica.
- If unhealthy: stop and rollback.
Use --update-parallelism 2 to update 2 at once (faster, slightly more risk).
Strategy 2: Blue-green (Compose + reverse proxy)
```yaml
# compose.yaml — blue active
services:
  traefik:
    image: traefik:v3
    command:
      - --providers.docker
      - --entrypoints.web.address=:80
    ports: ["80:80"]
    volumes: ["/var/run/docker.sock:/var/run/docker.sock:ro"]
  api-blue:
    image: myorg/api:1.0
    labels:
      - traefik.enable=true
      - 'traefik.http.routers.api.rule=Host(`api.example.com`)'
      - traefik.http.services.api.loadbalancer.server.port=8080
```

Deploy v2:
```sh
# Bring up green WITHOUT traffic
docker run -d --name api-green --network=trafnet myorg/api:2.0

# Smoke-test green directly (curl is the image's entrypoint)
docker run --rm --network=trafnet curlimages/curl -f http://api-green:8080/health

# Cutover: switch labels to green
# (most easily done by docker compose with a new file, or via Swarm services)
# Traefik picks up the change in seconds

# Drain in-flight on blue
sleep 30

# Stop blue
docker stop api-blue && docker rm api-blue
```

Rollback:
```sh
# Revert labels back to blue (which is still around)
# Or, if blue was removed:
docker run -d --name api-blue --network=trafnet myorg/api:1.0
# Cut traffic back
```

Strategy 3: Canary (Traefik weighted routing)
```yaml
# Two services with weighted load balancer (Traefik dynamic config)
http:
  services:
    api:
      weighted:
        services:
          - name: api-stable
            weight: 90
          - name: api-canary
            weight: 10
```

```sh
# Deploy canary
docker run -d --name api-canary --network=trafnet \
  --label "traefik.http.routers.canary.rule=Host(\`api.example.com\`)" \
  myorg/api:2.0

# Watch metrics for 30 min
# If healthy, ramp to 50/50, then 0/100
# If problems, set canary weight to 0 and remove
```

Kubernetes / Argo Rollouts / Flagger automate this with metric-driven analysis ("if error rate > 1% over 5 min, roll back").
Connection draining (Swarm/Compose)
```yaml
services:
  api:
    image: myorg/api:1.0
    stop_grace_period: 30s  # how long to wait between SIGTERM and SIGKILL
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 3s
      start_period: 30s
```

Combined with a graceful-shutdown handler in the app, in-flight requests complete during the 30s window.
Long-lived connections (WebSockets, gRPC streams)
These do not gracefully drain in 30 seconds — clients hold them indefinitely.
Options:
- Implement reconnect in the client. Server-side: send `Connection: close` for HTTP/1.1, `GOAWAY` for HTTP/2/gRPC, and a close frame for WebSockets. The client reconnects and lands on the new replica.
- Long grace period: set `stop_grace_period: 10m` so connections drain naturally over 10 minutes.
- Sticky pool of "old" replicas that stay out of rotation but keep serving the existing connections; new connections go to new replicas. Trickier to orchestrate.
Database migrations in production
```sql
-- Step 1 (deploy 1): expand
ALTER TABLE users ADD COLUMN email_canonical VARCHAR(255);
-- Old code: ignores it. New code: writes to both old and new.

-- Backfill (between deploys)
UPDATE users SET email_canonical = LOWER(email) WHERE email_canonical IS NULL;

-- Step 2 (deploy 2): code migrates fully to new column
-- App reads from email_canonical, writes to both for safety

-- Step 3 (deploy 3): contract
ALTER TABLE users DROP COLUMN email;
```

Three releases. Each safe to deploy. Each safe to roll back.
Combining strategies
Real teams mix:
- Rolling update for most releases (cheap, simple).
- Blue-green for high-confidence ones (atomic, easy rollback).
- Canary for risky ones (catch slow regressions before all users see them).
Real-world usage
- Default microservice deploy: rolling update with N=2-4 replicas, 1-at-a-time, healthcheck-gated.
- Quarterly major release: blue-green for clean rollback story.
- Risky feature: canary at 5% for 24 hours, ramp if metrics OK.
- Public-facing API with WebSockets: long grace period + client reconnect logic + rolling update.
- Database-heavy services: expand-then-contract migrations always.
Common mistakes
No healthcheck or wrong healthcheck
A healthcheck that only confirms the TCP port is open is not enough — the app might be listening but not ready. Implement /health so it verifies the DB, downstream services, and config.
App ignores SIGTERM
Many frameworks need explicit signal handlers. Default Node.js process exits immediately on SIGTERM. Add a handler.
Sticky sessions broken across deploy
If sessions live in-memory tied to one replica, redeploys log users out. Externalize sessions (Redis, JWT).
No rollback plan
"Just push the old image" sounds simple until the schema migration is partially applied. Have a rehearsed rollback before the deploy.
Confusing deploy strategy with downtime
A rolling update with no graceful shutdown still has downtime per replica. Strategy + plumbing together = zero downtime.
Follow-up questions
Q: What is stop_grace_period?
A: Time between SIGTERM and SIGKILL when stopping a container. Set high enough for graceful shutdown to finish (default 10s; for HTTP services with slow requests, 30-60s).
Q: Do healthchecks need to be public?
A: No, the orchestrator and LB hit them internally. In fact, a public health endpoint can leak useful info to attackers. Bind to localhost or a private interface, or require an auth token.
Q: How do I know if my deploy was zero-downtime?
A: Run a synthetic load test (k6, ab, vegeta) during the deploy. Watch error rate. If 0% during the rollout, it was zero-downtime.
Q: (Senior) How do you handle a deploy that needs a long-running migration?
A: Decouple migration from deploy. Run the migration job as a one-shot container before deploying the new app version. The app version that needs the new schema deploys only after the migration completes. Tools like Flyway, Liquibase, golang-migrate let you script this. Combine with feature flags so the new code paths stay dark until the migration is verified.
Q: (Senior) How does observability change with these strategies?
A: You need to identify which version is serving any given request. Add the image tag/digest as a metric label and log field. During canary, you can compare metrics between stable and canary cohorts (error rate, latency, saturation) and trigger automatic rollback if the canary diverges. This is the core idea behind progressive delivery tools (Flagger, Argo Rollouts): the strategy is automated by metric SLOs, not by humans watching dashboards.