
Zero-downtime deployment approaches with Docker

Zero-downtime deployment means upgrading a running service without dropping requests, breaking sessions, or returning errors during the rollout. With Docker, three deploy strategies cover most cases: rolling updates, blue-green, and canary. The strategy is half the picture — the other half is healthchecks, graceful shutdown, and DB migration discipline.

Theory

TL;DR

  • Rolling update: gradually replace replicas. Cheapest; slowest cutover; both versions serve traffic during the rollout.
  • Blue-green: run two full environments; flip traffic at once. Atomic, instant rollback, 2x cost.
  • Canary: shift a small fraction first, ramp up if healthy. Catches subtle regressions.
  • Required ingredients:
    • Healthchecks (the LB must know who is ready)
    • Graceful shutdown (handle SIGTERM, finish in-flight, exit)
    • Expand-then-contract DB migrations
    • Connection draining

Strategy comparison

Strategy       | Atomicity              | Rollback speed       | Resource cost    | Best for
Rolling update | Gradual (N at a time)  | Slow (re-deploy old) | 1.0-1.2x         | Default for stateless services
Blue-green     | Atomic flip            | Instant (flip back)  | 2x during deploy | High-confidence releases
Canary         | Gradual (% of traffic) | Stop ramp + drain    | 1.05-1.5x        | Risky changes, catching slow regressions

What goes wrong without proper plumbing

  • No healthchecks: load balancer routes to a container that has not finished startup, returns 502.
  • No graceful shutdown: in-flight requests get dropped when the old container is killed.
  • Breaking DB migration: new code expects the new schema; old code crashes when a column it still uses gets dropped during the cutover.
  • No connection draining: long-lived connections (WebSockets, HTTP/2) get severed.
  • Wrong restart policy: replicas crash and never come back.

Fix each piece before strategy choice matters.

Healthchecks

A healthcheck tells the orchestrator (or load balancer) when a replica is ready to receive traffic.

Two types:

  1. Liveness: "Is the process alive?" If not, restart.
  2. Readiness: "Is the process ready for traffic?" If not, remove from LB rotation.

Readiness is the one that enables zero-downtime. During startup, readiness should return false until DB connections, caches, and warmup are done.

dockerfile
HEALTHCHECK --interval=10s --timeout=3s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

App implements /health to return 200 only when ready.
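
A minimal readiness handler, sketched in Go (the ready flag and the simulated warmup are illustrative, not a prescribed pattern):

go
package main

import (
    "net/http"
    "sync/atomic"
    "time"
)

var ready atomic.Bool // flipped to true once DB connections, caches, and warmup are done

func healthHandler(w http.ResponseWriter, r *http.Request) {
    // Not ready yet: return non-200 so the LB keeps this replica out of rotation.
    if !ready.Load() {
        http.Error(w, "starting", http.StatusServiceUnavailable)
        return
    }
    // In a real service, also verify critical dependencies here (e.g. db.PingContext).
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("ok"))
}

func main() {
    http.HandleFunc("/health", healthHandler)

    // Mark ready only after startup work finishes.
    go func() {
        time.Sleep(5 * time.Second) // placeholder for real warmup
        ready.Store(true)
    }()

    http.ListenAndServe(":8080", nil)
}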

Graceful shutdown

When the orchestrator stops a container:

  1. Sends SIGTERM.
  2. Waits up to stop_grace_period (default 10s).
  3. Sends SIGKILL.

During the SIGTERM-to-SIGKILL window, the app should:

  1. Stop accepting new connections (close listening socket).
  2. Finish in-flight requests.
  3. Cleanly shut down DB connections, flush logs, exit.

Go example:

go
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGTERM)
<-sigs

ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
defer cancel()
server.Shutdown(ctx) // waits for in-flight requests

Node.js:

js
process.on('SIGTERM', () => {
  server.close(async () => {   // stop accepting new; callback runs once in-flight requests finish
    await pool.end()           // close DB pool
    process.exit(0)
  })
})

Set stop_grace_period: 30s if your shutdown can take that long.

DB migration discipline

Bad: breaking migration during deploy

Deploying app v2 and running ALTER TABLE users DROP COLUMN old_field at the same time means app v1, which still queries old_field, throws errors during the cutover window. The whole deploy looks broken.

Good: expand-then-contract over multiple deploys

  1. Expand (deploy 1): add new structure (new column, new table). v1 still works because old structure is intact.
  2. Migrate code (deploy 2): app v2 reads/writes both old and new. v1 and v2 coexist during the cutover.
  3. Contract (deploy 3): once all v1 is gone, drop the old structure.

Three deploys for one logical change, but each is safe.

Examples

Strategy 1: Rolling update (Swarm)

bash
docker service create \
  --name api \
  --replicas 4 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  --update-monitor 30s \
  --update-max-failure-ratio 0.0 \
  --health-cmd 'curl -f http://localhost:8080/health' \
  --health-interval 10s \
  --health-start-period 30s \
  -p 8080:8080 \
  myorg/api:1.0

# Update to v2
docker service update --image myorg/api:2.0 api

What happens:

  • Swarm stops 1 replica (sends SIGTERM, waits, kills).
  • Starts 1 new replica with v2.
  • Waits for it to pass healthcheck.
  • Waits 30s monitor period.
  • If healthy: repeat for next replica.
  • If unhealthy: stop and rollback.

Use --update-parallelism 2 to update 2 at once (faster, slightly more risk).
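
For example, the parallelism can also be changed at update time with the same flag:

bash
# Replace two replicas at a time for this update
docker service update --update-parallelism 2 --image myorg/api:2.0 api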

Strategy 2: Blue-green (Compose + reverse proxy)

yaml
# compose.yaml — blue active
services:
  traefik:
    image: traefik:v3
    command:
      - --providers.docker
      - --entrypoints.web.address=:80
    ports: ["80:80"]
    volumes: ["/var/run/docker.sock:/var/run/docker.sock:ro"]
  api-blue:
    image: myorg/api:1.0
    labels:
      - traefik.enable=true
      - 'traefik.http.routers.api.rule=Host(`api.example.com`)'
      - traefik.http.services.api.loadbalancer.server.port=8080

Deploy v2:

bash
# Bring up green WITHOUT traffic
docker run -d --name api-green --network=trafnet myorg/api:2.0

# Smoke-test green directly (the image's entrypoint is curl)
docker run --rm --network=trafnet curlimages/curl -f http://api-green:8080/health

# Cutover: switch labels to green
# (most easily done with a second Compose file or via Swarm services)
# Traefik picks up the change in seconds

# Drain in-flight on blue
sleep 30

# Stop blue
docker stop api-blue && docker rm api-blue
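
One way to express "switch labels to green", sketched as a second Compose file (the file name is illustrative): give api-green the same api router labels, bring it up, then stop or disable blue once green is verified. While both are enabled, Traefik balances the api service across blue and green, which is acceptable mid-cutover.

yaml
# compose.green.yaml — hypothetical cutover file
services:
  api-green:
    image: myorg/api:2.0
    labels:
      - traefik.enable=true
      - 'traefik.http.routers.api.rule=Host(`api.example.com`)'
      - traefik.http.services.api.loadbalancer.server.port=8080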

Rollback:

bash
# Revert labels back to blue (which is still around)
# Or, if blue was already removed:
docker run -d --name api-blue --network=trafnet myorg/api:1.0
# Cut traffic back

Strategy 3: Canary (Traefik weighted routing)

yaml
# Two services behind a weighted load balancer (Traefik dynamic config)
http:
  services:
    api:
      weighted:
        services:
          - name: api-stable
            weight: 90
          - name: api-canary
            weight: 10

bash
# Deploy canary
docker run -d --name api-canary --network=trafnet \
  --label "traefik.http.routers.canary.rule=Host(\`api.example.com\`)" \
  myorg/api:2.0

# Watch metrics for 30 min
# If healthy, ramp to 50/50, then 0/100
# If problems, set canary weight to 0 and remove

Kubernetes / Argo Rollouts / Flagger automate this with metric-driven analysis ("if error rate > 1% over 5 min, rollback").

Connection draining (Swarm/Compose)

yaml
services:
  api:
    image: myorg/api:1.0
    stop_grace_period: 30s   # how long to wait between SIGTERM and SIGKILL
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 3s
      start_period: 30s

Combined with a graceful-shutdown handler in the app, in-flight requests complete during the 30s window.

Long-lived connections (WebSockets, gRPC streams)

These do not gracefully drain in 30 seconds — clients hold them indefinitely.

Options:

  • Implement reconnect in the client. Server-side, signal the close: Connection: close for HTTP/1.1, GOAWAY for HTTP/2/gRPC, a close frame for WebSockets. The client reconnects and lands on a new replica (see the sketch after this list).
  • Long grace period: set stop_grace_period: 10m so connections drain naturally over 10 minutes.
  • Sticky pool of "old" replicas that are not in rotation but accept the existing connections; new connections go to new replicas. Trickier to orchestrate.
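
A rough server-side sketch of the first option, assuming Go with the gorilla/websocket package (the echo handler and connection registry are illustrative): track open sockets and send a close frame on SIGTERM so clients reconnect and land on a new replica.

go
package main

import (
    "net/http"
    "os"
    "os/signal"
    "sync"
    "syscall"
    "time"

    "github.com/gorilla/websocket"
)

var (
    upgrader = websocket.Upgrader{}
    mu       sync.Mutex
    conns    = map[*websocket.Conn]struct{}{} // registry of open sockets
)

func wsHandler(w http.ResponseWriter, r *http.Request) {
    c, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        return
    }
    mu.Lock()
    conns[c] = struct{}{}
    mu.Unlock()
    defer func() {
        mu.Lock()
        delete(conns, c)
        mu.Unlock()
        c.Close()
    }()
    for { // trivial echo loop, just to keep the sketch self-contained
        mt, msg, err := c.ReadMessage()
        if err != nil {
            return
        }
        c.WriteMessage(mt, msg)
    }
}

func main() {
    http.HandleFunc("/ws", wsHandler)
    go http.ListenAndServe(":8080", nil)

    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM)
    <-sigs

    // Tell every client to go away; well-behaved clients reconnect to a new replica.
    mu.Lock()
    for c := range conns {
        c.WriteControl(websocket.CloseMessage,
            websocket.FormatCloseMessage(websocket.CloseGoingAway, "redeploying"),
            time.Now().Add(time.Second))
        c.Close()
    }
    mu.Unlock()
}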

Database migrations in production

sql
-- Step 1 (deploy 1): expand
ALTER TABLE users ADD COLUMN email_canonical VARCHAR(255);
-- Old code: ignores it. New code: writes to both old and new.

-- Backfill (between deploys)
UPDATE users SET email_canonical = LOWER(email) WHERE email_canonical IS NULL;

-- Step 2 (deploy 2): code migrates fully to new column
-- App reads from email_canonical, writes to both for safety

-- Step 3 (deploy 3): contract
ALTER TABLE users DROP COLUMN email;

Three releases. Each safe to deploy. Each safe to rollback.

Combining strategies

Real teams mix:

  • Rolling update for most releases (cheap, simple).
  • Blue-green for high-confidence ones (atomic, easy rollback).
  • Canary for risky ones (catch slow regressions before all users see them).

Real-world usage

  • Default microservice deploy: rolling update with N=2-4 replicas, 1-at-a-time, healthcheck-gated.
  • Quarterly major release: blue-green for clean rollback story.
  • Risky feature: canary at 5% for 24 hours, ramp if metrics OK.
  • Public-facing API with WebSockets: long grace period + client reconnect logic + rolling update.
  • Database-heavy services: expand-then-contract migrations always.

Common mistakes

No healthcheck or wrong healthcheck

A healthcheck that just probes the TCP port is not enough — the app might be listening but not ready. Implement a /health endpoint that verifies DB, downstream services, and config.

App ignores SIGTERM

Many frameworks need explicit signal handlers. A default Node.js process exits immediately on SIGTERM. Add a handler like the one shown above.

Sticky sessions broken across deploy

If sessions live in-memory tied to one replica, redeploys log users out. Externalize sessions (Redis, JWT).

No rollback plan

"Just push the old image" sounds simple until the schema migration is partially applied. Have a rehearsed rollback before the deploy.

Confusing deploy strategy with downtime

A rolling update with no graceful shutdown still has downtime per replica. Strategy + plumbing together = zero downtime.

Follow-up questions

Q: What is stop_grace_period?


A: Time between SIGTERM and SIGKILL when stopping a container. Set high enough for graceful shutdown to finish (default 10s; for HTTP services with slow requests, 30-60s).

Q: Do healthchecks need to be public?


A: No, the orchestrator and LB hit them internally. In fact, a public health endpoint can leak useful info to attackers. Bind to localhost or a private interface, or require an auth token.

Q: How do I know if my deploy was zero-downtime?


A: Run a synthetic load test (k6, ab, vegeta) during the deploy and watch the error rate. If it stays at 0% for the whole rollout, the deploy was zero-downtime.
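
A minimal sketch with k6 (URL and threshold values are placeholders): keep load on the service for the whole rollout and fail the run if more than 0.1% of requests error.

js
// deploy-check.js — run `k6 run deploy-check.js` while the rollout happens
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 20,
  duration: '5m',
  thresholds: {
    http_req_failed: ['rate<0.001'], // mark the run failed if >0.1% of requests error
  },
};

export default function () {
  const res = http.get('http://api.example.com/');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(0.1);
}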

Q: (Senior) How do you handle a deploy that needs a long-running migration?


A: Decouple migration from deploy. Run the migration job as a one-shot container before deploying the new app version. The app version that needs the new schema deploys only after the migration completes. Tools like Flyway, Liquibase, golang-migrate let you script this. Combine with feature flags so the new code paths stay dark until the migration is verified.
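
As a sketch with the golang-migrate CLI image (network name, paths, and connection string are placeholders):

bash
# One-shot migration container, run and verified before deploying app v2
docker run --rm --network=prodnet \
  -v "$PWD/migrations:/migrations" \
  migrate/migrate \
  -path=/migrations -database "postgres://app:secret@db:5432/app?sslmode=disable" up

# Only after this exits 0: roll out myorg/api:2.0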

Q: (Senior) How does observability change with these strategies?


A: You need to identify which version is serving any given request. Add the image tag/digest as a metric label and log field. During canary, you can compare metrics between stable and canary cohorts (error rate, latency, saturation) and trigger automatic rollback if the canary diverges. This is the core idea behind progressive delivery tools (Flagger, Argo Rollouts): the strategy is automated by metric SLOs, not by humans watching dashboards.
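
A tiny sketch of that first step in Go with Prometheus client_golang (the IMAGE_TAG environment variable is an assumed convention): label request metrics with the serving image version so stable and canary cohorts can be compared.

go
package main

import (
    "net/http"
    "os"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// version is injected at deploy time, e.g. IMAGE_TAG=2.0 (assumed convention).
var version = os.Getenv("IMAGE_TAG")

var requests = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Requests served, labeled with the serving image version.",
}, []string{"version", "path"})

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        requests.WithLabelValues(version, r.URL.Path).Inc()
        w.Write([]byte("ok"))
    })
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}

During a canary, the same query split by the version label shows whether 2.0 diverges from 1.0 on error rate or latency.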
