# Zero-downtime deployment approaches with Docker

**Short answer (EN)**

Three families:

1. **Rolling update** — replace tasks one (or N) at a time. Default in Swarm/K8s. Cheap.
2. **Blue-green** — run two full environments; flip traffic atomically. Fast rollback, 2x resources.
3. **Canary** — route a small % to the new version; ramp up if metrics are good. Catches slow-burn issues.

**Cross-cutting requirements:**

- **Healthchecks** — the load balancer must know which container is ready.
- **Graceful shutdown** — handle SIGTERM, drain in-flight requests, then exit.
- **Backward-compatible DB migrations** — expand-then-contract; never breaking changes within a deploy window.
- **Connection draining** — give clients time to finish or reconnect.

```bash
# Swarm rolling update
docker service update \
  --image myorg/api:2.0 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  api
```

**Answer (EN)**

**Zero-downtime deployment** means upgrading a running service without dropping requests, breaking sessions, or returning errors during the rollout. With Docker, three deploy strategies cover most cases: rolling updates, blue-green, and canary. The strategy is half the picture — the other half is healthchecks, graceful shutdown, and DB migration discipline.

## Theory

### TL;DR

- **Rolling update**: gradually replace replicas. Cheapest, slowest cutover, partial state during rollout.
- **Blue-green**: run two full environments; flip traffic at once. Atomic, instant rollback, 2x cost.
- **Canary**: shift a small fraction first, ramp up if healthy. Catches subtle regressions.
- **Required ingredients**:
  - Healthchecks (the LB must know who is ready)
  - Graceful shutdown (handle SIGTERM, finish in-flight, exit)
  - Expand-then-contract DB migrations
  - Connection draining

### Strategy comparison

| Strategy | Atomicity | Rollback speed | Resource cost | Best for |
|---|---|---|---|---|
| Rolling update | Gradual (N at a time) | Slow (re-deploy old) | 1.0-1.2x | Default for stateless services |
| Blue-green | Atomic flip | Instant (flip back) | 2x during deploy | High-confidence releases |
| Canary | Gradual (% traffic) | Stop ramp + drain | 1.05-1.5x | Risky changes, want to catch slow regressions |

### What goes wrong without proper plumbing

- **No healthchecks**: load balancer routes to a container that has not finished startup, returns 502.
- **No graceful shutdown**: in-flight requests get dropped when the old container is killed.
- **Breaking DB migration**: new code expects the new column; old code crashes when the column gets dropped during the cutover.
- **No connection draining**: long-lived connections (WebSockets, HTTP/2) get severed.
- **Wrong restart policy**: replicas crash and never come back.

Fix each piece before strategy choice matters.

### Healthchecks

A healthcheck tells the orchestrator (or load balancer) when a replica is ready to receive traffic.

**Two types:**

1. **Liveness**: "Is the process alive?" If not, restart.
2. **Readiness**: "Is the process ready for traffic?" If not, remove from LB rotation.

Readiness is the one that enables zero-downtime. During startup, readiness should return false until DB connections, caches, and warmup are done.

```dockerfile
HEALTHCHECK --interval=10s --timeout=3s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
```

The app implements `/health` to return 200 only when ready.

### Graceful shutdown

When the orchestrator stops a container, it:

1. Sends SIGTERM.
2. Waits up to `stop_grace_period` (default 10s).
3. Sends SIGKILL.
During the SIGTERM-to-SIGKILL window, the app should:

1. Stop accepting new connections (close the listening socket).
2. Finish in-flight requests.
3. Cleanly shut down DB connections, flush logs, exit.

Go example:

```go
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGTERM)
<-sigs // block until the orchestrator asks us to stop

ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
defer cancel()
server.Shutdown(ctx) // stops listening, waits for in-flight requests
```

Node.js (exit only after the server has finished closing, otherwise in-flight requests still get cut):

```js
process.on('SIGTERM', () => {
  server.close(async () => { // stops accepting new connections
    await pool.end()         // close DB pool
    process.exit(0)          // exit once in-flight requests are done
  })
})
```

Set `stop_grace_period: 30s` if your shutdown can take that long.

### DB migration discipline

**Bad: breaking migration during deploy.** Deploy app v2 and run `ALTER TABLE users DROP COLUMN old_field` simultaneously. App v1 still queries `old_field` and errors during the cutover window. The whole deploy looks broken.

**Good: expand-then-contract over multiple deploys.**

1. **Expand** (deploy 1): add the new structure (new column, new table). v1 still works because the old structure is intact.
2. **Migrate code** (deploy 2): app v2 reads/writes both old and new. v1 and v2 coexist during the cutover.
3. **Contract** (deploy 3): once all v1 is gone, drop the old structure.

Three deploys for one logical change, but each is safe.

## Examples

### Strategy 1: Rolling update (Swarm)

```bash
docker service create \
  --name api \
  --replicas 4 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  --update-monitor 30s \
  --update-max-failure-ratio 0.0 \
  --health-cmd 'curl -f http://localhost:8080/health' \
  --health-interval 10s \
  --health-start-period 30s \
  -p 8080:8080 \
  myorg/api:1.0

# Update to v2
docker service update --image myorg/api:2.0 api
```

**What happens:**

- Swarm stops 1 replica (sends SIGTERM, waits, kills).
- Starts 1 new replica with v2.
- Waits for it to pass the healthcheck.
- Waits the 30s monitor period.
- If healthy: repeats for the next replica.
- If unhealthy: stops and rolls back.
Use `--update-parallelism 2` to update 2 replicas at once (faster, slightly more risk).

### Strategy 2: Blue-green (Compose + reverse proxy)

```yaml
# compose.yaml — blue active
services:
  traefik:
    image: traefik:v3
    command:
      - --providers.docker
      - --entrypoints.web.address=:80
    ports: ["80:80"]
    volumes: ["/var/run/docker.sock:/var/run/docker.sock:ro"]
  api-blue:
    image: myorg/api:1.0
    labels:
      - traefik.enable=true
      - 'traefik.http.routers.api.rule=Host(`api.example.com`)'
      - traefik.http.services.api.loadbalancer.server.port=8080
```

Deploy v2:

```bash
# Bring up green WITHOUT traffic
docker run -d --name api-green --network=trafnet myorg/api:2.0

# Smoke-test green directly (the image's entrypoint is already curl)
docker run --rm --network=trafnet curlimages/curl -f http://api-green:8080/health

# Cutover: switch labels to green
# (most easily done with docker compose and a new file, or via Swarm services)
# Traefik picks up the change in seconds

# Drain in-flight on blue
sleep 30

# Stop blue
docker stop api-blue && docker rm api-blue
```

Rollback:

```bash
# Revert labels back to blue (which is still around)
# Or, if blue was removed:
docker run -d --name api-blue --network=trafnet myorg/api:1.0
# Cut traffic back
```

### Strategy 3: Canary (Traefik weighted routing)

```yaml
# Two services with a weighted load balancer
http:
  services:
    api:
      weighted:
        services:
          - name: api-stable
            weight: 90
          - name: api-canary
            weight: 10
```

```bash
# Deploy canary
docker run -d --name api-canary --network=trafnet \
  --label "traefik.http.routers.canary.rule=Host(\`api.example.com\`)" \
  myorg/api:2.0

# Watch metrics for 30 min
# If healthy, ramp to 50/50, then 0/100
# If problems, set canary weight to 0 and remove
```

Kubernetes / Argo Rollouts / Flagger automate this with metric-driven analysis ("if error rate > 1% over 5 min, roll back").
### Connection draining (Swarm/Compose)

```yaml
services:
  api:
    image: myorg/api:1.0
    stop_grace_period: 30s  # how long to wait between SIGTERM and SIGKILL
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 3s
      start_period: 30s
```

Combined with a graceful-shutdown handler in the app, in-flight requests complete during the 30s window.

### Long-lived connections (WebSockets, gRPC streams)

These do not drain gracefully in 30 seconds — clients hold them indefinitely. Options:

- **Implement reconnect in the client.** Server-side: `Connection: close` for HTTP/1.1, `GOAWAY` for HTTP/2/gRPC, a server-side close for WebSockets. The client reconnects and lands on the new replica.
- **Long grace period**: set `stop_grace_period: 10m` so connections drain naturally over 10 minutes.
- **Sticky pool of "old" replicas** that are out of rotation but keep serving existing connections, while new connections go to new replicas. Trickier to orchestrate.

### Database migrations in production

```sql
-- Step 1 (deploy 1): expand
ALTER TABLE users ADD COLUMN email_canonical VARCHAR(255);
-- Old code: ignores it. New code: writes to both old and new.

-- Backfill (between deploys)
UPDATE users SET email_canonical = LOWER(email) WHERE email_canonical IS NULL;

-- Step 2 (deploy 2): code migrates fully to the new column
-- App reads from email_canonical, writes to both for safety

-- Step 3 (deploy 3): contract
ALTER TABLE users DROP COLUMN email;
```

Three releases. Each safe to deploy. Each safe to roll back.

### Combining strategies

Real teams mix:

- **Rolling update** for most releases (cheap, simple).
- **Blue-green** for high-confidence ones (atomic, easy rollback).
- **Canary** for risky ones (catch slow regressions before all users see them).

## Real-world usage

- **Default microservice deploy**: rolling update with 2-4 replicas, 1 at a time, healthcheck-gated.
- **Quarterly major release**: blue-green for a clean rollback story.
- **Risky feature**: canary at 5% for 24 hours, ramp if metrics are OK.
- **Public-facing API with WebSockets**: long grace period + client reconnect logic + rolling update.
- **Database-heavy services**: expand-then-contract migrations, always.

### Common mistakes

**No healthcheck, or the wrong healthcheck.** A healthcheck that just hits a TCP port is not enough — the app might be listening but not ready. Implement `/health` so it verifies DB, downstream services, and config.

**App ignores SIGTERM.** Many frameworks need explicit signal handlers. A default Node.js process exits immediately on SIGTERM. Add a handler.

**Sticky sessions broken across a deploy.** If sessions live in memory tied to one replica, redeploys log users out. Externalize sessions (Redis, JWT).

**No rollback plan.** "Just push the old image" sounds simple until the schema migration is partially applied. Have a rehearsed rollback before the deploy.

**Confusing deploy strategy with downtime.** A rolling update with no graceful shutdown still has downtime per replica. Strategy + plumbing together = zero downtime.

### Follow-up questions

**Q:** What is `stop_grace_period`?
**A:** The time between SIGTERM and SIGKILL when stopping a container. Set it high enough for graceful shutdown to finish (default 10s; for HTTP services with slow requests, 30-60s).

**Q:** Do healthchecks need to be public?
**A:** No, the orchestrator and LB hit them internally. In fact, a public health endpoint can leak useful information to attackers. Bind it to localhost or a private interface, or require an auth token.

**Q:** How do I know if my deploy was zero-downtime?
**A:** Run a synthetic load test (k6, ab, vegeta) during the deploy and watch the error rate. If it stays at 0% during the rollout, the deploy was zero-downtime.

**Q:** (Senior) How do you handle a deploy that needs a long-running migration?
**A:** Decouple the migration from the deploy. Run the migration job as a one-shot container before deploying the new app version.
The app version that needs the new schema deploys only after the migration completes. Tools like Flyway, Liquibase, and golang-migrate let you script this. Combine with feature flags so the new code paths stay dark until the migration is verified.

**Q:** (Senior) How does observability change with these strategies?
**A:** You need to identify which version is serving any given request. Add the image tag/digest as a metric label and log field. During a canary, you can compare metrics between the stable and canary cohorts (error rate, latency, saturation) and trigger automatic rollback if the canary diverges. This is the core idea behind progressive delivery tools (Flagger, Argo Rollouts): the strategy is automated by metric SLOs, not by humans watching dashboards.