How to perform a rolling update in Docker Swarm?
Rolling updates in Docker Swarm replace running tasks one batch at a time, waiting for health between batches. Swarm's built-in machinery is genuinely good at this, with automatic rollback on failure as a first-class feature.
Theory
TL;DR
- docker service update with --image is the trigger. Swarm replaces tasks per the update_config policy.
- Key parameters: parallelism (how many at a time), delay (between batches), monitor (how long to watch each batch), failure-action (continue/pause/rollback).
- Order: stop-first (default, brief gap per task) or start-first (zero downtime if the app supports concurrent old/new).
- Rollback is a single command (docker service rollback) or automatic on failure.
- A healthcheck on the service is what makes "failure" detectable. Without it, Swarm assumes started = healthy.
The update flow
Service: 6 replicas of api:1.0
--update-parallelism=2 --update-delay=30s

t=0:  [v1.0 v1.0 v1.0 v1.0 v1.0 v1.0]  issue update
t=0:  [STOP STOP v1.0 v1.0 v1.0 v1.0]  stop 2 (or start-first: extra v1.1 tasks spawned)
t=5:  [v1.1 v1.1 v1.0 v1.0 v1.0 v1.0]  2 new tasks healthy
t=35: [v1.1 v1.1 STOP STOP v1.0 v1.0]  after the 30s delay, next batch
t=40: [v1.1 v1.1 v1.1 v1.1 v1.0 v1.0]
t=70: [v1.1 v1.1 v1.1 v1.1 v1.1 v1.1]  done

During the update, traffic continues to whichever replicas are healthy.
Imperative form (CLI)
docker service update \
--image myorg/api:1.1 \
--update-parallelism 1 \
--update-delay 30s \
--update-monitor 30s \
--update-failure-action rollback \
--update-max-failure-ratio 0.2 \
--update-order start-first \
  api

What each flag does:
- --update-parallelism N: replace N tasks at a time (default 1).
- --update-delay 30s: wait between batches.
- --update-monitor 30s: watch each batch for failures for this long.
- --update-failure-action <continue|pause|rollback>: what to do on failure.
- --update-max-failure-ratio 0.2: at most 20% of tasks may fail before triggering the action.
- --update-order <stop-first|start-first>: replace by stopping the old task first, or starting the new one first.
Declarative form (stack file)
version: '3.9'
services:
  api:
    image: myorg/api:1.0
    deploy:
      replicas: 6
      update_config:
        parallelism: 1
        delay: 30s
        order: start-first
        failure_action: rollback
        monitor: 30s
        max_failure_ratio: 0.2
      rollback_config:
        parallelism: 2
        delay: 5s
        failure_action: pause

docker stack deploy -c stack.yaml mystack
# Edit the image to 1.1, redeploy → triggers a rolling update with the config above.

The stack file is the canonical place for this configuration: version-controlled and reviewable.
Health-driven gating
Swarm decides "is this batch healthy?" by:
- The container started successfully (no exit during the monitor period).
- If a healthcheck is defined, the container reports healthy.
- No more than max_failure_ratio of tasks fail in the batch.
Without a healthcheck, Swarm only knows "the process started". An app that starts but immediately misbehaves still counts as "healthy" to Swarm. Healthchecks are essential for safe rolling updates.
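A minimal healthcheck sketch that gives Swarm a real readiness signal. The port, the /health endpoint, and the presence of curl inside the image are assumptions for illustration; substitute whatever readiness probe your app exposes.

```yaml
services:
  api:
    image: myorg/api:1.0
    healthcheck:
      # hypothetical endpoint; replace with your app's readiness URL
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s      # probe every 10 seconds
      timeout: 3s        # a probe with no response after 3s counts as failed
      retries: 3         # 3 consecutive failures → task marked unhealthy
      start_period: 15s  # grace period before failures count
```

With this in place, "healthy" means the app actually answers, not merely that the process started.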
Rollback
# Manual rollback at any time
docker service rollback api
# Reverts to the previously deployed service spec (old image and settings)

With failure_action: rollback, Swarm rolls back automatically when the failure ratio is exceeded. Combined with monitor, you get "if 1 of 5 tasks in the new batch is unhealthy after 30 seconds, roll back the whole service" semantics.
start-first vs stop-first
order: stop-first   # default: brief gap per task
order: start-first  # spin up new alongside old, then drain old

start-first is the path to true zero downtime, but it requires the app to tolerate brief overlap (two versions running at once). For stateless web/API services, that's fine. Workers with strict singleton semantics may need code changes.
Common mistakes
Updating without a healthcheck
services:
  api:
    image: myorg/api
    # NO healthcheck → Swarm cannot detect bad versions

Without a healthcheck, a broken new image rolls out to all replicas before the failure becomes visible. Add a healthcheck: block so Swarm gates progression on actual app readiness.
Setting parallelism too high
update_config:
  parallelism: 5  # 5 of the 6 replicas replaced at once

During the brief replacement window you have very few healthy tasks, and a spike in load means a pile-up. Lower parallelism is safer.
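A safer sketch for the same 6-replica service (values illustrative): cap concurrent replacements so most replicas keep serving, or use start-first to avoid losing capacity at all.

```yaml
update_config:
  parallelism: 2      # at most 2 of 6 replicas in flight → 4 still serving
  delay: 30s
  order: start-first  # new task comes up before the old one is stopped
```

With stop-first and parallelism 5, only one task would serve traffic during the window; this config never drops below four.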
Forgetting rollback_config
The rollback uses its own configuration block. If you only set update_config, rollback uses defaults (often slower than you want). Define rollback_config explicitly.
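An explicit rollback_config sketch (values are illustrative assumptions): typically you want rollback to be faster than the forward update, and to pause rather than loop if the rollback itself fails.

```yaml
rollback_config:
  parallelism: 3         # roll back faster than the forward update
  delay: 5s
  monitor: 30s
  failure_action: pause  # if rollback also fails, stop and let a human look
```

Without this block, rollback falls back to defaults, which may be slower than the incident demands.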
Image tag still latest for --rollback
docker service rollback api
No previous image to roll back to: same tagIf both new and old were tagged latest, Swarm cannot distinguish them. Always tag with a version (or commit SHA) so rollback works.
Real-world usage
- Production deploys on Swarm clusters: every new image triggers a service update; Swarm handles parallelism + monitoring.
- Staged canary: first deploy 1 of 10 with parallelism=1 and a long monitor; if it stabilizes, raise parallelism for the rest.
- Hotfix rollouts: service update --image hotfix:1.0 with high parallelism (faster) and aggressive monitoring (catch failures fast).
- Database migrations: never with a rolling update directly. Run a one-off migrator service first, then update the app replicas.
Follow-up questions
Q: What happens to in-flight requests during a task replacement?
A: Tasks scheduled for replacement get SIGTERM and the configured grace period (stop_grace_period). Apps should drain in-flight requests before exiting. Combined with the routing mesh, traffic is steered away from stopping tasks before SIGTERM.
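A sketch of the relevant knobs for graceful draining (the 30s value is an illustrative assumption; the default grace period is 10s):

```yaml
services:
  api:
    stop_grace_period: 30s   # time between SIGTERM and SIGKILL
    deploy:
      update_config:
        order: start-first   # new task is healthy before the old one gets SIGTERM
```

The app itself must handle SIGTERM by finishing in-flight requests and exiting within the grace period; otherwise it is killed mid-request.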
Q: Can I update multiple services together?
A: Edit the stack file with new images for each, then docker stack deploy -c stack.yaml mystack. Each service updates independently per its own config; you do not get cross-service ordering.
Q: How is Swarm rolling update different from K8s rolling update?
A: Conceptually identical. K8s deployment: maxSurge, maxUnavailable ≈ Swarm's parallelism and order. K8s readiness probes ≈ Swarm healthchecks. Same model, different syntax.
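For comparison, a rough Kubernetes equivalent of Swarm's parallelism plus start-first order (names and values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # ~ start-first: one extra new pod at a time
      maxUnavailable: 0  # never drop below the desired replica count
```

maxSurge/maxUnavailable together express what Swarm splits across parallelism and order; readiness probes play the role of Swarm healthchecks in gating each step.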
Q: What is the difference between update_config and rollback_config?
A: update_config controls forward updates (1.0 → 1.1). rollback_config controls reverse updates (1.1 → 1.0). Often you want a slower, safer rollback than the forward update.
Q: (Senior) How would you design rolling-update parameters for a service that takes 90 seconds to warm up?
A: Set start_period in the healthcheck to 120s (give warmup time before failures count). Set --update-monitor to 180s (wait long enough for real failures to emerge). Use parallelism = 1 (slow rollout; total update time ≈ replicas × ~90s warmup). Use failure_action = rollback. The pattern: monitor period > healthcheck start period > the app's actual warmup time. Faster rollouts hide warmup-related failures; this conservative config catches them.
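The answer above as a stack-file sketch (service name, port, and endpoint are hypothetical):

```yaml
services:
  worker:
    image: myorg/worker:2.0   # hypothetical service with ~90s warmup
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]  # assumed endpoint
      interval: 10s
      start_period: 120s       # failures during warmup don't count
    deploy:
      replicas: 4
      update_config:
        parallelism: 1         # one task at a time
        monitor: 180s          # longer than start_period, so real failures surface
        failure_action: rollback
        order: start-first
```

The key relationship is monitor (180s) > start_period (120s) > warmup (~90s); shrinking any of these gaps risks promoting a task that has not yet shown its true behavior.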
Examples
Production-quality rollout
version: '3.9'
services:
  api:
    image: myorg/api:1.0
    deploy:
      replicas: 6
      update_config:
        parallelism: 2
        delay: 30s
        order: start-first
        failure_action: rollback
        monitor: 60s
        max_failure_ratio: 0.2
      rollback_config:
        parallelism: 2
        delay: 10s
      restart_policy:
        condition: any
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 30s

Deploy with the new image:
sed -i 's/myorg\/api:1.0/myorg\/api:1.1/' stack.yaml
docker stack deploy -c stack.yaml mystack
docker service ps mystack_api
# Watch tasks replace 2 at a time, with a 30s gap, monitored for 60s each.

Manual rollout with imperative flags
docker service update \
--image myorg/api:1.1 \
--update-parallelism 1 \
--update-delay 60s \
--update-monitor 120s \
--update-failure-action rollback \
--update-max-failure-ratio 0.0 \
--update-order start-first \
api
# Strict: any failure triggers rollback.

Useful for one-off, tightly controlled rollouts.
Watching a rollout
$ watch -n 2 'docker service ps mystack_api --format "table {{.Name}}\t{{.Image}}\t{{.CurrentState}}"'
# Live view of which tasks are which version, in which state.

Great for verifying that the rollout is progressing as expected.