How to perform a rolling update in Docker Swarm?
Rolling updates in Docker Swarm replace running tasks one batch at a time, waiting for health between batches. Swarm's built-in machinery is genuinely good at this, with automatic rollback on failure as a first-class feature.
Theory
TL;DR
- docker service update with --image is the trigger. Swarm replaces tasks per the update_config policy.
- Key parameters: parallelism (how many at a time), delay (between batches), monitor (how long to watch each batch), failure-action (continue/pause/rollback).
- Order: stop-first (default, brief gap per task) or start-first (zero downtime if the app supports concurrent old/new).
- Rollback is a single command (docker service rollback) or automatic on failure.
- A healthcheck on the service is what makes "failure" detectable. Without it, Swarm assumes started = healthy.
The update flow
Service: 6 replicas of api:1.0
--update-parallelism=2 --update-delay=30s

t=0:  [v1.0 v1.0 v1.0 v1.0 v1.0 v1.0]  issue update
t=0:  [STOP STOP v1.0 v1.0 v1.0 v1.0]  stop 2 (or start-first: extra v1.1 tasks spawned)
t=5:  [v1.1 v1.1 v1.0 v1.0 v1.0 v1.0]  2 new tasks healthy
t=35: [v1.1 v1.1 STOP STOP v1.0 v1.0]  after the 30s delay, next batch
t=40: [v1.1 v1.1 v1.1 v1.1 v1.0 v1.0]
t=70: [v1.1 v1.1 v1.1 v1.1 v1.1 v1.1]  done

During the update, traffic continues to whichever replicas are healthy.
Imperative form (CLI)
docker service update \
--image myorg/api:1.1 \
--update-parallelism 1 \
--update-delay 30s \
--update-monitor 30s \
--update-failure-action rollback \
--update-max-failure-ratio 0.2 \
--update-order start-first \
  api

What each flag does:
- --update-parallelism N: replace N tasks at a time (default 1).
- --update-delay 30s: wait between batches.
- --update-monitor 30s: watch each batch for failures for this long.
- --update-failure-action <continue|pause|rollback>: what to do on failure.
- --update-max-failure-ratio 0.2: at most 20% of tasks may fail before triggering the action.
- --update-order <stop-first|start-first>: replace by stopping the old task first, or starting the new one first.
Declarative form (stack file)
version: '3.9'
services:
  api:
    image: myorg/api:1.0
    deploy:
      replicas: 6
      update_config:
        parallelism: 1
        delay: 30s
        order: start-first
        failure_action: rollback
        monitor: 30s
        max_failure_ratio: 0.2
      rollback_config:
        parallelism: 2
        delay: 5s
        failure_action: pause

docker stack deploy -c stack.yaml mystack
# Edit the image to 1.1, redeploy → triggers a rolling update with the config above.

The stack file is the canonical place for this configuration: version-controlled and reviewable.
Health-driven gating
Swarm decides "is this batch healthy?" by:
- The container started successfully (no exit during the monitor period).
- If a healthcheck is defined, the container reports healthy.
- No more than max_failure_ratio of tasks fail in the batch.
Without a healthcheck, Swarm only knows "the process started". An app that starts but immediately misbehaves still counts as "healthy" to Swarm. Healthchecks are essential for safe rolling updates.
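A minimal healthcheck sketch that gives Swarm a real readiness signal. The port, the /health endpoint, and the presence of curl inside the image are assumptions for illustration; substitute whatever readiness probe your app exposes.

```yaml
services:
  api:
    image: myorg/api:1.0
    healthcheck:
      # hypothetical endpoint; replace with your app's readiness URL
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s      # probe every 10 seconds
      timeout: 3s        # a probe with no response after 3s counts as failed
      retries: 3         # 3 consecutive failures → task marked unhealthy
      start_period: 15s  # grace period before failures count
```

With this in place, "healthy" means the app actually answers, not merely that the process started.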
Rollback
# Manual rollback at any time
docker service rollback api
# Reverts to the previously deployed service spec (old image and settings)

With failure_action: rollback, Swarm rolls back automatically when the failure ratio is exceeded. Combined with monitor, you get "if 1 of 5 tasks in the new batch is unhealthy after 30 seconds, roll back the whole service" semantics.
start-first vs stop-first
order: stop-first   # default: brief gap per task
order: start-first  # spin up new alongside old, then drain old

start-first is the path to true zero downtime, but it requires the app to tolerate brief overlap (two versions running at once). For stateless web/API services, that's fine. Workers with strict singleton semantics may need code changes.
Common mistakes
Updating without a healthcheck
services:
  api:
    image: myorg/api
    # NO healthcheck → Swarm cannot detect bad versions

Without a healthcheck, a broken new image rolls out to all replicas before the failure becomes visible. Add a healthcheck: block so Swarm gates progression on actual app readiness.
Setting parallelism too high
update_config:
  parallelism: 5  # 5 of the 6 replicas replaced at once

During the brief replacement window you have very few healthy tasks, and a spike in load means a pile-up. Lower parallelism is safer.
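A safer sketch for the same 6-replica service (values illustrative): cap concurrent replacements so most replicas keep serving, or use start-first to avoid losing capacity at all.

```yaml
update_config:
  parallelism: 2      # at most 2 of 6 replicas in flight → 4 still serving
  delay: 30s
  order: start-first  # new task comes up before the old one is stopped
```

With stop-first and parallelism 5, only one task would serve traffic during the window; this config never drops below four.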
Forgetting rollback_config
The rollback uses its own configuration block. If you only set update_config, rollback uses defaults (often slower than you want). Define rollback_config explicitly.
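An explicit rollback_config sketch (values are illustrative assumptions): typically you want rollback to be faster than the forward update, and to pause rather than loop if the rollback itself fails.

```yaml
rollback_config:
  parallelism: 3         # roll back faster than the forward update
  delay: 5s
  monitor: 30s
  failure_action: pause  # if rollback also fails, stop and let a human look
```

Without this block, rollback falls back to defaults, which may be slower than the incident demands.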
Image tag still latest for --rollback
docker service rollback api
No previous image to roll back to: same tagIf both new and old were tagged latest, Swarm cannot distinguish them. Always tag with a version (or commit SHA) so rollback works.
Real-world usage
- Production deploys on Swarm clusters: every new image triggers a service update; Swarm handles parallelism + monitoring.
- Staged canary: first deploy 1 of 10 with parallelism=1 and a long monitor; if it stabilizes, raise parallelism for the rest.
- Hotfix rollouts: service update --image hotfix:1.0 with high parallelism (faster) and aggressive monitoring (catch failures fast).
- Database migrations: never with a rolling update directly. Run a one-off migrator service first, then update the app replicas.
Follow-up questions
Q: What happens to in-flight requests during a task replacement?
A: Tasks scheduled for replacement get SIGTERM and the configured grace period (stop_grace_period). Apps should drain in-flight requests before exiting. Combined with the routing mesh, traffic is steered away from stopping tasks before SIGTERM.
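A sketch of the relevant knobs for graceful draining (the 30s value is an illustrative assumption; the default grace period is 10s):

```yaml
services:
  api:
    stop_grace_period: 30s   # time between SIGTERM and SIGKILL
    deploy:
      update_config:
        order: start-first   # new task is healthy before the old one gets SIGTERM
```

The app itself must handle SIGTERM by finishing in-flight requests and exiting within the grace period; otherwise it is killed mid-request.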
Q: Can I update multiple services together?
A: Edit the stack file with new images for each, then docker stack deploy -c stack.yaml mystack. Each service updates independently per its own config; you do not get cross-service ordering.
Q: How is Swarm rolling update different from K8s rolling update?
A: Conceptually identical. K8s deployment: maxSurge, maxUnavailable ≈ Swarm's parallelism and order. K8s readiness probes ≈ Swarm healthchecks. Same model, different syntax.
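For comparison, a rough Kubernetes equivalent of Swarm's parallelism plus start-first order (names and values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # ~ start-first: one extra new pod at a time
      maxUnavailable: 0  # never drop below the desired replica count
```

maxSurge/maxUnavailable together express what Swarm splits across parallelism and order; readiness probes play the role of Swarm healthchecks in gating each step.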
Q: What is the difference between update_config and rollback_config?
A: update_config controls forward updates (1.0 → 1.1). rollback_config controls reverse updates (1.1 → 1.0). Often you want a slower, safer rollback than the forward update.
Q: (Senior) How would you design rolling-update parameters for a service that takes 90 seconds to warm up?
A: Set start_period in the healthcheck to 120s (give warmup time before failures count). Set --update-monitor to 180s (wait long enough for real failures to emerge). Use parallelism = 1 (slow rollout; total update time ≈ replicas × ~90s warmup). Use failure_action = rollback. The pattern: monitor period > healthcheck start period > the app's actual warmup time. Faster rollouts hide warmup-related failures; this conservative config catches them.
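The answer above as a stack-file sketch (service name, port, and endpoint are hypothetical):

```yaml
services:
  worker:
    image: myorg/worker:2.0   # hypothetical service with ~90s warmup
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]  # assumed endpoint
      interval: 10s
      start_period: 120s       # failures during warmup don't count
    deploy:
      replicas: 4
      update_config:
        parallelism: 1         # one task at a time
        monitor: 180s          # longer than start_period, so real failures surface
        failure_action: rollback
        order: start-first
```

The key relationship is monitor (180s) > start_period (120s) > warmup (~90s); shrinking any of these gaps risks promoting a task that has not yet shown its true behavior.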
Examples
Production-quality rollout
version: '3.9'
services:
  api:
    image: myorg/api:1.0
    deploy:
      replicas: 6
      update_config:
        parallelism: 2
        delay: 30s
        order: start-first
        failure_action: rollback
        monitor: 60s
        max_failure_ratio: 0.2
      rollback_config:
        parallelism: 2
        delay: 10s
      restart_policy:
        condition: any
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 30s

Deploy with the new image:
sed -i 's/myorg\/api:1.0/myorg\/api:1.1/' stack.yaml
docker stack deploy -c stack.yaml mystack
docker service ps mystack_api
# Watch tasks replace 2 at a time, with a 30s gap, monitored for 60s each.

Manual rollout with imperative flags
docker service update \
--image myorg/api:1.1 \
--update-parallelism 1 \
--update-delay 60s \
--update-monitor 120s \
--update-failure-action rollback \
--update-max-failure-ratio 0.0 \
--update-order start-first \
api
# Strict: any failure triggers rollback.

Useful for one-off, tightly controlled rollouts.
Watching a rollout
$ watch -n 2 'docker service ps mystack_api --format "table {{.Name}}\t{{.Image}}\t{{.CurrentState}}"'
# Live view of which tasks are which version, in which state.

Great for verifying that the rollout is progressing as expected.