How to live migrate a Docker container between hosts using CRIU?
CRIU (Checkpoint/Restore In Userspace) is a Linux tool that freezes a running process tree, writes it to a set of image files, and restores it later, on the same host or another, ideally without the program noticing. Docker integrates CRIU through the experimental docker checkpoint subcommand, which is the closest thing Docker has to live container migration. It works for a narrow set of use cases; for most teams, a stateless redeploy is simpler and more reliable.
Theory
TL;DR
- CRIU dumps a process's memory, file descriptors, sockets, and namespaces to disk, and restores them on demand.
- docker checkpoint create produces a checkpoint; docker start --checkpoint resumes from it.
- Requires "experimental": true in daemon.json and a kernel with CRIU support.
- Both hosts need the same kernel version, the same CPU instruction set, the same Docker version, and identical filesystem state at the time of the snapshot.
- Network state is finicky: the container's IP and MAC addresses move with it, and the network must permit that.
- Use case is narrow: long-running computations where you can't afford to restart from scratch (HPC, ML training, long simulations).
- Most production patterns (web apps, microservices) use stateless redeploy instead, where state lives outside the container (DB, object storage).
How CRIU works
- Stop all processes in the target's PID namespace (using ptrace).
- Read each process's:
  - Memory pages
  - Open files / socket states / pipes
  - Namespaces (PID, mnt, net, uts, ipc, user)
  - Threads, futexes, signals
- Serialize all of the above to disk as protobuf-encoded files.
- Either let the processes continue running (--leave-running) or kill them, which is the default.
- On the target host, recreate processes with the same PIDs (Linux allows requesting specific PIDs in a new namespace), restore memory pages, reattach FDs.
- Resume execution.
The crux is that the kernel state must match: file paths, mounted volumes, network state, /etc/hosts, devices. CRIU detects mismatches and refuses to restore in many cases.
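Outside Docker, the steps above map onto two CRIU commands. The following is a minimal sketch, assuming root privileges and a process that was started from a shell. The function names, the `DRY_RUN` switch, and the image directory are illustrative and not part of CRIU.

```shell
# Sketch only: thin wrappers around the criu CLI.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Freeze the process tree rooted at PID $1 and write its image files to directory $2.
# --shell-job is needed when the process is attached to a terminal session.
# Without --leave-running, criu kills the tree once the dump is complete.
criu_dump_tree() {
    run mkdir -p "$2" &&
    run criu dump --tree "$1" --images-dir "$2" --shell-job
}

# Recreate the process tree from the image files in directory $1.
# --restore-detached returns control once the restored tasks are running.
criu_restore_tree() {
    run criu restore --images-dir "$1" --shell-job --restore-detached
}
```

A typical call would be `criu_dump_tree 1234 /tmp/img` followed later by `criu_restore_tree /tmp/img`, both as root. Docker drives the same dump and restore operations through its runtime rather than through these exact command lines.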
Why this is rare in practice
- Kernel-version sensitivity. A process snapshotted on kernel 5.10 may fail to restore on 5.15 because internal kernel structures changed.
- Hardware sensitivity. Different CPU vendor or microarchitecture can break (AVX availability, TSC behavior, randomness sources).
- Network and filesystem state. Open TCP connections, NFS handles, special files — easy to break.
- Most workloads do not need it. A stateless web server can restart in seconds; you do not need to migrate it live, you just deploy a new one.
- Modern alternatives. Kubernetes pod eviction and rescheduling, blue-green deploys, and rolling updates are all simpler than CRIU.
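Kernel and CPU mismatches are cheaper to catch before a checkpoint than after a failed restore. The sketch below collects a few of the facts listed above into a comparable form. The function names and the choice of facts are assumptions for illustration, and the CPU flag parsing assumes an x86-style /proc/cpuinfo with a `flags` line.

```shell
# Print facts that should match on source and target before attempting a restore.
host_fingerprint() {
    echo "kernel=$(uname -r)"
    echo "arch=$(uname -m)"
    # CPU feature flags of the first core, sorted so ordering differences do not matter.
    if [ -r /proc/cpuinfo ]; then
        fp_flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2 | tr ' ' '\n' | sort | tr '\n' ' ')
        echo "cpuflags=$fp_flags"
    fi
    # Docker server version, only if the docker CLI is installed.
    if command -v docker >/dev/null 2>&1; then
        echo "docker=$(docker version --format '{{.Server.Version}}' 2>/dev/null)"
    fi
}

# Compare two fingerprint files; print the differing lines and return non-zero on mismatch.
compare_fingerprints() {
    if diff "$1" "$2"; then
        echo "hosts match"
    else
        echo "hosts differ: restore may fail" >&2
        return 1
    fi
}
```

Run `host_fingerprint > a.txt` on one host and `host_fingerprint > b.txt` on the other, copy one file across, then call `compare_fingerprints a.txt b.txt`. A match does not guarantee a successful restore; it only rules out the most common causes of failure.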
When it actually helps
- Scientific computing: a 12-hour simulation that has run for 8 hours; the host needs maintenance. Snapshot, migrate, resume. Beats restarting.
- Long-running ML training jobs: similar.
- Hot patching of stateful in-memory services: rare, advanced.
- Container live-migration in research/POC: CRIU is the building block for projects like P.Haul (process hauler) and other container live-migration prototypes.
Examples
Setup
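The kernel must be built with checkpoint/restore support. Most modern distro kernels are. Check:
zgrep CONFIG_CHECKPOINT /proc/config.gz
# CONFIG_CHECKPOINT_RESTORE=y
# If /proc/config.gz is absent: grep CONFIG_CHECKPOINT_RESTORE /boot/config-$(uname -r)
Install CRIU: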
sudo apt install -y criu # Debian/Ubuntu
sudo dnf install -y criu # Fedora/RHEL
sudo criu check
# Looks good.
Enable experimental in /etc/docker/daemon.json:
{
  "experimental": true
}
sudo systemctl restart docker
docker version | grep Experimental
Same-host snapshot/restore (the simplest case)
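Before taking the first checkpoint, the setup steps can be verified in one pass. This is a sketch under assumptions: the function names are hypothetical, and the kernel config is looked up in the two locations most distros use.

```shell
# Return 0 if the kernel config file $1 (plain text or .gz) enables checkpoint/restore.
kernel_supports_cr() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | grep -q '^CONFIG_CHECKPOINT_RESTORE=y'
}

# Run all setup checks; print one line per failed check and return non-zero if any fails.
preflight() {
    pf_status=0
    pf_kernel=no
    for pf_cfg in /proc/config.gz "/boot/config-$(uname -r)"; do
        if [ -r "$pf_cfg" ] && kernel_supports_cr "$pf_cfg"; then
            pf_kernel=yes
        fi
    done
    [ "$pf_kernel" = yes ] || { echo "kernel: CONFIG_CHECKPOINT_RESTORE=y not found"; pf_status=1; }
    command -v criu >/dev/null 2>&1 || { echo "criu: not installed"; pf_status=1; }
    [ "$(docker version --format '{{.Server.Experimental}}' 2>/dev/null)" = "true" ] ||
        { echo "docker: experimental mode is off or daemon unreachable"; pf_status=1; }
    [ "$pf_status" -eq 0 ] && echo "preflight: ok"
    return "$pf_status"
}
```

Calling `preflight` on each host involved gives a quick yes/no answer; it does not replace `criu check`, which probes the running kernel in more depth.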
Start a long-running container:
# No --rm here: the container exits when it is checkpointed and must still exist for the restore
docker run -d --name counter \
  busybox sh -c 'i=0; while true; do echo $i; i=$((i+1)); sleep 1; done'
# Watch it count
docker logs -f counter
# 0
# 1
# 2
# ...
# 47
Checkpoint at 47:
docker checkpoint create counter cp1
# cp1
The container is now stopped (the default behavior). The checkpoint files are at /var/lib/docker/containers/<id>/checkpoints/cp1/.
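To confirm the checkpoint exists and see how much disk it takes, something like the following can help. It is a sketch: the helper names are hypothetical, the path layout is the default one described above, and reading /var/lib/docker requires root.

```shell
# Default on-disk location of a checkpoint.
# $1 = full container ID, $2 = checkpoint name
checkpoint_path() {
    printf '/var/lib/docker/containers/%s/checkpoints/%s\n' "$1" "$2"
}

# List a container's checkpoints and report the disk usage of the named one.
# $1 = container name, $2 = checkpoint name
inspect_checkpoint() {
    docker checkpoint ls "$1" &&
    ckpt_id=$(docker inspect --format '{{.Id}}' "$1") &&
    sudo du -sh "$(checkpoint_path "$ckpt_id" "$2")"
}
```

For the example above the call would be `inspect_checkpoint counter cp1`. An unwanted checkpoint can be deleted with `docker checkpoint rm counter cp1`.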
Restore:
docker start --checkpoint=cp1 counter
docker logs -f counter
# 48
# 49
# ...
The counter resumed from 48, not 0. State preserved.
Snapshot without stopping the container
docker checkpoint create --leave-running counter cp2
# The container is frozen briefly during the dump, then keeps running (cp2, because cp1 already exists)
Cross-host migration
Step 1: the image must be present on both hosts
# On host A
docker save myorg/app:1.0 | ssh hostb "docker load"
# Or push to a registry and pull on B
Step 2: checkpoint on host A
docker checkpoint create --checkpoint-dir=/var/checkpoints app cp1
The --checkpoint-dir flag overrides the default location so you can grab the files easily.
Step 3: copy the checkpoint and any volume data to host B
rsync -a --delete /var/checkpoints/cp1/ hostb:/var/checkpoints/cp1/
# Plus any bind-mounted directories, named volumes, etc.
Step 4: on host B, recreate the container in the stopped state
ssh hostb
docker create --name app \
  -v /data:/data \
  -p 8080:8080 \
  myorg/app:1.0
# Note: same volume mounts, same ports, same image
Step 5: restore from the checkpoint
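docker start --checkpoint=cp1 --checkpoint-dir=/var/checkpoints app
# If this Docker release rejects --checkpoint-dir on start, copy cp1 into /var/lib/docker/containers/<id>/checkpoints/ on host B and drop the flag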
docker logs app
# Should resume from where host A left off
What can go wrong
- "Failed to restore: open files mismatch": a file was open on host A that does not exist on host B. Fix the bind mounts.
- "Failed to restore: socket peer not found": an open TCP connection cannot be re-established. CRIU has a --tcp-established mode that works for short-lived connections but is unreliable for long-lived ones.
- "Unable to restore PID X": the PID is already taken in the target PID namespace on host B. CRIU usually avoids this in containers (private PID namespace), but edge cases exist.
- Kernel/glibc/CPU mismatch: surfaces as any of dozens of subtle errors. Read the CRIU docs.
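Because any of these failures can strike halfway through, it helps to run the cross-host steps as one chain that stops at the first error. The sketch below assumes passwordless ssh to the target, an image already loaded on both hosts, and /var/checkpoints writable on both. The `migrate` and `run` names and the `DRY_RUN` switch are illustrative.

```shell
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# migrate CONTAINER IMAGE TARGET_HOST CHECKPOINT_NAME
# Add the same -v and -p flags to the remote docker create that the source container uses.
migrate() {
    m_container=$1 m_image=$2 m_host=$3 m_cp=$4
    m_dir=/var/checkpoints

    # The && chain makes each step run only if the previous one succeeded.
    run docker checkpoint create --checkpoint-dir="$m_dir" "$m_container" "$m_cp" &&
    run rsync -a --delete "$m_dir/$m_cp/" "$m_host:$m_dir/$m_cp/" &&
    run ssh "$m_host" docker create --name "$m_container" "$m_image" &&
    run ssh "$m_host" docker start --checkpoint="$m_cp" --checkpoint-dir="$m_dir" "$m_container"
}
```

A dry run such as `DRY_RUN=1 migrate app myorg/app:1.0 hostb cp1` prints the four commands for review. If a later step fails, the checkpoint is still on host A, so the container can be resumed there with docker start --checkpoint.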
Where it actually shines (real example)
Long-running scientific simulation:
# Day 1: start a 24-hour simulation
docker run -d --name sim --gpus all myorg/simulation:1.0 sim_run config.toml
# 18 hours in, the host's GPU drivers need updating. Cannot afford a restart.
# Caveat: stock CRIU does not capture GPU state (see Limitations), so a job with live GPU state needs additional plugin support.
docker checkpoint create --leave-running --checkpoint-dir=/scratch/cp1 sim cp1
rsync -a /scratch/cp1/ gpu-host-2:/scratch/cp1/
ssh gpu-host-2
docker create --name sim --gpus all myorg/simulation:1.0 sim_run config.toml
docker start --checkpoint=cp1 --checkpoint-dir=/scratch/cp1 sim
# Simulation resumes on the other host and continues for 6 more hours.
# Because of --leave-running, the original is still running on the first host; stop it once the restored copy is verified.
Real-world usage
- HPC clusters and research environments (the original use case).
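- Some container runtimes integrate CRIU as a feature: Podman (podman container checkpoint) and LXC (lxc-checkpoint) do, and the HPC-focused Singularity also offers experimental checkpointing.
- Kubernetes has no built-in pod live migration. What exists is container checkpointing through the kubelet API (KEP-2008, Forensic Container Checkpointing), alpha in v1.25 and beta in v1.30 (2024). It uses CRIU under the hood.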
- Docker Swarm and stock K8s production: do not rely on it. Use deploy strategies (rolling, blue-green, canary).
Limitations
- Experimental. Docker has not graduated checkpoint to GA. APIs may change.
- No GPU state migration in stock CRIU. Active research, not production.
- No support for some namespaces in older versions.
- Not in Docker Desktop. Docker Desktop's VM does not enable CRIU.
- Performance: the container is frozen while its memory is written to disk, which takes seconds to minutes for large containers.
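A rough lower bound on the dump time is memory size divided by the write throughput of the checkpoint disk. The sketch below does that arithmetic; the function name and the sample numbers are illustrative, and real dumps add overhead for freezing, metadata, and page collection.

```shell
# Estimate the seconds needed to write a checkpoint.
# $1 = resident memory of the container in MiB
# $2 = sustained write throughput of the checkpoint disk in MiB/s
estimate_dump_seconds() {
    mem_mib=$1
    disk_mib_per_s=$2
    # Integer ceiling division
    echo $(( (mem_mib + disk_mib_per_s - 1) / disk_mib_per_s ))
}

estimate_dump_seconds 8192 200     # 8 GiB at 200 MiB/s, prints 41
estimate_dump_seconds 65536 200    # 64 GiB at 200 MiB/s, prints 328
```

The memory figure can be read from `docker stats --no-stream` for the container in question. Copying the checkpoint to another host adds a similar calculation for network throughput.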
Alternatives that solve the same problem better
| Need | Better tool |
|---|---|
| Move web server with no downtime | Blue-green deploy with stateless app |
| Move stateful service | Externalize state (DB, S3); redeploy stateless wrapper |
| Maintenance window on a node | Drain, reschedule (K8s, Swarm, Nomad) |
| Long-running computation | Periodic application-level checkpoints (write progress to disk every N minutes; resume from there) |
Application-level checkpoints are usually a better answer than CRIU because:
- They are portable across kernel/CPU/distro changes.
- They are smaller (only your data, not all of memory).
- They survive image upgrades.
- They are testable.
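As an illustration, here is the counter from the earlier example rewritten to checkpoint itself. It is a minimal sketch: the function name and state file are hypothetical, and a real job would save whatever data defines its progress rather than a single integer.

```shell
# Count upward for $2 steps, persisting progress in state file $1 after every step.
# On start, resume from the saved value if the state file exists.
count_with_checkpoint() {
    state_file=$1
    steps=$2
    i=0
    if [ -f "$state_file" ]; then
        i=$(cat "$state_file")
    fi
    n=0
    while [ "$n" -lt "$steps" ]; do
        echo "$i"
        i=$((i + 1))
        # Write to a temp file and rename, so a crash never leaves a half-written state file.
        printf '%s\n' "$i" > "$state_file.tmp"
        mv "$state_file.tmp" "$state_file"
        n=$((n + 1))
    done
}
```

Running `count_with_checkpoint /data/progress 3` prints 0, 1, 2; running it again, on any host that has the state file, continues with 3. Moving the job then only requires copying that file, with no dependency on kernel or CPU.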
Common mistakes
Treating CRIU as production-grade live migration
It is experimental in Docker. For production, use rolling deploys.
Not matching state across hosts
Volume contents, mounted secrets, the host's /etc/resolv.conf: all of it must match on both hosts. Easier said than done.
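One way to check that a directory matches on both hosts is to reduce it to a single digest and compare the strings. This is a sketch with a hypothetical function name; it covers file paths and contents only, not ownership, permissions, or timestamps.

```shell
# Print one digest summarizing every regular file (relative path and content) under directory $1.
tree_digest() {
    (
        cd "$1" &&
        find . -type f -exec sha256sum {} + |
            LC_ALL=C sort -k 2 |
            sha256sum |
            cut -d ' ' -f 1
    )
}
```

Run `tree_digest /data` on host A and on host B after the rsync step; differing output means the restore would see different files than the checkpointed process had open.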
Trying to migrate containers with active TCP connections
CRIU's --tcp-established is fragile. Drain connections first, or accept resets.
Skipping kernel match
Migrating from kernel 5.10 to 5.15 may work, may not. Test in staging.
Follow-up questions
Q: Is docker checkpoint enabled by default?
A: No. You must enable "experimental": true in /etc/docker/daemon.json and restart the daemon.
Q: What about Kubernetes live migration?
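A: Kubernetes does not ship live migration. KEP-2008 (Forensic Container Checkpointing) adds a kubelet API that checkpoints a container using CRIU; it has been beta since v1.30 (2024). Restoring the checkpoint on another node is still a manual process, so it is not recommended for production migration yet (as of late 2024).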
Q: Can I checkpoint a container with a database inside?
A: Technically yes, but the on-disk DB files must be present on the target host. This is one reason DBs are normally on volumes, and the volume is replicated separately (DB replication).
Q: (Senior) Why is CRIU not the standard answer for moving stateful services?
A: Because the universe of "identical state on both hosts" is fragile. Modern stateful services (databases, message queues) are designed for replication: leader/follower, multi-master, distributed consensus. You move data via the application's replication protocol, not by snapshotting kernel state. CRIU sidesteps the application's data model and assumes it can recreate every byte of process state — far more brittle than mirroring data through Postgres replication or Kafka mirroring.
Q: (Senior) When is the right time to consider CRIU?
A: When (1) the workload is single-instance and cannot trivially restart, (2) progress is in process memory (not in DB or files), (3) you control both hosts including kernel version, (4) the cost of reproducing the state from scratch exceeds the engineering cost of dealing with CRIU's quirks. HPC/research checks all four. Production microservices check none.