How to live migrate a Docker container between hosts using CRIU?

CRIU (Checkpoint/Restore In Userspace) is a Linux tool that freezes a running process tree, writes its state to a set of image files, and restores it later, on the same host or another, ideally with the program none the wiser. Docker integrates CRIU through the experimental docker checkpoint subcommand, which is the closest thing Docker has to live container migration. It works for narrow use cases; for most teams, a stateless redeploy is simpler and more reliable.

Theory

TL;DR

CRIU dumps a process tree's memory, file descriptors, sockets, and namespaces to disk, and restores them on demand.
docker checkpoint create produces a checkpoint; docker start --checkpoint resumes from it.
Requires "experimental": true in daemon.json and a kernel with CRIU support.
Both hosts need a compatible kernel, the same CPU features, the same Docker version, and the same filesystem state the container had at the time of the snapshot.
Network state is finicky: the container's IP and MAC address move with it, so the network must allow the same addresses to appear on the target host.
Use case is narrow: long-running computations where you can't afford to restart from scratch (HPC, ML training, long simulations).
Most production patterns (web apps, microservices) use stateless redeploy instead, where state lives outside the container (DB, object storage).

How CRIU works

Freeze every process in the target tree (via ptrace, or the freezer cgroup when one is available).
Read each process's:
- Memory pages
- Open files / socket states / pipes
- Namespaces (PID, mnt, net, uts, ipc, user)
- Threads, futexes, signals
Serialize all of the above to disk as protobuf-encoded files.
Either kill the original processes (the default) or leave them running.
On the target host, recreate processes with the same PIDs (Linux allows requesting specific PIDs in a new namespace), restore memory pages, reattach FDs.
Resume execution.

The crux is that the kernel state must match: file paths, mounted volumes, network state, /etc/hosts, devices. CRIU detects mismatches and refuses to restore in many cases.
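The dump and restore sequence above can be tried outside Docker by calling CRIU directly. This is a minimal sketch, assuming root privileges and a process that was started from a shell (hence --shell-job). The run wrapper and the DRY_RUN variable are illustrative helpers, not part of CRIU, and the PID and image directory are placeholders.

```shell
# Sketch: checkpoint a process tree with CRIU and restore it later.
# DRY_RUN=1 prints each command instead of executing it (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# criu_dump PID DIR: freeze the tree rooted at PID and write its images to DIR.
criu_dump() {
    run mkdir -p "$2" &&
    run criu dump --tree "$1" --images-dir "$2" --shell-job
}

# criu_restore DIR: recreate the processes from the images stored in DIR.
criu_restore() {
    run criu restore --images-dir "$1" --shell-job --restore-detached
}
```

With DRY_RUN=1 set, criu_dump 1234 /tmp/criu-img only prints the two commands it would run, which is a safe way to review them before running as root.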

Why this is rare in practice

Kernel-version sensitivity. A process snapshotted on kernel 5.10 may fail to restore on 5.15 because internal kernel structures changed.
Hardware sensitivity. Different CPU vendor or microarchitecture can break (AVX availability, TSC behavior, randomness sources).
Network and filesystem state. Open TCP connections, NFS handles, special files — easy to break.
Most workloads do not need it. A stateless web server can restart in seconds; you do not need to migrate it live, you just deploy a new one.
Modern alternatives. Kubernetes pod eviction + rescheduling, blue-green deploys, rolling update — all simpler than CRIU.
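Kernel and hardware sensitivity can be checked before attempting a migration. The sketch below records the three properties named above on each host and compares them. The function names are hypothetical, and reading the flags line of /proc/cpuinfo assumes an x86 host (other architectures label the line differently).

```shell
# Sketch: record the properties a CRIU restore is most sensitive to.

# host_fingerprint: print kernel release, sorted CPU flags, and Docker server version.
host_fingerprint() {
    echo "kernel=$(uname -r)"
    # Take the first "flags" line, split into one flag per line, sort, rejoin.
    flags=$(awk -F: '/^flags/ { print $2; exit }' /proc/cpuinfo 2>/dev/null \
        | tr ' ' '\n' | sed '/^$/d' | sort | tr '\n' ' ')
    echo "cpuflags=$flags"
    echo "docker=$(docker version --format '{{.Server.Version}}' 2>/dev/null)"
}

# compare_fingerprints FILE_A FILE_B: print differing lines;
# the exit status is 0 only when the two files are identical.
compare_fingerprints() {
    diff "$1" "$2"
}
```

Run host_fingerprint on each host, redirect the output to a file, copy one file across, and pass both to compare_fingerprints. Identical fingerprints do not guarantee a successful restore; a difference is a strong hint that it will fail.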

When it actually helps

Scientific computing: a 12-hour simulation that has run for 8 hours; the host needs maintenance. Snapshot, migrate, resume. Beats restarting.
Long-running ML training jobs: similar.
Hot patching of stateful in-memory services: rare, advanced.
Container live-migration in research/POC: CRIU is the building block for projects like P.Haul (process haul) or container live-migration prototypes.

Examples

Setup

The kernel must be built with CONFIG_CHECKPOINT_RESTORE. Most modern distro kernels are. Check:

bash

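# Most distros ship the config under /boot; /proc/config.gz is the fallback
grep CONFIG_CHECKPOINT_RESTORE "/boot/config-$(uname -r)" 2>/dev/null \
    || zgrep CONFIG_CHECKPOINT_RESTORE /proc/config.gz
# CONFIG_CHECKPOINT_RESTORE=y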

Install CRIU:

bash

sudo apt install -y criu        # Debian/Ubuntu
sudo dnf install -y criu        # Fedora/RHEL
criu check
# Looks OK.

Enable experimental in /etc/docker/daemon.json:

json

{
  "experimental": true
}

bash

sudo systemctl restart docker
docker version | grep Experimental

Same-host snapshot/restore (the simplest case)

Start a long-running container:

bash

# No --rm: the container must still exist after the checkpoint stops it
docker run -d --name counter \
    busybox sh -c 'i=0; while true; do echo $i; i=$((i+1)); sleep 1; done'

# Watch it count
docker logs -f counter
# 0
# 1
# 2
# ...
# 47

Checkpoint at 47:

bash

docker checkpoint create counter cp1
# cp1

The container is now stopped (default behavior). The checkpoint files are at /var/lib/docker/containers/<id>/checkpoints/cp1/.

Restore:

bash

docker start --checkpoint=cp1 counter
docker logs -f counter
# 48
# 49
# ...

The counter resumed from 48, not 0. State preserved.
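Whether the counter really continued can be checked mechanically rather than by eye. The helper below is a small sketch with a hypothetical name. It assumes the log contains one integer per line, as the counter container above prints.

```shell
# is_contiguous: read one integer per line on stdin and succeed only if each
# value is exactly one more than the previous one (no reset to 0, no gap).
is_contiguous() {
    awk 'NR > 1 && $1 != prev + 1 { bad = 1 } { prev = $1 } END { exit bad }'
}
```

For example, docker logs counter | is_contiguous && echo "no reset" prints the message only if the restored container picked up where the checkpoint left off.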

Snapshot without stopping the container

bash

# The container must be running; cp1 already exists, so use a new name
docker checkpoint create --leave-running counter cp2
# The container keeps running after the checkpoint is taken

Cross-host migration

Step 1 — image must be present on both hosts

bash

# On host A
docker save myorg/app:1.0 | ssh hostb "docker load"
# Or push to a registry and pull on B

Step 2 — checkpoint on host A

bash

docker checkpoint create --checkpoint-dir=/var/checkpoints app cp1

The --checkpoint-dir flag overrides the default location so the files are easy to copy; the checkpoint ends up in /var/checkpoints/cp1/.

Step 3 — copy checkpoint and any volume data to host B

bash

rsync -a --delete /var/checkpoints/cp1/ hostb:/var/checkpoints/cp1/
# Plus any bind-mounted directories, named volumes, etc.

Step 4 — on host B, recreate container in stopped state

bash

ssh hostb
docker create --name app \
    -v /data:/data \
    -p 8080:8080 \
    myorg/app:1.0
# Note: same volume mounts, same ports, same image

Step 5 — restore from checkpoint

bash

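docker start --checkpoint=cp1 --checkpoint-dir=/var/checkpoints app
# If this Docker version rejects --checkpoint-dir on start, copy cp1 into
# /var/lib/docker/containers/<id>/checkpoints/ on host B and drop the flag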
docker logs app
# Should resume from where host A left off
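Steps 2 to 5 can be collected into one script. The sketch below mirrors the commands as written above and inherits their assumptions: the image is already on the target host, rsync and ssh work without prompts, and docker start accepts --checkpoint-dir on your Docker version. The function names and the DRY_RUN variable are hypothetical. Volume mounts and published ports are left out and must be added to the docker create call.

```shell
# Sketch: checkpoint on the source host, copy, recreate, and restore on the target.
# DRY_RUN=1 prints each command instead of running it (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# migrate CONTAINER IMAGE CHECKPOINT TARGET_HOST CHECKPOINT_DIR
migrate() {
    name=$1 image=$2 ckpt=$3 host=$4 dir=$5
    # Step 2: checkpoint on the source host (the container stops by default).
    run docker checkpoint create --checkpoint-dir="$dir" "$name" "$ckpt" || return 1
    # Step 3: copy the checkpoint files to the target host.
    run rsync -a --delete "$dir/$ckpt/" "$host:$dir/$ckpt/" || return 1
    # Step 4: recreate the container on the target in the stopped state.
    run ssh "$host" docker create --name "$name" "$image" || return 1
    # Step 5: restore from the checkpoint on the target.
    run ssh "$host" docker start --checkpoint="$ckpt" --checkpoint-dir="$dir" "$name"
}
```

Running it first with DRY_RUN=1, for example migrate app myorg/app:1.0 cp1 hostb /var/checkpoints, shows the exact command sequence without touching either host.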

What can go wrong

"Failed to restore: open files mismatch": a file was open on host A that does not exist on host B. Fix the bind mounts.
"Failed to restore: socket peer not found": an open TCP connection cannot be re-established. CRIU has a --tcp-established mode for short-lived connections, but unreliable for long-lived ones.
"Unable to restore PID X": a PID is already taken on host B's PID namespace. CRIU usually handles this in containers (private PID namespace) but edge cases exist.
Kernel/glibc/CPU mismatch: one of dozens of subtle errors. Read CRIU docs.
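The open-file failure is the easiest one to rule out in advance: every bind-mount source used on host A must exist on host B. The helper below is a sketch with a hypothetical name. It only checks that the paths exist, not that their contents match.

```shell
# missing_paths: read one path per line on stdin and print those
# that do not exist on the host where the function runs.
missing_paths() {
    while IFS= read -r p; do
        [ -n "$p" ] || continue
        [ -e "$p" ] || echo "$p"
    done
}
```

On host A, docker inspect -f '{{range .Mounts}}{{println .Source}}{{end}}' app > mounts.txt lists the mount sources. Copy the file to host B and run missing_paths < mounts.txt there; any output is a path to fix before restoring.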

Where it actually shines (real example)

Long-running scientific simulation:

bash

# Day 1: start a 24-hour, CPU-only simulation (CRIU does not capture GPU state)
docker run -d --name sim myorg/simulation:1.0 sim_run config.toml

# 18 hours in, the host needs maintenance. Cannot afford a restart.
# No --leave-running here: the original must stop so only one copy runs.
docker checkpoint create --checkpoint-dir=/scratch/cp1 sim cp1
rsync -a /scratch/cp1/ hostb:/scratch/cp1/
ssh hostb
docker create --name sim myorg/simulation:1.0 sim_run config.toml
docker start --checkpoint=cp1 --checkpoint-dir=/scratch/cp1 sim
# Simulation resumes on the other host, continues for 6 more hours

Real-world usage

HPC clusters and research environments (the original use case).
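Other container runtimes integrate CRIU directly: Podman (podman container checkpoint / restore), runc, and LXC.
Kubernetes exposes container checkpointing through the kubelet (KEP-2008, Forensic Container Checkpointing; alpha in v1.25, beta in v1.30). It uses CRIU under the hood but only creates checkpoints; it is not pod live migration.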
Docker Swarm and stock K8s production: do not rely on it. Use deploy strategies (rolling, blue-green, canary).

Limitations

Experimental. Docker has not graduated checkpoint to GA. APIs may change.
No GPU state migration in stock CRIU. Active research, not production.
No support for some namespaces in older versions.
Not in Docker Desktop. Docker Desktop's VM does not enable CRIU.
Performance: writing all of memory to disk takes seconds-to-minutes for large containers.

Alternatives that solve the same problem better

Need	Better tool
Move web server with no downtime	Blue-green deploy with stateless app
Move stateful service	Externalize state (DB, S3); redeploy stateless wrapper
Maintenance window on a node	Drain, reschedule (K8s, Swarm, Nomad)
Long-running computation	Periodic application-level checkpoints (write progress to disk every N minutes; resume from there)

Application-level checkpoints are usually a better answer than CRIU because:

They are portable across kernel/CPU/distro changes.
They are smaller (only your data, not all of memory).
They survive image upgrades.
They are testable.
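As an illustration of the application-level approach, here is the counter from the earlier example rewritten to persist its own progress. It is a minimal sketch: the function name and the state file layout are made up for this example, and a real job would save whatever data it needs to resume.

```shell
# count_with_checkpoint STATE_FILE STEPS: resume from STATE_FILE if it exists,
# advance STEPS times, and persist the counter after every step.
count_with_checkpoint() {
    state=$1 steps=$2
    i=0
    [ -f "$state" ] && i=$(cat "$state")
    n=0
    while [ "$n" -lt "$steps" ]; do
        echo "$i"
        i=$((i + 1))
        # Write to a temp file, then rename, so a crash never leaves
        # a half-written state file behind.
        echo "$i" > "$state.tmp" && mv "$state.tmp" "$state"
        n=$((n + 1))
    done
}
```

Calling count_with_checkpoint /data/state 3 twice prints 0 to 2 and then 3 to 5. The state file is a single number, so it can move between hosts, kernels, and image versions with no CRIU involved.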

Common mistakes

Treating CRIU as production-grade live migration

It is experimental in Docker. For production, use rolling deploys.

Not matching state across hosts

Volume contents, mounted secrets, the host's /etc/resolv.conf: all of it must match on both sides. Easier said than done.

Trying to migrate containers with active TCP connections

CRIU's --tcp-established is fragile. Drain connections first, or accept resets.
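Before checkpointing, it helps to know whether any connections are still open. The sketch below counts established IPv4 TCP sockets from /proc/net/tcp, where the fourth column holds the socket state and 01 means ESTABLISHED. The function name is hypothetical, and IPv6 sockets live in /proc/net/tcp6, which needs a second pass.

```shell
# count_established: read /proc/net/tcp-formatted text on stdin and print
# the number of sockets in state 01 (TCP_ESTABLISHED). Line 1 is a header.
count_established() {
    awk 'NR > 1 && $4 == "01" { n++ } END { print n + 0 }'
}
```

For a container whose image includes cat, docker exec app cat /proc/net/tcp | count_established reports how many connections would have to survive the move. A result of 0 means the drain is complete.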

Skipping kernel match

Migrating from kernel 5.10 to 5.15 may work, may not. Test in staging.

Follow-up questions

Q: Is docker checkpoint enabled by default?

A: No. You must enable "experimental": true in /etc/docker/daemon.json and restart the daemon.

Q: What about Kubernetes live migration?

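A: Kubernetes does not offer live migration. KEP-2008 (Forensic Container Checkpointing) adds a kubelet API that checkpoints a container with CRIU; it was alpha in v1.25 and reached beta in v1.30. Restoring on another node is a manual step through the container runtime, so it is not recommended as a production migration path (as of late 2024).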

Q: Can I checkpoint a container with a database inside?

A: Technically yes, but the on-disk DB files must be present on the target host. This is one reason DBs are normally on volumes, and the volume is replicated separately (DB replication).

Q: (Senior) Why is CRIU not the standard answer for moving stateful services?

A: Because the universe of "identical state on both hosts" is fragile. Modern stateful services (databases, message queues) are designed for replication: leader/follower, multi-master, distributed consensus. You move data via the application's replication protocol, not by snapshotting kernel state. CRIU sidesteps the application's data model and assumes it can recreate every byte of process state — far more brittle than mirroring data through Postgres replication or Kafka mirroring.

Q: (Senior) When is the right time to consider CRIU?

A: When (1) the workload is single-instance and cannot trivially restart, (2) progress is in process memory (not in DB or files), (3) you control both hosts including kernel version, (4) the cost of reproducing the state from scratch exceeds the engineering cost of dealing with CRIU's quirks. HPC/research checks all four. Production microservices check none.
