# How to live migrate a Docker container between hosts using CRIU?

## Short answer

Docker exposes **`docker checkpoint`** (powered by CRIU) to freeze a running container's state to disk and restore it elsewhere.

```bash
# Enable experimental on the daemon: "experimental": true in /etc/docker/daemon.json

# On host A: snapshot a running container
docker checkpoint create --checkpoint-dir=/var/checkpoints app cp1

# Copy /var/checkpoints/cp1 to host B (rsync, scp)
rsync -a /var/checkpoints/cp1/ hostb:/var/checkpoints/cp1/

# On host B: create the container in stopped state and start from checkpoint
docker create --name app myorg/app:1.0
docker start --checkpoint=cp1 --checkpoint-dir=/var/checkpoints app
```

**Reality check:** this is **experimental and rarely production-grade.** Most teams use **stateless redeploy** (kill on A, start on B with shared state) instead of CRIU. CRIU shines for HPC/scientific workloads with long-running computations.

## Answer

**CRIU** (Checkpoint/Restore In Userspace) is a Linux tool that freezes a running process tree to a set of files and restores it later, on the same host or another, ideally with the program none the wiser. Docker integrates CRIU through the experimental `docker checkpoint` subcommand, which is the closest thing Docker has to live container migration. It works for narrow use cases. For most teams, stateless redeploy is simpler and more reliable.

## Theory

### TL;DR

- CRIU dumps a process's memory, file descriptors, sockets, and namespaces to disk and restores them on demand.
- `docker checkpoint create` produces a checkpoint; `docker start --checkpoint` resumes from it.
- Requires `"experimental": true` in `daemon.json`, the `criu` binary on the host, and a kernel built with checkpoint/restore support.
- Both hosts need the same kernel version, the same CPU instruction set, the same Docker version, and identical filesystem state at the time of the snapshot.
- Network state is finicky: the container's IP and MAC address move with it, so the network must allow that.
- **The use case is narrow**: long-running computations that you cannot afford to restart from scratch (HPC, ML training, long simulations).
- **Most production patterns** (web apps, microservices) use **stateless redeploy** instead, where state lives outside the container (DB, object storage).

### How CRIU works

1. Freeze all processes in the target's PID namespace (using `ptrace`).
2. Read each process's:
   - Memory pages
   - Open files / socket states / pipes
   - Namespaces (PID, mnt, net, uts, ipc, user)
   - Threads, futexes, signals
3. Serialize all of the above to disk as protobuf-encoded image files.
4. Either kill the processes (the default) or, optionally, let them keep running.
5. On the target host, recreate the processes with the same PIDs (Linux allows requesting specific PIDs in a new namespace), restore memory pages, and reattach file descriptors.
6. Resume execution.

The crux is that the **kernel state** must match: file paths, mounted volumes, network state, `/etc/hosts`, devices. In many cases CRIU detects a mismatch and refuses to restore.

### Why this is rare in practice

- **Kernel-version sensitivity.** A process snapshotted on kernel 5.10 may fail to restore on 5.15 because internal kernel structures changed.
- **Hardware sensitivity.** A different CPU vendor or microarchitecture can break the restore (AVX availability, TSC behavior, randomness sources).
- **Network and filesystem state.** Open TCP connections, NFS handles, and special files are all easy to break.
- **Most workloads do not need it.** A stateless web server can restart in seconds; you do not need to migrate it live, you just deploy a new one.
- **Modern alternatives.** Kubernetes pod eviction and rescheduling, blue-green deploys, and rolling updates are all simpler than CRIU.
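
The kernel and hardware sensitivities listed above can be checked before attempting a migration. The following is a minimal sketch, not part of Docker or CRIU: the helper names `host_fingerprint` and `missing_on_target` and the `.fp` file names are hypothetical, and the CPU comparison assumes an x86-style `flags` line in `/proc/cpuinfo`. A clean result is only a hint; it does not guarantee that a restore will succeed.

```bash
# Hypothetical pre-flight helpers (illustrative only, not a Docker or CRIU feature).

# Print a small fingerprint of the local host: kernel release plus CPU flags,
# one entry per line so two hosts can be compared line by line.
host_fingerprint() {
  echo "kernel=$(uname -r)"
  # Assumes an x86-style "flags" line; other architectures print nothing here.
  grep -m1 '^flags' /proc/cpuinfo 2>/dev/null \
    | cut -d: -f2 | tr ' ' '\n' | sed '/^$/d' | sort -u | sed 's/^/cpuflag=/'
}

# Print every entry of the source fingerprint ($1) that has no exact match
# in the target fingerprint ($2). Empty output means nothing is missing.
missing_on_target() {
  grep -Fxv -f "$2" "$1" || true
}

# Usage sketch:
#   host_fingerprint > a.fp        # on host A
#   host_fingerprint > b.fp        # on host B, then copy b.fp to host A
#   missing_on_target a.fp b.fp    # a kernel= line or cpuflag= lines signal a mismatch
```

Comparing in one direction is deliberate: the target may have extra CPU features, but anything the source process could have been using must exist on the target.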
### When it actually helps

- **Scientific computing**: a 12-hour simulation has run for 8 hours and the host needs maintenance. Snapshot, migrate, resume. That beats restarting.
- **Long-running ML training jobs**: the same reasoning applies.
- **Hot patching of stateful in-memory services**: rare and advanced.
- **Container live migration in research/POC**: CRIU is the building block for projects like P.Haul (process haul) and other container live-migration prototypes.

## Examples

### Setup

The kernel must be built with checkpoint/restore support. Most modern distro kernels are. Check:

```bash
zgrep CONFIG_CHECKPOINT /proc/config.gz
# CONFIG_CHECKPOINT_RESTORE=y
# If /proc/config.gz is absent, try: grep CONFIG_CHECKPOINT /boot/config-$(uname -r)
```

Install CRIU:

```bash
sudo apt install -y criu    # Debian/Ubuntu
sudo dnf install -y criu    # Fedora/RHEL
sudo criu check
# Looks good.
```

Enable experimental in `/etc/docker/daemon.json`:

```json
{ "experimental": true }
```

```bash
sudo systemctl restart docker
docker version | grep Experimental
```

### Same-host snapshot/restore (the simplest case)

Start a long-running container:

```bash
# No --rm: the container must still exist after the checkpoint stops it
docker run -d --name counter \
  busybox sh -c 'i=0; while true; do echo $i; i=$((i+1)); sleep 1; done'

# Watch it count
docker logs -f counter
# 0
# 1
# 2
# ...
# 47
```

Checkpoint at 47:

```bash
docker checkpoint create counter cp1
# cp1
```

The container is now stopped (default behavior). The checkpoint files are at `/var/lib/docker/containers/<id>/checkpoints/cp1/`.

Restore:

```bash
docker start --checkpoint=cp1 counter
docker logs -f counter
# ...
# 47
# 48
# 49
```

The counter resumed at 48 instead of restarting at 0: the process state was preserved.
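
To look at what a checkpoint left on disk, the default path mentioned above can be built from the container ID. This is a small sketch: `checkpoint_path` is a hypothetical helper, while `docker inspect`, `docker checkpoint ls`, and `docker checkpoint rm` are existing Docker commands. Reading the directory usually needs root, and the layout is assumed to match the default location described above.

```bash
# Hypothetical helper: build the default on-disk location of a checkpoint
# from a full container ID ($1) and a checkpoint name ($2).
checkpoint_path() {
  printf '/var/lib/docker/containers/%s/checkpoints/%s\n' "$1" "$2"
}

# Usage sketch (requires the experimental daemon from the Setup section):
#   id=$(docker inspect --format '{{.Id}}' counter)   # full container ID
#   docker checkpoint ls counter                      # list checkpoints by name
#   sudo du -sh "$(checkpoint_path "$id" cp1)"        # size roughly tracks memory use
#   docker checkpoint rm counter cp1                  # delete it when no longer needed
```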
### Snapshot without stopping the container

```bash
docker checkpoint create --leave-running counter cp2
# The container is frozen briefly during the dump, then keeps running
```

### Cross-host migration

#### Step 1: image must be present on both hosts

```bash
# On host A
docker save myorg/app:1.0 | ssh hostb "docker load"
# Or push to a registry and pull on B
```

#### Step 2: checkpoint on host A

```bash
docker checkpoint create --checkpoint-dir=/var/checkpoints app cp1
```

The `--checkpoint-dir` option overrides the default location so you can grab the files easily.

#### Step 3: copy checkpoint and any volume data to host B

```bash
rsync -a --delete /var/checkpoints/cp1/ hostb:/var/checkpoints/cp1/
# Plus any bind-mounted directories, named volumes, etc.
```

#### Step 4: on host B, recreate container in stopped state

```bash
ssh hostb docker create --name app \
  -v /data:/data \
  -p 8080:8080 \
  myorg/app:1.0
# Note: same volume mounts, same ports, same image
```

#### Step 5: restore from checkpoint

```bash
# Run on host B
docker start --checkpoint=cp1 --checkpoint-dir=/var/checkpoints app
docker logs app
# Should resume from where host A left off
```

### What can go wrong

- **"Failed to restore: open files mismatch"**: a file was open on host A that does not exist on host B. Fix the bind mounts.
- **"Failed to restore: socket peer not found"**: an open TCP connection cannot be re-established. CRIU's `--tcp-established` mode can help with short-lived connections but is unreliable for long-lived ones.
- **"Unable to restore PID X"**: a PID is already taken in the PID namespace on host B. CRIU usually handles this in containers (private PID namespace), but edge cases exist.
- **Kernel/glibc/CPU mismatch**: surfaces as any of dozens of subtle errors. Consult the CRIU documentation.

### Where it actually shines (real example)

Long-running scientific simulation:

```bash
# Day 1: start a 24-hour CPU-only simulation
# (no --gpus: stock CRIU does not migrate GPU state, see Limitations)
docker run -d --name sim myorg/simulation:1.0 sim_run config.toml

# 18 hours in, the host needs maintenance. Cannot afford a restart.
# No --leave-running: the job must stop here before it resumes elsewhere
docker checkpoint create --checkpoint-dir=/scratch/cp1 sim cp1
rsync -a /scratch/cp1/ compute-host-2:/scratch/cp1/

ssh compute-host-2 docker create --name sim myorg/simulation:1.0 sim_run config.toml
ssh compute-host-2 docker start --checkpoint=cp1 --checkpoint-dir=/scratch/cp1 sim
# Simulation resumes on the other host, continues for 6 more hours
```

## Real-world usage

- **HPC clusters and research environments** (the original use case).
- **Some container runtimes** like Singularity (HPC-focused) integrate CRIU as a feature.
- **Kubernetes container checkpointing** (KEP-2008, "Forensic Container Checkpointing") uses CRIU under the hood. It was introduced as alpha in Kubernetes 1.25 and exposes a kubelet checkpoint API; it is not full pod live migration.
- **Docker Swarm and stock K8s production**: do not rely on it. Use deploy strategies (rolling, blue-green, canary).

### Limitations

1. **Experimental.** Docker has not graduated `checkpoint` to GA. APIs may change.
2. **No GPU state migration** in stock CRIU. Active research, not production.
3. **No support for some namespaces** in older CRIU versions.
4. **Not in Docker Desktop.** Docker Desktop's VM does not enable CRIU.
5. **Performance**: writing all of memory to disk takes seconds to minutes for large containers.

### Alternatives that solve the same problem better

| Need | Better tool |
|---|---|
| Move web server with no downtime | Blue-green deploy with stateless app |
| Move stateful service | Externalize state (DB, S3); redeploy stateless wrapper |
| Maintenance window on a node | Drain, reschedule (K8s, Swarm, Nomad) |
| Long-running computation | Periodic application-level checkpoints (write progress to disk every N minutes; resume from there) |

Application-level checkpoints are usually a better answer than CRIU because:

- They are portable across kernel/CPU/distro changes.
- They are smaller (only your data, not all of memory).
- They survive image upgrades.
- They are testable.

### Common mistakes

**Treating CRIU as production-grade live migration**

It is experimental in Docker.
For production, use rolling deploys.

**Not matching state across hosts**

Volume contents, mounted secrets, the host's `/etc/resolv.conf`: all of it must match. Easier said than done.

**Trying to migrate containers with active TCP connections**

CRIU's `--tcp-established` is fragile. Drain connections first, or accept resets.

**Skipping the kernel match**

Migrating from kernel 5.10 to 5.15 may work, or it may not. Test in staging.

### Follow-up questions

**Q:** Is `docker checkpoint` enabled by default?

**A:** No. You must set `"experimental": true` in `/etc/docker/daemon.json` and restart the daemon.

**Q:** What about Kubernetes live migration?

**A:** KEP-2008 (Forensic Container Checkpointing) adds a kubelet API for checkpointing containers and uses CRIU underneath. It does not provide end-to-end live migration, and building migration on top of it is not recommended for prod yet (as of late 2024).

**Q:** Can I checkpoint a container with a database inside?

**A:** Technically yes, but the on-disk DB files must be present on the target host. This is one reason DBs are normally on volumes, and the volume is replicated separately (DB replication).

**Q:** (Senior) Why is CRIU not the standard answer for moving stateful services?

**A:** Because "identical state on both hosts" is a fragile assumption. Modern stateful services (databases, message queues) are designed for replication: leader/follower, multi-master, distributed consensus. You move data via the application's replication protocol, not by snapshotting kernel state. CRIU sidesteps the application's data model and assumes it can recreate every byte of process state, which is far more brittle than mirroring data through Postgres replication or Kafka mirroring.

**Q:** (Senior) When is the right time to consider CRIU?

**A:** When (1) the workload is single-instance and cannot trivially restart, (2) progress is in process memory (not in a DB or files), (3) you control both hosts including kernel version, (4) the cost of reproducing the state from scratch exceeds the engineering cost of dealing with CRIU's quirks. HPC/research checks all four.
Production microservices check none.
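
The application-level checkpointing recommended under the alternatives can be sketched in a few lines of shell, reusing the counter idea from the same-host example. This is an illustration under stated assumptions, not a drop-in implementation: `STATE_FILE`, `load_state`, `save_state`, and `run_steps` are hypothetical names, and in a container the state file would live on a volume so that a redeployed container can pick it up.

```bash
# Hypothetical state location; override via the environment, e.g. a path on a volume.
STATE_FILE="${STATE_FILE:-/tmp/counter.state}"

# Load the last saved counter value, or start from 0 if no state exists yet.
load_state() {
  if [ -f "$STATE_FILE" ]; then cat "$STATE_FILE"; else echo 0; fi
}

# Save atomically: write to a temp file, then rename it over the old state,
# so a crash mid-write never leaves a half-written checkpoint.
save_state() {
  printf '%s\n' "$1" > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
}

# Resume from the saved value, advance $1 steps, checkpoint after every step,
# and print the final counter value.
run_steps() {
  i=$(load_state)
  n=0
  while [ "$n" -lt "$1" ]; do
    i=$((i + 1))
    n=$((n + 1))
    save_state "$i"
  done
  echo "$i"
}
```

Calling `run_steps 3` and then, in a fresh process, `run_steps 2` continues from the saved value rather than from zero. Only the counter is persisted, not process memory, which is why this approach survives kernel, CPU, and image changes.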