# How to live migrate a Docker container between hosts using CRIU?


**CRIU** (Checkpoint/Restore In Userspace) is a Linux tool that freezes a running process tree, writes its state to a set of image files, and restores it later on the same host or another one, ideally with the program none the wiser. Docker integrates CRIU through the experimental `docker checkpoint` subcommand, which is the closest thing Docker has to live container migration. It works for narrow use cases; for most teams, a stateless redeploy is simpler and more reliable.

## Theory

### TL;DR

- CRIU dumps a process tree's memory, file descriptors, sockets, and namespaces to disk and restores them on demand.
- `docker checkpoint create` produces a checkpoint; `docker start --checkpoint` resumes from it.
- Requires `"experimental": true` in `daemon.json`, the `criu` binary on the host, and a kernel built with `CONFIG_CHECKPOINT_RESTORE`.
- Both hosts should have the same (or a compatible) kernel version, the same CPU instruction set, the same Docker version, and identical filesystem state at the time of the snapshot.
- Network state is finicky: the container's IP and MAC address move with it, so the target network must accept them (same subnet, no conflicting address).
- **Use case is narrow**: long-running computations where you can't afford to restart from scratch (HPC, ML training, long simulations).
- **Most production patterns** (web apps, microservices) use **stateless redeploy** instead, where state lives outside the container (DB, object storage).

### How CRIU works

1. Freeze every process in the target tree (via `ptrace`, or the cgroup freezer when the container has one).
2. Read each process's:
   - Memory pages
   - Open files / socket states / pipes
   - Namespaces (PID, mnt, net, uts, ipc, user)
   - Threads, futexes, signals
3. Serialize all of the above to disk as protobuf-encoded files.
4. Kill the original processes (the default) or, optionally, detach and let them keep running.
5. On the target host, recreate the processes with the same PIDs (inside a fresh PID namespace the kernel lets CRIU choose them), restore the memory pages, and reopen the file descriptors.
6. Resume execution.

The crux is that the **environment the processes see** must match: file paths, mounted volumes, network state, `/etc/hosts`, devices, and the kernel features the dump relied on. CRIU detects many mismatches and refuses to restore.
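Outside Docker, the same cycle can be driven with the `criu` CLI directly. The sketch below only prints the commands for one dump and one restore instead of running them, since they need root and a real target. `criu_plan`, the PID, and the image directory are illustrative names, and `--shell-job` is only needed when the process is attached to a terminal.

```bash
#!/bin/sh
# Print (do not run) the criu commands for one dump/restore cycle.
# criu_plan is a hypothetical helper; usage: criu_plan <pid> <images-dir> [keep]
criu_plan() (
    pid=$1
    dir=$2
    dump="criu dump -t $pid -D $dir --shell-job -o dump.log"
    # By default criu kills the tree after a successful dump;
    # --leave-running lets it continue instead (step 4 above).
    if [ "${3:-}" = "keep" ]; then
        dump="$dump --leave-running"
    fi
    echo "mkdir -p $dir"
    echo "sudo $dump"
    # Restore reads the image files back and resumes the tree (steps 5-6).
    echo "sudo criu restore -D $dir --shell-job -o restore.log"
)

criu_plan 1234 /tmp/criu-images keep
```

Printing the plan first makes it easy to review the flags before anything is frozen; `docker checkpoint` issues comparable calls through runc on your behalf.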

### Why this is rare in practice

- **Kernel-version sensitivity.** A process snapshotted on kernel 5.10 may fail to restore on 5.15 (or the reverse) because the restore depends on kernel interfaces and features that differ between versions.
- **Hardware sensitivity.** A different CPU vendor or microarchitecture can break the restore (AVX availability, TSC behavior, randomness sources).
- **Network and filesystem state.** Open TCP connections, NFS handles, and special files are all easy to break.
- **Most workloads do not need it.** A stateless web server can restart in seconds; you do not need to migrate it live, you just deploy a new one.
- **Modern alternatives.** Kubernetes pod eviction plus rescheduling, blue-green deploys, and rolling updates are all simpler than CRIU.
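A quick way to see how far apart two hosts are is to collect the facts listed above on each and compare them. This is a minimal sketch, not a complete compatibility test: `host_facts` and `diff_facts` are hypothetical helper names, the CPU check assumes the x86 `flags` line in `/proc/cpuinfo`, and `diff_facts` only reports keys found in the second file.

```bash
#!/bin/sh
# Collect the facts that most often break a cross-host restore.
# host_facts and diff_facts are hypothetical helper names.
host_facts() {
    echo "kernel=$(uname -r)"
    echo "arch=$(uname -m)"
    # Sorted CPU feature flags, reduced to one checksum for easy comparison.
    echo "cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null \
        | cut -d: -f2 | tr ' ' '\n' | sort | cksum | cut -d' ' -f1)"
    echo "criu=$(criu --version 2>/dev/null | head -n1)"
    echo "docker=$(docker version -f '{{.Server.Version}}' 2>/dev/null)"
}

# Print every key whose line in file 2 differs from file 1.
diff_facts() {
    awk -F= 'NR == FNR { seen[$1] = $0; next }
             seen[$1] != $0 { print $1 }' "$1" "$2"
}
```

Run `host_facts > hosta.facts` on one host and `host_facts > hostb.facts` on the other, then `diff_facts hosta.facts hostb.facts`. Empty output means the checked facts match, which is necessary for a restore but does not guarantee one.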

### When it actually helps

- **Scientific computing**: a 12-hour simulation that has run for 8 hours; the host needs maintenance. Snapshot, migrate, resume. Beats restarting.
- **Long-running ML training jobs**: similar.
- **Hot patching of stateful in-memory services**: rare, advanced.
- **Container live-migration in research/POC**: CRIU is the building block for projects like P.Haul (process haul) or container live-migration prototypes.

## Examples

### Setup

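The kernel must be built with `CONFIG_CHECKPOINT_RESTORE`. Most modern distro kernels are. Check (the config is exposed as `/proc/config.gz` on some distros and as a file in `/boot` on others):

```bash
zgrep CONFIG_CHECKPOINT_RESTORE /proc/config.gz 2>/dev/null \
    || grep CONFIG_CHECKPOINT_RESTORE "/boot/config-$(uname -r)"
# CONFIG_CHECKPOINT_RESTORE=y
```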

Install CRIU (pick the line for your distro), then let it verify the kernel as root:

```bash
sudo apt install -y criu        # Debian/Ubuntu
sudo dnf install -y criu        # Fedora/RHEL
sudo criu check
# Looks good.
```

Enable experimental in `/etc/docker/daemon.json`:

```json
{
  "experimental": true
}
```

```bash
sudo systemctl restart docker
docker version | grep Experimental
```
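Both prerequisites can be checked in one step before attempting a checkpoint. The helper below is a small sketch; `check_setup` is a hypothetical name, and it relies on `docker version -f '{{.Server.Experimental}}'` reporting the daemon's experimental flag.

```bash
#!/bin/sh
# Report the first missing prerequisite, or "ok" when both are in place.
# check_setup is a hypothetical helper name.
check_setup() {
    if ! command -v criu >/dev/null 2>&1; then
        echo "criu: not found on PATH"
        return 1
    fi
    # The daemon reports whether experimental features are enabled.
    if [ "$(docker version -f '{{.Server.Experimental}}' 2>/dev/null)" != "true" ]; then
        echo "docker: experimental features are off"
        return 1
    fi
    echo "ok"
}
```

Calling `check_setup` on both hosts before a migration catches the two most common setup gaps early.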

### Same-host snapshot/restore (the simplest case)

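Start a long-running container (without `--rm`, which would delete the container and its checkpoint as soon as the checkpoint stops it):

```bash
docker run -d --name counter \
    busybox sh -c 'i=0; while true; do echo $i; i=$((i+1)); sleep 1; done'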

# Watch it count
docker logs -f counter
# 0
# 1
# 2
# ...
# 47
```

Checkpoint at 47:

```bash
docker checkpoint create counter cp1
# cp1
```

The container is now stopped (default behavior). The checkpoint files are at `/var/lib/docker/containers/<id>/checkpoints/cp1/`.

Restore:

```bash
docker start --checkpoint=cp1 counter
docker logs -f counter
# 48
# 49
# ...
```

The counter resumed from 48, not 0. State preserved.

### Snapshot without stopping the container

```bash
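docker checkpoint create --leave-running counter cp2
# The container is frozen briefly while the checkpoint is written, then keeps running.
# A new name (cp2) is used because cp1 already exists for this container.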
```
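For periodic safety snapshots, each checkpoint needs a unique name. The sketch below wraps the commands in small helpers; `snapshot`, `list_snapshots`, and `drop_snapshot` are hypothetical names, and the timestamp format is an arbitrary choice.

```bash
#!/bin/sh
# Take a timestamped checkpoint without stopping the container.
# snapshot is a hypothetical helper; it prints the checkpoint name it created.
snapshot() (
    container=$1
    name="cp-$(date +%Y%m%d-%H%M%S)"
    docker checkpoint create --leave-running "$container" "$name" >/dev/null || exit 1
    echo "$name"
)

# List a container's checkpoints, or delete one by name.
list_snapshots() { docker checkpoint ls "$1"; }
drop_snapshot()  { docker checkpoint rm "$1" "$2"; }
```

`snapshot counter` could be run from cron or a loop; checkpoints hold a full memory image each, so pruning old ones with `drop_snapshot` keeps disk usage in check.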

### Cross-host migration

#### Step 1 — image must be present on both hosts

```bash
# On host A
docker save myorg/app:1.0 | ssh hostb "docker load"
# Or push to a registry and pull on B
```

#### Step 2 — checkpoint on host A

```bash
docker checkpoint create --checkpoint-dir=/var/checkpoints app cp1
```

The `--checkpoint-dir` flag overrides the default location so you can grab the files easily; here they land in `/var/checkpoints/cp1/`.

#### Step 3 — copy checkpoint and any volume data to host B

```bash
rsync -a --delete /var/checkpoints/cp1/ hostb:/var/checkpoints/cp1/
# Plus any bind-mounted directories, named volumes, etc.
```
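A truncated or partially copied checkpoint tends to fail late and confusingly, so it can be worth comparing the two copies before restoring. The sketch below computes one digest per directory tree; `dir_digest` is a hypothetical helper and it assumes GNU coreutils' `sha256sum` is installed. The checkpoint files are written by the Docker daemon, so reading them will likely require root.

```bash
#!/bin/sh
# Print one digest for a directory tree: hash every file, then hash the list.
# dir_digest is a hypothetical helper; it assumes GNU coreutils' sha256sum.
dir_digest() (
    cd "$1" || exit 1
    # Sort the file list so both hosts hash the files in the same order.
    find . -type f | LC_ALL=C sort | while IFS= read -r f; do
        sha256sum "$f"
    done | sha256sum | cut -d' ' -f1
)
```

Running `dir_digest /var/checkpoints/cp1` on host A and again on host B should print the same value; any difference means the copy is incomplete or was modified.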

#### Step 4 — on host B, recreate container in stopped state

```bash
ssh hostb
docker create --name app \
    -v /data:/data \
    -p 8080:8080 \
    myorg/app:1.0
# Note: same volume mounts, same ports, same image
```

#### Step 5 — restore from checkpoint

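Many Docker releases reject `--checkpoint-dir` on `docker start` with `custom checkpointdir is not supported`. If yours does, copy the checkpoint into the new container's default checkpoint directory and start without that flag:

```bash
docker start --checkpoint=cp1 --checkpoint-dir=/var/checkpoints app
# If that is rejected, copy the checkpoint into the default location instead:
sudo cp -r /var/checkpoints/cp1 "/var/lib/docker/containers/$(docker inspect -f '{{.Id}}' app)/checkpoints/"
docker start --checkpoint=cp1 app
docker logs app
# Should resume from where host A left off
```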
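`docker logs` shows output but not whether the restored process is still alive a few seconds later. A small poll on the container state makes that explicit; `wait_running` is a hypothetical helper and the default of 10 tries is arbitrary.

```bash
#!/bin/sh
# Poll until the container reports Running=true, or give up after N tries.
# wait_running is a hypothetical helper; usage: wait_running <name> [tries]
wait_running() (
    name=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        state=$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)
        if [ "$state" = "true" ]; then
            exit 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    exit 1
)
```

For example, `wait_running app 5 || echo "restore did not stick"` flags a container that restored and then exited.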

### What can go wrong

- **A file that was open on host A is missing on host B**: the restore fails when CRIU tries to reopen it. Fix the bind mounts and volume contents.
- **An established TCP connection cannot be restored**: CRIU only handles these when its `--tcp-established` option is set, and even then the restore works only if the container keeps its IP address and the peer has not reset or timed out the connection.
- **A PID is already taken in the target PID namespace**: containers normally get a private PID namespace, so this is rare, but edge cases exist.
- **Kernel/glibc/CPU mismatch**: shows up as one of dozens of subtle errors. Read the CRIU log and the CRIU docs.
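When a dump or restore fails, the CRIU log is usually more specific than Docker's error. CRIU marks failures with `Error (` followed by the source location, so a simple filter pulls them out. `criu_errors` is a hypothetical helper, and the log path is left as an argument because where the log ends up depends on the Docker, containerd, and runc versions in use.

```bash
#!/bin/sh
# Print the error lines from a CRIU log, prefixed with their line numbers.
# criu_errors is a hypothetical helper; pass it the path to a dump or restore log.
criu_errors() {
    grep -n 'Error (' "$1"
}
```

The first error in the log is normally the root cause; later ones are often just the cleanup failing.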

### Where it actually shines (illustrative example)

A long-running, CPU-bound scientific simulation (GPU state is outside what stock CRIU restores, so `--gpus` is left out):

```bash
# Day 1, on host A: start a 24-hour simulation
docker run -d --name sim myorg/simulation:1.0 sim_run config.toml

# 18 hours in, host A needs maintenance. Cannot afford a restart.
# Without --leave-running the container stops, so only one copy ever runs.
docker checkpoint create --checkpoint-dir=/scratch/cp sim cp1
rsync -a /scratch/cp/cp1/ hostb:/scratch/cp/cp1/

# On host B: same image and command, then restore
# (if --checkpoint-dir is rejected on start, copy cp1 into the
# container's default checkpoints directory first)
docker create --name sim myorg/simulation:1.0 sim_run config.toml
docker start --checkpoint=cp1 --checkpoint-dir=/scratch/cp sim
# Simulation resumes on host B and runs for the remaining 6 hours
```

## Real-world usage

- **HPC clusters and research environments** (the original use case).
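- **Other container runtimes**: Podman, runc, and LXC expose CRIU-based checkpoint/restore directly.
- **Kubernetes** has no built-in pod live migration. KEP-2008 (Forensic Container Checkpointing) adds a kubelet API that checkpoints a single container with CRIU; it went alpha in 1.25 and beta in 1.30.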
- **Docker Swarm and stock K8s production**: do not rely on it. Use deploy strategies (rolling, blue-green, canary).

### Limitations

1. **Experimental.** Docker has not graduated `checkpoint` to GA. APIs may change.
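2. **No GPU state migration** by default. CRIU has vendor plugins for some GPUs, but do not expect them to work through `docker checkpoint`; this is still active research, not production.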
3. **No support for some namespaces** in older versions.
4. **Not in Docker Desktop.** Docker Desktop's VM does not enable CRIU.
5. **Performance**: writing all of memory to disk takes seconds-to-minutes for large containers.

### Alternatives that solve the same problem better

| Need | Better tool |
|---|---|
| Move web server with no downtime | Blue-green deploy with stateless app |
| Move stateful service | Externalize state (DB, S3); redeploy stateless wrapper |
| Maintenance window on a node | Drain, reschedule (K8s, Swarm, Nomad) |
| Long-running computation | Periodic application-level checkpoints (write progress to disk every N minutes; resume from there) |

Application-level checkpoints are usually a better answer than CRIU because:
- They are portable across kernel/CPU/distro changes.
- They are smaller (only your data, not all of memory).
- They survive image upgrades.
- They are testable.
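As a minimal sketch of the idea, here is the counter from the earlier example rewritten to save its own progress. `run_counter` and the state-file path are hypothetical; a real job would save whatever it needs to resume, not a single integer.

```bash
#!/bin/sh
# Counter that saves its progress after every step and resumes from the
# saved value. run_counter is a hypothetical helper.
# Usage: run_counter <state-file> <steps>
run_counter() (
    state=$1
    steps=$2
    i=0
    # Resume from the last saved value, if there is one.
    if [ -f "$state" ]; then
        i=$(cat "$state")
    fi
    while [ "$steps" -gt 0 ]; do
        echo "$i"
        i=$((i + 1))
        # Write to a temp file, then rename, so a crash never leaves
        # a half-written state file behind.
        echo "$i" > "$state.tmp" && mv "$state.tmp" "$state"
        steps=$((steps - 1))
    done
)
```

Running `run_counter /data/counter.state 5` twice prints 0 to 4 and then 5 to 9: the second run picks up where the first stopped, on any host that can read the state file, with no kernel or CPU matching required.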

### Common mistakes

**Treating CRIU as production-grade live migration**

It is experimental in Docker. For production, use rolling deploys.

**Not matching state across hosts**

Volume contents, mounted secrets, host's `/etc/resolv.conf` — all of it must match. Easier said than done.

**Trying to migrate containers with active TCP connections**

CRIU's `--tcp-established` is fragile. Drain connections first, or accept resets.
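To confirm that connections have actually drained before checkpointing, you can count established sockets from the container's own `/proc/net/tcp`. This is a sketch under a few assumptions: `count_established` is a hypothetical helper, the image must contain `cat`, and the check relies on the kernel's table format, where the fourth field is the socket state in hex and `01` means ESTABLISHED.

```bash
#!/bin/sh
# Count ESTABLISHED TCP sockets in /proc/net/tcp-style input read from stdin.
# Field 4 ("st") is the socket state in hex; 01 means ESTABLISHED.
# count_established is a hypothetical helper.
count_established() {
    awk '$4 == "01" { n++ } END { print n + 0 }'
}
```

For example, `docker exec app cat /proc/net/tcp /proc/net/tcp6 | count_established` should print `0` once clients have disconnected; listening sockets (state `0A`) are not counted.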

**Skipping kernel match**

Migrating from kernel 5.10 to 5.15 may work, may not. Test in staging.

### Follow-up questions

**Q:** Is `docker checkpoint` enabled by default?

**A:** No. You must enable `"experimental": true` in `/etc/docker/daemon.json` and restart the daemon.

**Q:** What about Kubernetes live migration?

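**A:** Kubernetes has no built-in pod live migration. KEP-2008 (Forensic Container Checkpointing) adds a kubelet API that checkpoints a single container using CRIU; it went alpha in 1.25 and beta in 1.30. Restoring the checkpoint elsewhere is a separate, manual step, so it is not recommended as a production migration path yet (as of late 2024).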

**Q:** Can I checkpoint a container with a database inside?

**A:** Technically yes, but the on-disk DB files must be present on the target host. This is one reason DBs are normally on volumes, and the volume is replicated separately (DB replication).

**Q:** (Senior) Why is CRIU not the standard answer for moving stateful services?

**A:** Because the universe of "identical state on both hosts" is fragile. Modern stateful services (databases, message queues) are designed for replication: leader/follower, multi-master, distributed consensus. You move data via the application's replication protocol, not by snapshotting kernel state. CRIU sidesteps the application's data model and assumes it can recreate every byte of process state — far more brittle than mirroring data through Postgres replication or Kafka mirroring.

**Q:** (Senior) When is the right time to consider CRIU?

**A:** When (1) the workload is single-instance and cannot trivially restart, (2) progress is in process memory (not in DB or files), (3) you control both hosts including kernel version, (4) the cost of reproducing the state from scratch exceeds the engineering cost of dealing with CRIU's quirks. HPC/research checks all four. Production microservices check none.
