# What are Linux namespaces and cgroups in Docker?

**Linux namespaces and cgroups** are the two kernel mechanisms that make containers possible. Without them, you have processes; with them, you have containers. Docker is mostly a tool for configuring these features at scale.

## Theory

### TL;DR

- **Namespaces** answer "what can this process see?". Docker relies on seven types: PID, mount, network, IPC, UTS, user, cgroup (the kernel also has an eighth, time, since Linux 5.6).
- **cgroups** answer "what can this process use?". Limits CPU, memory, I/O, PIDs.
- Both are kernel features that pre-date Docker by years (LXC popularized them; Docker made them mainstream).
- Conceptually, a container is: `unshare()` to create namespaces, cgroup limits, `pivot_root` (or `chroot`) into a rootfs, then `exec()` of your binary.
- Docker uses **cgroups v2** (unified hierarchy) on hosts that boot with it, which most current distros do. Legacy v1 had a separate hierarchy per controller.

### Namespaces: the seven types Docker uses

| Namespace | What it isolates |
|---|---|
| **PID** | process IDs — container has its own PID 1 |
| **mount (mnt)** | mount points — container has its own filesystem view |
| **network (net)** | network interfaces, routing tables, sockets, ports |
| **IPC** | shared memory, semaphores, message queues |
| **UTS** | hostname and domain name |
| **user** | UIDs/GIDs (with remapping) |
| **cgroup** | cgroup root view (cgroups v2) |

Each namespace is a kernel resource. A process creates new ones with `unshare(2)` or `clone(2)` and joins existing ones with `setns(2)`. Docker creates them when starting a container.

### Verifying namespaces in a running container

```bash
# A container's namespaces (each has a unique inode)
$ docker run -it --rm alpine sh
/ # ls -la /proc/self/ns
lrwxrwxrwx ... cgroup -> 'cgroup:[4026532840]'
lrwxrwxrwx ... ipc    -> 'ipc:[4026532838]'
lrwxrwxrwx ... mnt    -> 'mnt:[4026532836]'
lrwxrwxrwx ... net    -> 'net:[4026532842]'
lrwxrwxrwx ... pid    -> 'pid:[4026532839]'
lrwxrwxrwx ... user   -> 'user:[4026531837]'
lrwxrwxrwx ... uts    -> 'uts:[4026532837]'

# Compare with the host (different inodes for everything except possibly user)
$ ls -la /proc/self/ns
```

Different inodes = different namespaces = isolated views.
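The comparison above can be scripted. This is a minimal sketch, assuming a Linux host with `/proc` mounted; `ns_inode` and `same_ns` are hypothetical helper names, not part of Docker or util-linux.

```bash
#!/bin/sh
# Extract the inode number from a namespace link target such as 'pid:[4026532839]'.
ns_inode() {
  v=${1#*\[}          # drop everything up to and including '['
  v=${v%\]}           # drop the trailing ']'
  printf '%s\n' "$v"
}

# Succeed when two processes share the given namespace type.
# Usage: same_ns <pid-a> <pid-b> <type>   (type: pid, mnt, net, ipc, uts, user, cgroup)
same_ns() {
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$(ns_inode "$a")" = "$(ns_inode "$b")" ]
}

# Only touch /proc when it exposes namespace links (Linux).
if [ -e /proc/self/ns/pid ]; then
  for link in /proc/self/ns/*; do
    printf '%-20s %s\n' "${link##*/}" "$(ns_inode "$(readlink "$link")")"
  done
fi
```

Running `same_ns $$ 1 pid` on the host usually succeeds, because your shell and PID 1 share the initial PID namespace. Reading another user's `/proc/<pid>/ns` links generally requires root.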

### cgroups — what they limit

In cgroups v2 (unified hierarchy, the modern norm), the controllers include:

```
cpu          - CPU time (cpu.max, cpu.weight)
memory       - RAM and swap usage
io           - block I/O bandwidth and IOPS
pids         - number of processes
rdma         - RDMA resources (HCA handles and objects)
hugetlb      - huge pages
```

```bash
# Inside a container with --memory=256m
$ docker run --rm --memory=256m alpine cat /sys/fs/cgroup/memory.max
268435456    # 256 * 1024 * 1024 bytes

$ docker run --rm --cpus=0.5 alpine cat /sys/fs/cgroup/cpu.max
50000 100000  # 50ms quota per 100ms period (= 0.5 CPU)
```

Docker translates `--memory`, `--cpus`, etc. into cgroup files in `/sys/fs/cgroup/...`.
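The arithmetic behind that translation can be sketched as follows. The helper names are hypothetical, and the 100000 µs period is Docker's default CFS period, assumed here rather than read from the daemon.

```bash
#!/bin/sh
# Convert a --cpus value into the "quota period" pair written to cpu.max.
# The period defaults to 100000 microseconds (100 ms).
cpus_to_cpu_max() {
  awk -v cpus="$1" -v period="${2:-100000}" \
    'BEGIN { printf "%d %d\n", cpus * period + 0.5, period }'
}

# Convert a --memory value (optional suffix b, k, m, g) into bytes for memory.max.
mem_to_bytes() {
  num=${1%[bkmgBKMG]}
  unit=${1#"$num"}
  case $unit in
    ''|b|B) echo "$num" ;;
    k|K)    echo $((num * 1024)) ;;
    m|M)    echo $((num * 1024 * 1024)) ;;
    g|G)    echo $((num * 1024 * 1024 * 1024)) ;;
  esac
}

cpus_to_cpu_max 0.5    # 50000 100000
mem_to_bytes 256m      # 268435456
```

The two printed values match the `cpu.max` and `memory.max` contents shown in the Docker output above.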

### How docker run uses both

```
docker run --memory=256m --cpus=1 myapp
     │
     ├── containerd → runc
     │       │
     │       ├── clone/unshare(CLONE_NEW{PID,NS,NET,IPC,UTS,CGROUP})
     │       │       → fresh namespaces (USER is added only with userns-remap or rootless mode)
     │       │
     │       ├── write cgroup files
     │       │       → memory.max=256MB, cpu.max=100000 100000
     │       │
     │       ├── pivot_root into image rootfs
     │       │
     │       └── exec(your-binary)
     │
     └── you see a 'container'
```

Namespaces give it private views; cgroups limit what it can do; pivot_root + image layers give it a custom filesystem; exec runs your binary as PID 1 in this little world.

### User namespace: the security frontier

```bash
# Default (no user namespace remapping): root inside = root on host (unless capability-restricted)
$ docker run --rm alpine id
uid=0(root) gid=0(root)
# Inside the container, you are root; the kernel knows.

# With userns-remap: root inside maps to a non-root UID on host
$ # /etc/docker/daemon.json: { "userns-remap": "default" }
$ docker run --rm alpine id
uid=0(root) gid=0(root)
# Inside, still root. But on the host, processes are some-non-root-UID.
```

User namespace remapping is one of the strongest container hardenings. A container escape no longer means root on the host.

### cgroups v1 vs v2

- **v1 (legacy):** separate hierarchies per controller (`/sys/fs/cgroup/memory`, `/sys/fs/cgroup/cpu`, etc.). Each process belongs to one cgroup per controller.
- **v2 (modern):** single unified hierarchy. Each cgroup controls all enabled resources. Simpler, more consistent.
- Most current Linux distros (Fedora 31+, Debian 11+, Ubuntu 21.10+) default to v2. Docker supports both, with v2 support arriving in Docker 20.10.

Check which your host uses:

```bash
stat -fc %T /sys/fs/cgroup
# 'cgroup2fs' = v2, 'tmpfs' = v1
```
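For scripts that need to branch on the result, the check can be wrapped in a small function. This is a sketch: `cgroup_version` is a hypothetical name, and it deliberately reports anything other than the two filesystem types above as `unknown`.

```bash
#!/bin/sh
# Map the filesystem type of /sys/fs/cgroup to a cgroup version label.
cgroup_version() {
  case $1 in
    cgroup2fs) echo v2 ;;
    tmpfs)     echo v1 ;;
    *)         echo unknown ;;
  esac
}

# Only probe the host when the cgroup mount point exists (Linux).
if [ -d /sys/fs/cgroup ]; then
  fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null) || fstype=
  echo "cgroup hierarchy: $(cgroup_version "$fstype")"
  # On v2, the root cgroup lists the available controllers in one file.
  if [ -r /sys/fs/cgroup/cgroup.controllers ]; then
    echo "controllers: $(cat /sys/fs/cgroup/cgroup.controllers)"
  fi
fi
```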

### Common mistakes

**Treating namespaces as security boundaries**

Namespaces hide things from a process; they do not stop a process with the right capabilities from breaking out. Combined with capabilities (default-drop CAP_SYS_ADMIN, etc.) and seccomp (syscall filtering), they form a defense — but namespaces alone are not impenetrable. Real isolation needs the full Docker security stack (or microVMs like Kata/Firecracker).
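One way to see the capability side of this is to decode the `CapEff` bitmask from `/proc/<pid>/status`. The sketch below uses a hypothetical `has_cap` helper. The mask `00000000a80425fb` is the value commonly seen for root in a default Docker container, so treat it as illustrative rather than guaranteed for your Docker version.

```bash
#!/bin/sh
# Succeed if capability number <bit> is set in a hex mask taken from the
# CapEff line of /proc/<pid>/status.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Capability numbers from linux/capability.h
CAP_SYS_CHROOT=18
CAP_SYS_ADMIN=21

mask=00000000a80425fb   # illustrative default-container mask (assumption)
if has_cap "$mask" "$CAP_SYS_ADMIN";  then echo "CAP_SYS_ADMIN: yes";  else echo "CAP_SYS_ADMIN: no";  fi
if has_cap "$mask" "$CAP_SYS_CHROOT"; then echo "CAP_SYS_CHROOT: yes"; else echo "CAP_SYS_CHROOT: no"; fi
# CAP_SYS_ADMIN: no
# CAP_SYS_CHROOT: yes

# Decode the current process's own effective set when /proc is available.
if [ -r /proc/self/status ]; then
  own=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
  [ -n "$own" ] && echo "own CapEff: $own"
fi
```

Inside a container, `grep CapEff /proc/1/status` gives the mask to feed into `has_cap`.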

**Forgetting that hitting the memory limit means an OOM kill**

```bash
docker run --memory=256m greedy-app
# Once usage passes 256 MB and the kernel cannot reclaim enough, it OOM-kills a process in the container.
```

An OOM-killed container exits with status 137 (128 + 9, where 9 is SIGKILL). Your supervisor or restart policy decides what to do next.
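The 137 is ordinary shell arithmetic, not something specific to Docker. This sketch reproduces it without a container by sending SIGKILL to a child shell; `describe_status` is a hypothetical helper.

```bash
#!/bin/sh
# Split a shell exit status into "exit N" or "signal N".
# Statuses above 128 conventionally mean 128 + signal number.
describe_status() {
  if [ "$1" -gt 128 ]; then
    echo "signal $(( $1 - 128 ))"
  else
    echo "exit $1"
  fi
}

# Simulate what the OOM killer does: deliver SIGKILL to a child process.
rc=0
sh -c 'kill -KILL $$' 2>/dev/null || rc=$?
echo "status $rc -> $(describe_status "$rc")"   # status 137 -> signal 9

# A 137 only proves SIGKILL, not OOM. To confirm, Docker records the cause:
#   docker inspect --format '{{.State.OOMKilled}}' <container>
```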

**Confusing user namespace with `--user`**

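- `--user 1000:1000` = run the process as UID/GID 1000 inside the container. Without remapping, that is also UID 1000 on the host.
- `userns-remap` = map container UIDs to a different, unprivileged range on the host. It is an independent knob: root inside stays root inside, but is not root on the host.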

**Assuming every container always has its own PID namespace**

Kubernetes pods can share one PID namespace between containers (`shareProcessNamespace: true`). In plain Docker, each `docker run` container gets its own PID namespace by default; sharing is opt-in via `--pid=container:<name|id>` or `--pid=host`. Different defaults, different mental model.

### Real-world usage

- **Every container, everywhere.** Namespaces and cgroups underlie every Docker and Podman container, Kubernetes pod, and Cloud Run task. Even AWS Lambda's Firecracker microVMs are confined on the host with them.
- **Hardened multi-tenant:** user namespace remapping + dropped capabilities + seccomp profile + read-only filesystem → strong-ish isolation.
- **Resource governance:** cgroup limits prevent one tenant from starving others on shared infrastructure.
- **Debugging container internals:** `lsns`, `nsenter`, `/proc/<pid>/ns/*` to inspect or join namespaces from outside.

### Follow-up questions

**Q:** Are namespaces or cgroups Linux-specific?

**A:** Yes. Both are Linux kernel features. Docker Desktop on Mac/Windows runs Linux containers inside a Linux VM for this reason.

**Q:** What is `unshare`?

**A:** A syscall (and CLI tool) that creates a new namespace and attaches the calling process. `unshare --pid --fork --mount-proc /bin/bash` gives you a shell in fresh PID + mount namespaces — the simplest "build a container by hand" demo.

**Q:** What is the difference between cgroups and ulimits?

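**A:** ulimits (`setrlimit`, often configured through PAM) are per-process: each child inherits its own copy of the limit, so ten children each get the full budget. cgroups account for a whole tree of processes together, so the limit covers the group's combined usage. Both can limit resources; cgroups are the kernel's modern, hierarchical answer.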

**Q:** Can I see what cgroup a host process is in?

**A:** Yes: `cat /proc/<pid>/cgroup`. For a Docker process, this points into `/sys/fs/cgroup/<docker-cgroup-path>`.
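A small sketch of turning that file into a directory path, assuming a cgroups v2 host where the entry has the form `0::<path>`. The helper name `cgroup_dir` and the sample path in the usage note are illustrative.

```bash
#!/bin/sh
# Read /proc/<pid>/cgroup content on stdin and print the matching v2
# directory under /sys/fs/cgroup. Returns 1 when no v2 entry is present.
cgroup_dir() {
  while IFS=: read -r id controllers path; do
    # The v2 entry has hierarchy ID 0 and an empty controller list.
    if [ "$id" = "0" ] && [ -z "$controllers" ]; then
      printf '/sys/fs/cgroup%s\n' "$path"
      return 0
    fi
  done
  return 1
}

# Resolve the current shell's own cgroup when /proc is available.
if [ -r /proc/self/cgroup ]; then
  cgroup_dir < /proc/self/cgroup || echo "no cgroup v2 entry found"
fi
```

Given a line such as `0::/system.slice/docker-abc.scope`, it prints `/sys/fs/cgroup/system.slice/docker-abc.scope`, where files like `memory.max` and `cpu.max` live.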

**Q:** (Senior) How would you build a minimal container by hand using just namespaces + cgroups?

**A:** `unshare --user --map-root-user --pid --net --mount --uts --ipc --fork chroot /path/to/rootfs /bin/sh` gives you a process with isolated namespaces and a custom rootfs: the bare bones of a container. `--map-root-user` maps your UID to root inside the new user namespace, which is what lets an unprivileged user call `chroot`; drop both user flags if you run it as root. Add cgroups manually by writing to `/sys/fs/cgroup/<your-cgroup>/...`. This is roughly what `runc` does for you, automated and OCI-spec-compliant. It is a useful exercise for demystifying containers.

## Examples

### Demonstrate namespace isolation

```bash
# Host
$ ps -ef | wc -l
312

# Inside container
$ docker run --rm alpine ps -ef | wc -l
2
# Two lines: the ps header plus ps itself, which runs as PID 1 in the container
```

Same kernel, two views — that is the PID namespace in action.
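The two views are also visible for a single process. Since Linux 4.1, the `NSpid` line in `/proc/<pid>/status` lists that process's PID in each nested PID namespace, outermost first. A minimal sketch, with `innermost_pid` as a hypothetical helper:

```bash
#!/bin/sh
# Read /proc/<pid>/status content on stdin and print the PID as seen from
# the innermost PID namespace (the last field of the NSpid line).
innermost_pid() {
  awk '/^NSpid:/ { print $NF }'
}

# For the current process on the host, this is the same PID the host sees.
if [ -r /proc/self/status ]; then
  echo "innermost PID of this awk process: $(innermost_pid < /proc/self/status)"
fi
```

Run from the host against a container's main process, a status line like `NSpid: 48213 1` (sample values) shows host PID 48213 and container PID 1 for the same process.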

### Demonstrate cgroups

```bash
# A memory-greedy program
$ docker run --rm --memory=64m alpine sh -c 'dd if=/dev/zero of=/dev/null bs=1G count=1'
Killed
# Container OOM-killed at the 64MB limit.

$ docker run --rm --cpus=0.1 alpine sh -c 'time -- timeout 5 sh -c "yes > /dev/null"'
real    0m 5.00s
user    0m 0.50s
sys     0m 0.00s
# user time = 0.5s in 5s wall = 10% CPU = the cap
```

The kernel enforces, Docker just sets the parameters.
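On cgroups v2 the enforcement is also visible in `cpu.stat`, which counts how many scheduling periods the group was throttled in. The sketch below summarises those counters; `throttle_summary` is a hypothetical helper and the numbers in the usage note are made-up sample input, not measured output.

```bash
#!/bin/sh
# Summarise CPU throttling from cgroup v2 cpu.stat content read on stdin.
# Uses the nr_periods, nr_throttled and throttled_usec fields.
throttle_summary() {
  awk '
    $1 == "nr_periods"     { periods = $2 }
    $1 == "nr_throttled"   { throttled = $2 }
    $1 == "throttled_usec" { usec = $2 }
    END {
      pct = (periods > 0) ? 100 * throttled / periods : 0
      printf "throttled %d/%d periods (%.0f%%), %.2fs waiting\n",
             throttled, periods, pct, usec / 1e6
    }'
}

# Summarise the current cgroup when the file is readable (cgroups v2 only).
if [ -r /sys/fs/cgroup/cpu.stat ]; then
  throttle_summary < /sys/fs/cgroup/cpu.stat
fi
```

With sample input of `nr_periods 50`, `nr_throttled 45` and `throttled_usec 4050000`, it prints `throttled 45/50 periods (90%), 4.05s waiting`. Inside a CPU-capped container you could feed it `cat /sys/fs/cgroup/cpu.stat` after the workload finishes.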

### Mapping --user vs userns-remap

```bash
# --user: changes UID inside (no remap)
$ docker run --rm --user 1000:1000 alpine id
uid=1000 gid=1000
# On the host the process is also UID 1000 (no UID remapping).

# With userns-remap (daemon-level config):
# /etc/docker/daemon.json: { "userns-remap": "default" }
$ docker run --rm alpine id
uid=0(root) gid=0(root)
# Inside is root.
$ ps -ef | grep <container-pid>
UID 165536 ...
# But on the host, the same process is UID 165536 (offset).
```

The second is much stronger isolation. Highly recommended for multi-tenant clusters.
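The offset in that example comes from the remap user's entry in `/etc/subuid`, which has the form `name:start:count`. The sketch below shows the mapping arithmetic; `map_uid` is a hypothetical helper, and the `dockremap:165536:65536` entry is an assumed example, since the actual range differs between hosts.

```bash
#!/bin/sh
# Map a container UID to its host UID using one /etc/subuid entry.
# Usage: map_uid <name:start:count> <container-uid>
map_uid() {
  start=${1#*:}; start=${start%%:*}   # middle field: first host UID of the range
  count=${1##*:}                      # last field: number of UIDs in the range
  if [ "$2" -ge "$count" ]; then
    echo "uid $2 is outside the mapped range" >&2
    return 1
  fi
  echo $(( start + $2 ))
}

entry="dockremap:165536:65536"   # assumed example entry
map_uid "$entry" 0      # 165536  (container root)
map_uid "$entry" 1000   # 166536
```

Container root lands on the first UID of the range, which has no special privileges on the host.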
