# What are Linux namespaces and cgroups in Docker?

**Linux namespaces** isolate what a process can see (PIDs, mount points, network, etc.). **cgroups** limit what a process can use (CPU, memory, I/O). Together they make containers possible: a container is a process with namespaces and cgroups applied.

```bash
# Inside a container, /proc/self/ns lists its namespaces; cgroup limits show up under /sys/fs/cgroup
docker run --rm alpine ls -l /proc/self/ns
docker run --rm --memory=256m alpine cat /sys/fs/cgroup/memory.max
```

**Key:** namespaces = isolation (visibility); cgroups = resource control (consumption). Docker is largely a wrapper around configuring these kernel features for processes.

**Linux namespaces and cgroups** are the two kernel mechanisms that make containers possible. Without them, you have processes; with them, you have containers. Docker is mostly a tool for configuring these features at scale.

## Theory

### TL;DR

- **Namespaces** answer "what can this process see?". Seven types matter for Docker: PID, mount, network, IPC, UTS, user, cgroup. (Linux 5.6 added an eighth, time, which is not covered here.)
- **cgroups** answer "what can this process use?". They limit CPU, memory, I/O, and the number of processes.
- Both are kernel features that pre-date Docker by years (LXC popularized them; Docker made them mainstream).
- A container is conceptually: `unshare()` to create namespaces + cgroup configuration + `chroot` (or `pivot_root`) into a rootfs + `exec()` of your binary.
- On hosts that boot with **cgroups v2** (unified hierarchy, the default on most current distros), Docker uses v2. Legacy v1 had separate hierarchies per controller.
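
To make the "what can this process see?" bullet concrete, here is a minimal sketch that compares namespace identifiers read from `/proc/<pid>/ns`. The helper names `ns_id` and `same_ns` are hypothetical (they are not part of Docker or util-linux), and the sketch assumes a Linux host with `/proc` mounted; inspecting another user's process may require root.

```bash
# Hypothetical helpers, not part of Docker or util-linux.

# Print the namespace identifier (for example "pid:[4026531836]") of a PID.
ns_id() {
  local pid="$1" type="$2"
  readlink "/proc/${pid}/ns/${type}"
}

# Exit 0 if both PIDs are in the same namespace of the given type,
# 1 if they differ, 2 if either identifier cannot be read.
same_ns() {
  local a b
  a="$(ns_id "$1" "$3")" || return 2
  b="$(ns_id "$2" "$3")" || return 2
  [ "$a" = "$b" ]
}

# Usage: a shell trivially shares every namespace with itself.
if same_ns "$$" "$$" pid; then
  echo "same pid namespace"
fi
```

Comparing a host shell's PID with the host-side PID of a containerized process should report a difference for most namespace types, which is the isolation the bullets describe.
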
### Namespaces: the seven types

| Namespace | What it isolates |
|---|---|
| **PID** | process IDs (the container has its own PID 1) |
| **mount (mnt)** | mount points (the container has its own filesystem view) |
| **network (net)** | network interfaces, routing tables, sockets, ports |
| **IPC** | shared memory, semaphores, message queues |
| **UTS** | hostname and domain name |
| **user** | UIDs/GIDs (with remapping) |
| **cgroup** | cgroup root view (cgroups v2) |

Each namespace is a kernel resource. A process creates new ones with `unshare(2)` or `clone(2)` and joins existing ones with `setns(2)`. Docker creates them when starting a container.

### Verifying namespaces in a running container

```bash
# A container's namespaces (each has a unique inode)
$ docker run -it --rm alpine sh
/ # ls -la /proc/self/ns
lrwxrwxrwx ... cgroup -> 'cgroup:[4026532840]'
lrwxrwxrwx ... ipc -> 'ipc:[4026532838]'
lrwxrwxrwx ... mnt -> 'mnt:[4026532836]'
lrwxrwxrwx ... net -> 'net:[4026532842]'
lrwxrwxrwx ... pid -> 'pid:[4026532839]'
lrwxrwxrwx ... user -> 'user:[4026531837]'
lrwxrwxrwx ... uts -> 'uts:[4026532837]'

# Compare with the host (different inodes for everything except possibly user)
$ ls -la /proc/self/ns
```

Different inodes = different namespaces = isolated views.

### cgroups: what they limit

In cgroups v2 (unified hierarchy, the modern norm), the controllers include:

```
cpu      CPU time (cpu.max, cpu.weight)
memory   RAM and swap usage
io       block I/O bandwidth and IOPS
pids     number of processes
rdma     RDMA resources
hugetlb  huge pages
```

```bash
# Inside a container with --memory=256m
$ docker run --rm --memory=256m alpine cat /sys/fs/cgroup/memory.max
268435456
# 256 * 1024 * 1024 bytes

$ docker run --rm --cpus=0.5 alpine cat /sys/fs/cgroup/cpu.max
50000 100000
# 50ms quota per 100ms period (= 0.5 CPU)
```

Docker translates `--memory`, `--cpus`, etc. into cgroup files in `/sys/fs/cgroup/...`.
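
The values in those cgroup files follow simple arithmetic: CPU count = quota / period, and a memory flag of `<n>m` becomes n × 1024 × 1024 bytes. The sketch below shows that conversion with two hypothetical helpers (`cpu_max_to_cpus` and `mib_to_bytes` are illustrative names, not Docker commands). It assumes the cgroups v2 `cpu.max` format of `<quota> <period>`, or `max <period>` when no limit is set.

```bash
# Hypothetical helpers for reading cgroups v2 limit values; not Docker commands.

# Convert a cpu.max line ("<quota> <period>" or "max <period>") to a CPU count.
cpu_max_to_cpus() {
  echo "$1" | awk '{
    if ($1 == "max") { print "unlimited" }
    else             { printf "%.2f\n", $1 / $2 }
  }'
}

# Convert mebibytes to bytes, matching what memory.max reports for --memory=<n>m.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

cpu_max_to_cpus "50000 100000"   # prints 0.50
cpu_max_to_cpus "max 100000"     # prints unlimited
mib_to_bytes 256                 # prints 268435456
```

On a cgroups v2 host, something like `cpu_max_to_cpus "$(cat /sys/fs/cgroup/cpu.max)"` run inside a container should report the value passed to `--cpus`.
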
### How docker run uses both

```
docker run --memory=256m --cpus=1 myapp
  │
  ├── containerd → runc
  │     │
  │     ├── unshare(CLONE_NEW{PID,NS,NET,IPC,UTS,CGROUP})
  │     │     → fresh namespaces (plus USER with userns-remap or rootless mode)
  │     │
  │     ├── write cgroup files
  │     │     → memory.max=268435456, cpu.max=100000 100000
  │     │
  │     ├── pivot_root into image rootfs
  │     │
  │     └── exec(your-binary)
  │
  └── you see a 'container'
```

Namespaces give it private views; cgroups limit what it can consume; `pivot_root` plus image layers give it a custom filesystem; `exec` runs your binary as PID 1 in this little world.

### User namespace: the security frontier

```bash
# Default (no user namespace remapping): root inside is UID 0 on the host too,
# restricted only by dropped capabilities, seccomp, and similar layers
$ docker run --rm alpine id
uid=0(root) gid=0(root)
# Inside the container, you are root; the kernel knows.

# With userns-remap: root inside maps to a non-root UID on the host
# /etc/docker/daemon.json: { "userns-remap": "default" }
$ docker run --rm alpine id
uid=0(root) gid=0(root)
# Inside, still root. But on the host, the processes run as an unprivileged UID.
```

User namespace remapping is one of the strongest container hardenings. A container escape no longer means root on the host.

### cgroups v1 vs v2

- **v1 (legacy):** separate hierarchies per controller (`/sys/fs/cgroup/memory`, `/sys/fs/cgroup/cpu`, etc.). Each process belongs to one cgroup per hierarchy.
- **v2 (modern):** single unified hierarchy. Each cgroup controls all enabled resources. Simpler, more consistent.
- Most modern Linux distros (Fedora 31+, Debian 11+, Ubuntu 21.10+) default to v2. Docker handles both.

Check which your host uses:

```bash
stat -fc %T /sys/fs/cgroup
# 'cgroup2fs' = v2, 'tmpfs' = v1
```

### Common mistakes

**Treating namespaces as security boundaries**

Namespaces hide things from a process; they do not stop a process with the right capabilities from breaking out. Combined with capabilities (`CAP_SYS_ADMIN` and others are dropped by default) and seccomp (syscall filtering), they form a defense, but namespaces alone are not impenetrable.
Real isolation needs the full Docker security stack (or microVMs such as Kata Containers or Firecracker).

**Forgetting cgroups limit memory but not OOM behavior**

```bash
docker run --memory=256m greedy-app
# Once greedy-app's usage would exceed 256 MiB and nothing can be reclaimed,
# the kernel OOM-kills it.
```

An OOM-killed process receives SIGKILL, so the container exits with code 137 (128 + 9). Your supervisor or restart policy decides what to do next.

**Confusing user namespace with `--user`**

- `--user 1000:1000` = run the process as a specific UID inside the container. Without remapping, it is the same UID 1000 on the host.
- `userns-remap` = remap UIDs from container to host. Unrelated knob.

**Assuming PID namespaces work like containers everywhere**

Kubernetes pods can share a PID namespace between containers (the `shareProcessNamespace: true` field). In plain Docker, two `docker run` containers get separate PID namespaces by default; sharing requires an explicit `--pid=container:<name>` or `--pid=host`. Different mental model.

### Real-world usage

- **Every container, everywhere.** Namespaces and cgroups underlie every Docker and Podman container, Kubernetes pod, Lambda function (via Firecracker microVM), and Cloud Run task.
- **Hardened multi-tenant:** user namespace remapping + dropped capabilities + seccomp profile + read-only filesystem → strong-ish isolation.
- **Resource governance:** cgroup limits prevent one tenant from starving others on shared infrastructure.
- **Debugging container internals:** `lsns`, `nsenter`, `/proc/<pid>/ns/*` to inspect or join namespaces from outside.

### Follow-up questions

**Q:** Are namespaces or cgroups Linux-specific?
**A:** Yes. Both are Linux kernel features. Docker on macOS and Windows runs Linux containers inside a Linux VM for this reason.

**Q:** What is `unshare`?
**A:** A syscall (and CLI tool) that creates a new namespace and attaches the calling process. Run as root, `unshare --pid --fork --mount-proc /bin/bash` gives you a shell in fresh PID + mount namespaces: the simplest "build a container by hand" demo.

**Q:** What is the difference between cgroups and ulimits?
**A:** ulimits (resource limits set via PAM or `setrlimit`) are per-process: children inherit the values, but each process gets its own allowance. cgroups cap the combined usage of a whole tree of processes. Both can limit resources; cgroups are the kernel's modern, hierarchical answer.

**Q:** Can I see what cgroup a host process is in?
**A:** Yes: `cat /proc/<pid>/cgroup`. For a Docker process, this points into `/sys/fs/cgroup/<docker-cgroup-path>`.

**Q:** (Senior) How would you build a minimal container by hand using just namespaces + cgroups?
**A:** `unshare --pid --net --mount --uts --ipc --user --map-root-user --fork chroot /path/to/rootfs /bin/sh` gives you a process with isolated namespaces and a custom rootfs: the bare bones of a container. (`--map-root-user` maps your UID to root inside the new user namespace so that `chroot` keeps the capability it needs.) Add cgroups manually by writing to `/sys/fs/cgroup/<your-cgroup>/...`. This is roughly what `runc` does for you, automated and OCI-spec-compliant. Useful exercise to demystify what containers are.

## Examples

### Demonstrate namespace isolation

```bash
# Host
$ ps -ef | wc -l
312

# Inside container
$ docker run --rm alpine ps -ef | wc -l
2
# The container only sees its own processes: the header line plus ps itself, running as PID 1
```

Same kernel, two views. That is the PID namespace in action.

### Demonstrate cgroups

```bash
# A memory-greedy program
$ docker run --rm --memory=64m alpine sh -c 'dd if=/dev/zero of=/dev/null bs=1G count=1'
Killed
# Container OOM-killed at the 64MB limit.

$ docker run --rm --cpus=0.1 alpine sh -c 'time -- timeout 5 sh -c "yes > /dev/null"'
real 0m 5.00s
user 0m 0.50s
sys 0m 0.00s
# user time = 0.5s in 5s wall = 10% CPU = the cap
```

The kernel enforces; Docker just sets the parameters.

### Mapping --user vs userns-remap

```bash
# --user: changes UID inside (no remap)
$ docker run --rm --user 1000:1000 alpine id
uid=1000 gid=1000
# Outside, still UID 1000 (no isolation between in/out namespaces).

# With userns-remap (daemon-level config):
# /etc/docker/daemon.json: { "userns-remap": "default" }
$ docker run --rm alpine id
uid=0(root) gid=0(root)
# Inside is root.

$ ps -ef | grep <container-pid>
UID 165536 ...
# But on the host, the same process is UID 165536 (offset).
```

The second is much stronger isolation. Highly recommended for multi-tenant clusters.
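
The offset in that last example comes from the user namespace's UID map, which the kernel exposes in `/proc/<pid>/uid_map` as lines of `<first UID inside> <first UID on host> <range length>`. As a rough sketch of the translation, the hypothetical helper `map_uid` below (an illustrative name, not a Docker or kernel tool) applies one such line; the mapping `0 165536 65536` is assumed only because it matches the UID shown in the example, and real ranges depend on the host's `/etc/subuid`.

```bash
# Hypothetical helper: translate a container UID to a host UID using one
# uid_map line of the form "<inside-start> <outside-start> <length>".
map_uid() {
  echo "$2" | awk -v uid="$1" '{
    if (uid + 0 >= $1 && uid + 0 < $1 + $3) print $2 + (uid - $1)
    else print "unmapped"
  }'
}

map_uid 0     "0 165536 65536"   # prints 165536
map_uid 1000  "0 165536 65536"   # prints 166536
map_uid 70000 "0 165536 65536"   # prints unmapped
```

This covers a single map line only; a process with several `uid_map` lines would need each line checked in turn.
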