# Docker layers and Union File System explained

## Short answer

**Docker layers** are read-only filesystem deltas. Each Dockerfile instruction adds one layer (only `RUN`, `COPY`, and `ADD` add real filesystem content; the rest record metadata); the image is the stack of layers. **Union File System** (OverlayFS in modern Docker) merges those layers into a single view, with one writable layer on top for the running container.

```
+----------------------------+
| writable layer (container) |  <- changes here die with the container
+----------------------------+
| layer 4: COPY app/ /app    |  <- added by Dockerfile
+----------------------------+
| layer 3: RUN npm ci        |
+----------------------------+
| layer 2: WORKDIR /app      |
+----------------------------+
| layer 1: FROM node:22      |
+----------------------------+
```

**Key:** layers are deduplicated and cached. Two images sharing the same Node base store that base only once. Reorder Dockerfile instructions so the slow, stable ones land in lower (cached) layers; cheap, fast-changing ones go on top.

## Answer

**Docker layers and the Union File System** are how Docker turns a series of Dockerfile instructions into a single, runnable image while keeping disk usage and build time reasonable. The whole image-caching story rests on this design.

## Theory

### TL;DR

- Each Dockerfile instruction usually produces one **layer**: a read-only diff of the filesystem after the instruction ran.
- An **image** is an ordered stack of layers plus a config blob. Layers are immutable and identified by their content hash (SHA256).
- A **Union File System** (OverlayFS in modern Docker) merges those read-only layers plus one **writable layer** into a single filesystem view that the container sees.
- Layers are **deduplicated**: ten images sharing `python:3.13` keep that base on disk once.
- Writes inside a running container go to its writable layer using **copy-on-write (CoW)**. They are lost when the container is removed - persistent state belongs in volumes.
- Build cache hits when an instruction has the same inputs as a previous build. Reorder instructions so cheap, frequently-changing things sit on top; slow, stable things live underneath.

### Quick example

```bash
$ docker history nginx:1.27-alpine
IMAGE          CREATED       CREATED BY                                      SIZE
4f06b3e2c0c1   2 weeks ago   /bin/sh -c #(nop) CMD ["nginx" "-g" "daemo…     0B
<missing>      2 weeks ago   /bin/sh -c #(nop) STOPSIGNAL SIGQUIT            0B
<missing>      2 weeks ago   /bin/sh -c #(nop) EXPOSE 80                     0B
<missing>      2 weeks ago   /bin/sh -c set -x && addgroup -g 101 -S…        8.94MB
<missing>      2 weeks ago   /bin/sh -c #(nop) ENV NGINX_VERSION=1.27.4      0B
<missing>      4 weeks ago   /bin/sh -c #(nop) CMD ["/bin/sh"]               0B
<missing>      4 weeks ago   /bin/sh -c #(nop) ADD file:abcd1234…            7.79MB
```

Each row is one layer. The Alpine base (`ADD file:...`) is at the bottom; nginx-specific layers stack on top. Pull `nginx:1.27-alpine` and `node:22-alpine` and the Alpine base layer is shared - downloaded once, stored once.

### What a layer actually is

A layer is a tarball containing:

- The files added or modified by that instruction.
- For deletions: a special whiteout file (`.wh.<name>`) telling the union FS "hide this file from the layers below".

So if `RUN apt-get install vim` adds `/usr/bin/vim`, that file lands in the layer's tar. If a later `RUN rm /usr/bin/vim` removes it, a `.wh.vim` whiteout appears in *that* layer - but the actual file is still on disk in the earlier layer. **Image size includes everything you ever added, even if you later deleted it.**

Layers are content-addressed: their identity is the SHA256 of the tarball. Same tarball = same layer = stored once.

### Union File System (OverlayFS)

Docker has supported several union FS drivers over the years (AUFS, btrfs, devicemapper, zfs, OverlayFS).
On modern Linux (kernel 4.x+), **OverlayFS** is the default - it is fast, lives in the mainline kernel, and is well-maintained.

OverlayFS combines four directories into one mount:

- **lowerdir** - one or more read-only directories (your image layers, stacked).
- **upperdir** - one read-write directory (the container's writable layer).
- **workdir** - scratch directory used internally by the kernel.
- **merged** - the unified view that the container sees as `/`.

```
+----------------------------------+
|             merged/              |  <- what the container sees
+----------------------------------+
     ↑           ↑           ↑
     |           |           |
+--------+  +--------+  +--------+
|upperdir|  |lowerdir|  |lowerdir|
|  (RW)  |  |layer N |  |layer 1 |
+--------+  +--------+  +--------+
```

Reads check the upper layer first and fall through to the lower layers. Writes go to upperdir. Modifying a file from a lower layer triggers **copy-up**: the file is copied into upperdir, then modified there. The original in the lower layer is untouched.

### Copy-on-write in action

```bash
# Inside a container based on alpine:
/ # cat /etc/hostname
a3f9d2b8c1e4
# That file is in the alpine layer (lower, RO)

/ # echo new-name > /etc/hostname
# Now /etc/hostname is in the writable layer (upper).
# The alpine layer's copy is unchanged - other containers from
# the same image still see the original.
```

Copy-up is per file. Modifying one byte of a 100 MB file copies the entire 100 MB into the writable layer first - this is why "databases inside the writable layer" performs poorly. Use a volume.

### Build cache and layer reuse

When `docker build` runs an instruction, it computes a cache key from:

- The previous layer's digest (so the chain is deterministic).
- The instruction itself (the `RUN` / `COPY` text).
- For `COPY` and `ADD`: the digest of the files being copied.
- For `RUN`: just the command string. Docker does NOT inspect what the command does; same string = cache hit even if `apt-get` would download new versions.

If the cache key matches a prior build, Docker reuses the existing layer.
If not, it runs the instruction and creates a new layer. **Once a step misses the cache, every step after it is also a miss.**

### Optimizing Dockerfile order for cache hits

```dockerfile
# WRONG: source copied before deps installed
FROM node:22-alpine
WORKDIR /app
COPY . .                 # any code change invalidates everything below
RUN npm ci --omit=dev    # re-runs on every code change

# RIGHT: deps installed before source copied
FROM node:22-alpine
WORKDIR /app
COPY package*.json ./    # changes only when deps change
RUN npm ci --omit=dev    # cached unless package*.json changed
COPY . .                 # changes when source changes - only this re-runs
```

For a typical Node app with stable deps, this turns a 60-second rebuild into a 2-second one. The cached `npm ci` layer is reused as long as `package.json` and `package-lock.json` are unchanged.

### Image size implications

```dockerfile
# WRONG: cleanup in a separate RUN doesn't help
RUN apt-get update
RUN apt-get install -y curl
RUN rm -rf /var/lib/apt/lists/*
# Layer 2 still contains the apt cache.
# Layer 3 only adds whiteouts - the cache is still on disk.

# RIGHT: cleanup in the same RUN as the install
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*
# One layer, no cache files inside it.
```

Whiteouts hide files from the union view; they do not actually remove them from earlier layers. Cleanup must happen in the same `RUN` that created the mess.

### Common mistakes

**Adding files in one layer, deleting in another, and expecting a smaller image**

It does not work. A 200 MB file added in layer 4 and deleted in layer 5 produces a 200 MB image, not zero. The whiteout only hides; the bytes are still there. Use multi-stage builds when you need to keep something around at build time but not in the final image.

**Rebuilding `npm install` on every code change**

Classic symptom: builds take 60 seconds even for a one-line change. Fix: install before copying source.
The lock file should land in its own layer, the install runs against it, then source copies on top.

**Modifying large files from a lower layer and expecting it to be free**

The first write to a file from a lower layer triggers copy-up of the whole file. For a small config file, fine. For a multi-GB database file, slow and pointless - use a volume that bypasses the union FS entirely.

**Using `RUN` to clone a giant repo and then deleting it**

```dockerfile
# WRONG: the repo lives in a layer forever
RUN git clone https://github.com/large/repo.git /tmp/r
RUN cp /tmp/r/binary /usr/local/bin
RUN rm -rf /tmp/r
# The clone layer contains the whole repo; the rm only adds whiteouts.

# RIGHT: clone, copy, and clean up in a single RUN
RUN git clone https://github.com/large/repo.git /tmp/r && \
    cp /tmp/r/binary /usr/local/bin && \
    rm -rf /tmp/r
```

The gotcha: a layer is a snapshot of the filesystem at the END of its `RUN` command, after the `rm`. So the single-`RUN` pattern is fine; the mistake happens when each step is its own `RUN`.

**Storing image layers on a slow disk**

OverlayFS is fast, but it is bottlenecked by storage. Container starts on a network-mounted Docker root or a slow USB drive feel terrible. Keep `/var/lib/docker` on a fast local SSD.

### Real-world usage

- **Multi-stage builds** - the canonical way to keep build tools out of the final image. Stage 1 has compilers, stage 2 has only the binary; a 1.5 GB build image becomes a 30 MB runtime image.
- **Distroless and scratch images** - take the multi-stage idea further. The final stage is `FROM scratch` or `FROM gcr.io/distroless/base`, leaving only your binary on disk. No shell, no package manager, near-zero attack surface.
- **BuildKit cache mounts** - `RUN --mount=type=cache,target=/root/.cache/pip ...` keeps a build cache between builds without baking it into a layer. The layer stays clean; the cache lives outside the image.
- **Layer-aware registries** - ECR, GCR, and GHCR all dedupe blobs at the registry level. Pushing two images that share 90 percent of their layers only uploads the new 10 percent.
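The cache-mount bullet above can be sketched as a full Dockerfile. This is an illustrative sketch, not a drop-in file: the `python:3.13-slim` base and `requirements.txt` are assumptions, and `RUN --mount=type=cache` requires BuildKit.

```dockerfile
# syntax=docker/dockerfile:1.7
FROM python:3.13-slim
WORKDIR /app
COPY requirements.txt .
# pip's download cache persists between builds in a BuildKit-managed
# directory; it never becomes part of any image layer
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
COPY . .
```

On a rebuild with a changed `requirements.txt`, the `RUN` step misses the layer cache and re-executes, but pip still finds previously downloaded wheels in the cache mount - faster rebuilds without a fatter image.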
### Follow-up questions

**Q:** Why not use one big layer for everything?
**A:** You lose the cache. With one layer, any change rebuilds the whole image. With many layers, only the layers below the change are reused. The trade-off is per-layer overhead (each adds bookkeeping); modern best practice is roughly one logical step per layer.

**Q:** Are Docker layers and Git commits similar?
**A:** Conceptually similar (ordered diffs you can stack), mechanically different. Layers are tarballs of a filesystem; Git stores objects deduplicated by content hash. Both are content-addressed and immutable. The OCI image spec borrows ideas directly from how Git works.

**Q:** What is `docker history --no-trunc <image>`?
**A:** It shows every layer of an image with the full instruction that produced it - the most useful command for understanding why an image is so big and where the bulk lives. Pair it with `dive` for an interactive layer inspector.

**Q:** Why does Alpine produce smaller images than Debian or Ubuntu?
**A:** Alpine uses musl instead of glibc and busybox instead of GNU coreutils. The base image is around 7 MB vs 30+ MB for Debian slim. Caveat: musl can cause subtle compatibility problems with binaries that assume glibc behavior; use Alpine when you control the binaries, Debian-slim when you do not.

**Q:** (Senior) How does BuildKit make builds faster than the legacy builder?
**A:** Several ways. Parallel stage execution (multi-stage stages without dependencies build in parallel). Smarter cache keys (`COPY` invalidates only when the actual file contents change). Cache mounts that persist across builds without bloating layers. Frontend syntax (`# syntax=docker/dockerfile:1.7`) lets new Dockerfile features ship without daemon upgrades. Worth enabling everywhere; it is the default builder since Docker 23.0.
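The lookup rules from the Theory section - upper wins, whiteouts hide, otherwise fall through to lower - can be simulated in plain POSIX shell, no kernel overlay mount required. The directory names here are illustrative, not Docker's real on-disk layout:

```shell
#!/bin/sh
# Simulate union-FS lookup over two plain directories.
set -eu
root=$(mktemp -d)
mkdir -p "$root/lower" "$root/upper"

echo "config-v1" > "$root/lower/app.conf"   # file shipped in the image layer
echo "hello"     > "$root/lower/data.txt"   # another image file
echo "config-v2" > "$root/upper/app.conf"   # container wrote it: copy-up result
: > "$root/upper/.wh.data.txt"              # container deleted it: whiteout

# Lookup order: whiteout in upper -> gone; present in upper -> shadows lower;
# otherwise fall through to lower.
lookup() {
  if   [ -e "$root/upper/.wh.$1" ]; then echo "ENOENT"
  elif [ -e "$root/upper/$1" ];     then cat "$root/upper/$1"
  elif [ -e "$root/lower/$1" ];     then cat "$root/lower/$1"
  else echo "ENOENT"
  fi
}

lookup app.conf    # prints config-v2: upper shadows lower
lookup data.txt    # prints ENOENT: whiteout hides the lower copy
```

Real OverlayFS does this in the kernel per path component, but the precedence is the same - which is why a deleted file still occupies space in its original layer.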
## Examples

### Inspecting layer sizes with `docker history`

```bash
$ docker history --format '{{.Size}}\t{{.CreatedBy}}' node:22-alpine | head -4
74.4MB  /bin/sh -c addgroup -g 1000 node ...
80.6MB  /bin/sh -c apk add --no-cache python3 ...
5.2MB   /bin/sh -c #(nop) COPY file:abc...
0B      /bin/sh -c #(nop) WORKDIR /home/node
```

The `apk add` line is 80 MB. That is the layer to attack first if you need a smaller image (consider a slimmer base or `--no-cache`).

### Demonstrating layer dedup with `docker pull`

```bash
# First pull: downloads everything
$ docker pull node:22-alpine
22-alpine: Pulling from library/node
9824c27679d3: Pull complete      # alpine base
f52e5f1a8a45: Pull complete      # node binaries
ca7239f1a5a6: Pull complete      # node setup

# Second pull: shares the alpine base
$ docker pull python:3.13-alpine
3.13-alpine: Pulling from library/python
9824c27679d3: Already exists     # SAME alpine base, not redownloaded
8b1d5c8d2e7f: Pull complete      # python binaries
```

`9824c27679d3` is the Alpine base layer. Two different images share it - downloaded once, stored once on disk. The `Already exists` line is content-addressed layer storage at work.

### Multi-stage to drop a build layer

```dockerfile
# Stage 1: full Node toolchain (~600 MB)
FROM node:22-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: tiny runtime (~20 MB)
FROM nginx:1.27-alpine
COPY --from=build /app/dist /usr/share/nginx/html
```

The final image is `nginx:1.27-alpine` (≈20 MB) plus your built static files. The 600 MB build stage is discarded - it was needed to *produce* the artifact, not to *run* it. This is the layered model used to its full effect: build heavy, ship light.
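### Content-addressed dedup in miniature

The `Already exists` behavior above boils down to one idea: a layer's identity is the hash of its bytes, so identical blobs collapse to one stored copy. A shell sketch with plain files standing in for layer tarballs (all paths and names here are made up for illustration):

```shell
#!/bin/sh
# A toy blob store keyed by content digest, like a registry or daemon.
set -eu
store=$(mktemp -d)
mkdir -p "$store/blobs"

# Two images that happen to ship the same base layer bytes
printf 'pretend this is the alpine base layer tar\n' > "$store/layer-from-node.tar"
printf 'pretend this is the alpine base layer tar\n' > "$store/layer-from-python.tar"

add_blob() {
  digest=$(sha256sum "$1" | cut -d' ' -f1)
  if [ -e "$store/blobs/$digest" ]; then
    echo "already exists: $digest"      # second image: nothing to store
  else
    cp "$1" "$store/blobs/$digest"
    echo "stored: $digest"              # first image: blob written once
  fi
}

add_blob "$store/layer-from-node.tar"
add_blob "$store/layer-from-python.tar"
ls "$store/blobs" | wc -l               # one blob on disk, not two
```

Same bytes, same digest, one copy - whether the store is `/var/lib/docker` or a remote registry.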