Docker layers and Union File System explained

Docker layers and the Union File System are how Docker turns a series of Dockerfile instructions into a single, runnable image while keeping disk usage and build time reasonable. The whole image-caching story rests on this design.

Theory

TL;DR

  • Each Dockerfile instruction usually produces one layer: a read-only diff of the filesystem after the instruction ran.
  • An image is an ordered stack of layers + a config blob. Layers are immutable and identified by their content hash (SHA256).
  • A Union File System (OverlayFS in modern Docker) merges those read-only layers plus one writable layer into a single filesystem view that the container sees.
  • Layers are deduplicated: ten images sharing python:3.13 keep that base on disk once.
  • Writes inside a running container go to its writable layer using copy-on-write (CoW). They are lost when the container is removed - persistent state belongs in volumes.
  • The build cache hits when an instruction has the same inputs as in a previous build. Reorder instructions so cheap, frequently-changing things sit on top; slow, stable things live underneath.

Quick example

bash
$ docker history nginx:1.27-alpine
IMAGE          CREATED       CREATED BY                                      SIZE
4f06b3e2c0c1   2 weeks ago   /bin/sh -c #(nop)  CMD ["nginx" "-g" "daemo…   0B
<missing>      2 weeks ago   /bin/sh -c #(nop)  STOPSIGNAL SIGQUIT           0B
<missing>      2 weeks ago   /bin/sh -c #(nop)  EXPOSE 80                    0B
<missing>      2 weeks ago   /bin/sh -c set -x && addgroup -g 101 -S…        8.94MB
<missing>      2 weeks ago   /bin/sh -c #(nop)  ENV NGINX_VERSION=1.27.4     0B
<missing>      4 weeks ago   /bin/sh -c #(nop)  CMD ["/bin/sh"]              0B
<missing>      4 weeks ago   /bin/sh -c #(nop)  ADD file:abcd1234…           7.79MB

Each row is one layer. The Alpine base (ADD file:...) is at the bottom; nginx-specific layers stack on top. Pull nginx:1.27-alpine and node:22-alpine and the Alpine base layer is shared - downloaded once, stored once.

What a layer actually is

A layer is a tarball containing:

  • The files added or modified by that instruction.
  • For deletions: a special whiteout file (.wh.<name>) telling the union FS "hide the file from layers below".

So if RUN apt-get install vim adds /usr/bin/vim, that file lands in the layer's tar. If a later RUN rm /usr/bin/vim removes it, a .wh.vim whiteout appears in that layer - but the actual file is still on disk in the earlier layer. Image size includes everything you ever added, even if you later deleted it.

Layers are content-addressed: their identity is the SHA256 of the tarball. Same tarball = same layer = stored once.
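
You can check these digests locally. A quick look, assuming any pulled image (digests elided here):

bash
$ docker image inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' nginx:1.27-alpine
sha256:…
sha256:…
sha256:…

Two images that print the same digest for a layer are sharing that layer on disk.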

Union File System (OverlayFS)

Docker has supported several union FS drivers over the years (AUFS, btrfs, devicemapper, zfs, OverlayFS). On modern Linux (kernel 4.x and later), OverlayFS - via the overlay2 storage driver - is the default: it is fast, lives in the kernel proper, and is well-maintained.

An OverlayFS mount involves four directories:

  • lowerdir - one or more read-only directories (your image layers, stacked).
  • upperdir - one read-write directory (the container's writable layer).
  • workdir - scratch directory used internally by the kernel.
  • merged - the unified view that the container sees as /.
+------------------+
|     merged/      |  <- what the container sees
+------------------+
    ↑          ↑          ↑
    |          |          |
+--------+ +--------+ +--------+
|upperdir| |lowerdir| |lowerdir|
|  (RW)  | |layer N | |layer 1 |
+--------+ +--------+ +--------+

Reads check the upper layer first, fall through to lower layers. Writes go to upperdir. Modifying a file from a lower layer triggers copy-up: the file is copied into upperdir, then modified there. The original in the lower layer is untouched.
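
You can reproduce copy-up outside Docker with a hand-built overlay mount - a minimal sketch, assuming a Linux host with root and the overlay module loaded:

bash
$ mkdir -p lower upper work merged
$ echo "from lower" > lower/file.txt
$ sudo mount -t overlay overlay \
      -o lowerdir=lower,upperdir=upper,workdir=work merged
$ cat merged/file.txt                    # read falls through to lower
from lower
$ echo "changed" | sudo tee merged/file.txt >/dev/null   # first write: copy-up
$ ls upper/                              # the modified copy now lives in upper
file.txt
$ cat lower/file.txt                     # the lower layer is untouched
from lower
$ sudo umount merged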

Copy-on-write in action

bash
# Inside a container based on alpine:
/ # cat /etc/hostname
a3f9d2b8c1e4
# That file is in the alpine layer (lower, RO).

/ # echo new-name > /etc/hostname
# Now /etc/hostname is in the writable layer (upper).
# The alpine layer's copy is unchanged - other containers from
# the same image still see the original.

Copy-up is per file. Modifying one byte of a 100 MB file copies the entire 100 MB into the writable layer first - this is why "databases inside the writable layer" performs poorly. Use a volume.
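
For example - the image tag and names here are illustrative - putting a database's data directory on a named volume sidesteps copy-up entirely:

bash
$ docker volume create pgdata
$ docker run -d --name db \
      -e POSTGRES_PASSWORD=example \
      -v pgdata:/var/lib/postgresql/data \
      postgres:17-alpine
# Writes under /var/lib/postgresql/data go straight to the volume,
# bypassing OverlayFS and its per-file copy-up.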

Build cache and layer reuse

When docker build runs an instruction, it computes a cache key from:

  • The previous layer's digest (so the chain is deterministic).
  • The instruction itself (the RUN / COPY text).
  • For COPY and ADD: the digest of the files being copied.
  • For RUN: just the command string. Docker does NOT inspect what the command does; same string = cache hit even if apt-get would download new versions.

If the cache key matches a prior build, Docker reuses the existing layer. If not, it runs the instruction and creates a new layer. Once a step misses the cache, every step after it is also a miss.
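
A quick way to watch this happen (tag and paths are hypothetical):

bash
$ docker build -t app .              # first build: every step executes
$ docker build -t app .              # immediate rebuild: every step is CACHED
$ touch src/index.js                 # change a file picked up by COPY . .
$ docker build -t app .              # the COPY misses; everything after re-runs
$ docker build --no-cache -t app .   # force a full rebuild, ignoring the cache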

Optimizing Dockerfile order for cache hits

dockerfile
# WRONG: source copied before deps installed
FROM node:22-alpine
WORKDIR /app
COPY . .                 # any code change invalidates everything below
RUN npm ci --omit=dev    # re-runs on every code change

# RIGHT: deps installed before source copied
FROM node:22-alpine
WORKDIR /app
COPY package*.json ./    # changes only when deps change
RUN npm ci --omit=dev    # cached unless package*.json changed
COPY . .                 # changes when source changes - only this re-runs

For a typical Node app with stable deps, this turns a 60-second rebuild into a 2-second one. The cached npm ci layer is reused as long as package.json and package-lock.json are unchanged.
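
A related lever, sketched here with hypothetical entries: a .dockerignore keeps churn out of the build context, so COPY . . only invalidates when real source changes:

bash
$ cat > .dockerignore <<'EOF'
node_modules
.git
*.log
EOF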

Image size implications

dockerfile
# WRONG: cleanup in a separate RUN doesn't help
RUN apt-get update
RUN apt-get install -y curl
RUN rm -rf /var/lib/apt/lists/*
# Layer 2 still contains the apt cache.
# Layer 3 only adds whiteouts - the cache is on disk.

# RIGHT: cleanup in the same RUN as the install
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*
# One layer, no cache files inside it.

Whiteouts hide files from the union view; they do not actually remove them from earlier layers. Cleanup must happen in the same RUN that created the mess.
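
To see where the bytes went, build both variants and compare - the tags and Dockerfile names here are hypothetical:

bash
$ docker build -t demo:separate-runs -f Dockerfile.wrong .
$ docker build -t demo:single-run -f Dockerfile.right .
$ docker images demo                  # separate-runs is visibly larger
$ docker history demo:separate-runs   # the apt cache layer shows its size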

Common mistakes

Adding files in one layer, deleting them in another, and expecting a smaller image

It does not work. A 200 MB file added in layer 4 and deleted in layer 5 produces a 200 MB image, not zero. The whiteout only hides; the bytes are still there. Use multi-stage builds when you need to keep something around at build time but not in the final image.
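
A two-line demonstration, assuming an Alpine base (busybox ships dd):

dockerfile
FROM alpine:3.21
RUN dd if=/dev/zero of=/big.bin bs=1M count=200   # layer: +200 MB
RUN rm /big.bin                                   # layer: whiteout only
# docker images reports roughly 200 MB more than plain alpine.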

Rebuilding npm install on every code change

Classic symptom: builds take 60 seconds even for a one-line change. Fix: install before copying source. The lock file should land in its own layer, the install runs against it, then source copies on top.

Modifying files from a lower layer and expecting it to be free

First write to a file from a lower layer triggers copy-up of the whole file. For a small config file, fine. For a multi-GB database file, slow and pointless - use a volume that bypasses the union FS entirely.

Cloning a giant repo in one RUN and deleting it in another

dockerfile
# WRONG: each step is its own RUN - the repo lives in a layer forever
RUN git clone https://github.com/large/repo.git /tmp/r
RUN cp /tmp/r/binary /usr/local/bin
RUN rm -rf /tmp/r    # only adds whiteouts; the clone is still on disk

# RIGHT: clone, copy, and clean up in a single RUN
RUN git clone https://github.com/large/repo.git /tmp/r && \
    cp /tmp/r/binary /usr/local/bin && \
    rm -rf /tmp/r

The layer produced by a RUN is a snapshot of the filesystem at the END of the command - after the rm - so the single-RUN version never stores the repo. The mistake happens when clone, copy, and cleanup are split across separate RUN steps.

Storing image layers on a slow disk

OverlayFS is fast, but it is bottlenecked by the underlying storage. Container starts with a Docker root on a network mount or a slow USB drive feel terrible. Keep /var/lib/docker on a fast local SSD.

Real-world usage

  • Multi-stage builds - the canonical way to keep build tools out of the final image. Stage 1 has compilers, stage 2 has only the binary; 1.5 GB build image becomes a 30 MB runtime image.
  • Distroless and scratch images - take the multi-stage idea further. Final stage is FROM scratch or FROM gcr.io/distroless/base, leaving only your binary on disk. No shell, no package manager, near-zero attack surface.
  • BuildKit cache mounts - RUN --mount=type=cache,target=/root/.cache/pip ... keeps a build cache between builds without baking it into a layer. Layer stays clean; cache lives outside the image (see the sketch after this list).
  • Layer-aware registries - ECR, GCR, GHCR all dedupe blobs at the registry level. Pushing two images that share 90 percent of their layers only uploads the new 10 percent.
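
A minimal cache-mount sketch, assuming a requirements.txt in the build context:

dockerfile
# syntax=docker/dockerfile:1.7
FROM python:3.13-alpine
WORKDIR /app
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
# pip's download cache persists across builds in BuildKit's cache store,
# but never ends up inside an image layer.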

Follow-up questions

Q: Why not use one big layer for everything?


A: You lose cache. With one layer, any change rebuilds the whole image. With many layers, only the layers below the change are reused. The trade-off is overhead per layer (each adds bookkeeping); modern best practice is roughly one logical step per layer.

Q: Are Docker layers and Git commits similar?


A: Conceptually similar (ordered diffs you can stack), mechanically different. Layers are tarballs of a filesystem; Git stores objects deduplicated by content hash. Both are content-addressed and immutable. The OCI image spec actually borrows ideas directly from how Git works.

Q: What is docker history --no-trunc <image>?


A: It shows every layer of an image with the full instruction that produced it. The most useful command for understanding why an image is so big and where the bulk lives. Pair with dive for an interactive layer inspector.

Q: Why does Alpine produce smaller images than Debian or Ubuntu?


A: Alpine uses musl instead of glibc and busybox instead of GNU coreutils. The base image is around 7 MB vs 30+ MB for Debian slim. Caveat: musl can cause subtle compatibility problems with binaries assuming glibc behavior; use Alpine when you control the binaries, Debian-slim when you do not.

Q: (Senior) How does BuildKit make builds faster than the legacy builder?


A: Several ways. Parallel stage execution (multi-stage stages without dependencies build in parallel). Smarter cache keys (COPY only invalidates when the actual files change, not on directory mtimes). Cache mounts that persist across builds without bloating layers. Frontend syntax (# syntax=docker/dockerfile:1.7) lets new Dockerfile features ship without daemon upgrades. Worth enabling everywhere now; it is the default on Docker 23+.
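
On older engines where BuildKit is not yet the default, it can be enabled per command or daemon-wide:

bash
$ DOCKER_BUILDKIT=1 docker build -t app .
# Or persistently, in /etc/docker/daemon.json:
#   { "features": { "buildkit": true } }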

Examples

Inspecting layer sizes with docker history

bash
$ docker history --no-trunc --format '{{.Size}}\t{{.CreatedBy}}' node:22-alpine | head -10
74.4MB  /bin/sh -c addgroup -g 1000 node ...
80.6MB  /bin/sh -c apk add --no-cache python3 ...
5.2MB   /bin/sh -c #(nop) COPY file:abc...
0B      /bin/sh -c #(nop) WORKDIR /home/node

The apk add line is 80 MB. That is the layer to attack first if you need a smaller image (consider a slimmer base or trimming what that step installs).

Demonstrating layer dedup with docker pull

bash
# First pull: downloads everything
$ docker pull node:22-alpine
22-alpine: Pulling from library/node
9824c27679d3: Pull complete    # alpine base
f52e5f1a8a45: Pull complete    # node binaries
ca7239f1a5a6: Pull complete    # node setup

# Second pull: shares the alpine base
$ docker pull python:3.13-alpine
3.13-alpine: Pulling from library/python
9824c27679d3: Already exists   # SAME alpine base, not redownloaded
8b1d5c8d2e7f: Pull complete    # python binaries

9824c27679d3 is the Alpine base layer. Two different images share it - downloaded once, stored once on disk. The Already exists line is content-addressed layer dedup at work.

Multi-stage to drop a build layer

dockerfile
# Stage 1: full Node toolchain (~600 MB)
FROM node:22-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: tiny runtime (~20 MB)
FROM nginx:1.27-alpine
COPY --from=build /app/dist /usr/share/nginx/html

The final image is nginx:1.27-alpine (≈20 MB) plus your built static files. The 600 MB build stage is discarded - it was needed to produce the artifact, not to run it. This is the layered model used to its full effect: build heavy, ship light.
