Docker layers and Union File System explained
Docker layers and the Union File System are how Docker turns a series of Dockerfile instructions into a single, runnable image while keeping disk usage and build time reasonable. The whole image-caching story rests on this design.
Theory
TL;DR
- Each Dockerfile instruction usually produces one layer: a read-only diff of the filesystem after the instruction ran.
- An image is an ordered stack of layers + a config blob. Layers are immutable and identified by their content hash (SHA256).
- A Union File System (OverlayFS in modern Docker) merges those read-only layers plus one writable layer into a single filesystem view that the container sees.
- Layers are deduplicated: ten images sharing python:3.13 keep that base on disk once.
- Writes inside a running container go to its writable layer using copy-on-write (CoW). They are lost when the container is removed - persistent state belongs in volumes.
- Build cache hits when an instruction has the same inputs as a previous build. Reorder instructions so cheap, frequently-changing things sit on top; slow, stable things live underneath.
Quick example
$ docker history nginx:1.27-alpine
IMAGE CREATED CREATED BY SIZE
4f06b3e2c0c1 2 weeks ago /bin/sh -c #(nop) CMD ["nginx" "-g" "daemo… 0B
<missing> 2 weeks ago /bin/sh -c #(nop) STOPSIGNAL SIGQUIT 0B
<missing> 2 weeks ago /bin/sh -c #(nop) EXPOSE 80 0B
<missing> 2 weeks ago /bin/sh -c set -x && addgroup -g 101 -S… 8.94MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV NGINX_VERSION=1.27.4 0B
<missing> 4 weeks ago /bin/sh -c #(nop) CMD ["/bin/sh"] 0B
<missing> 4 weeks ago /bin/sh -c #(nop) ADD file:abcd1234… 7.79MB

Each row is one layer. The Alpine base (ADD file:...) sits at the bottom; nginx-specific layers stack on top. Pull nginx:1.27-alpine and node:22-alpine and the Alpine base layer is shared - downloaded once, stored once.
What a layer actually is
A layer is a tarball containing:
- The files added or modified by that instruction.
- For deletions: a special whiteout file (.wh.<name>) telling the union FS "hide this file from layers below".
So if RUN apt-get install vim adds /usr/bin/vim, that file lands in the layer's tar. If a later RUN rm /usr/bin/vim removes it, a .wh.vim whiteout appears in that layer - but the actual file is still on disk in the earlier layer. Image size includes everything you ever added, even if you later deleted it.
Layers are content-addressed: their identity is the SHA256 of the tarball. Same tarball = same layer = stored once.
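To make content addressing concrete, here is a toy Python sketch (not Docker's actual code): pack files into a tar archive with fixed metadata and hash the bytes. Identical content always yields the identical digest, which is exactly why identical layers are stored once.

```python
import hashlib
import io
import tarfile

def layer_digest(files: dict[str, bytes]) -> str:
    """Toy layer identity: the SHA256 of the layer's tarball."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):          # stable ordering
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = 0                  # fixed metadata -> reproducible bytes
            tar.addfile(info, io.BytesIO(files[name]))
    return "sha256:" + hashlib.sha256(buf.getvalue()).hexdigest()

a = layer_digest({"usr/bin/vim": b"\x7fELF vim binary"})
b = layer_digest({"usr/bin/vim": b"\x7fELF vim binary"})
c = layer_digest({"usr/bin/vim": b"\x7fELF patched vim"})
print(a == b)  # True  - same tarball, same layer, stored once
print(a == c)  # False - any byte change produces a new layer
```

Real layer digests also depend on file modes, owners, and timestamps, which is why non-reproducible builds can produce "identical" layers with different digests.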
Union File System (OverlayFS)
Docker has supported several storage drivers over the years (AUFS, btrfs, devicemapper, zfs, OverlayFS). On modern Linux (kernel 4.x and later), OverlayFS (the overlay2 driver) is the default - it is fast, lives in the mainline kernel, and is well-maintained.
OverlayFS combines four directories into one mount:
- lowerdir - one or more read-only directories (your image layers, stacked).
- upperdir - one read-write directory (the container's writable layer).
- workdir - scratch directory used internally by the kernel.
- merged - the unified view that the container sees as /.
+------------------+
| merged/ | <- what the container sees
+------------------+
↑ ↑ ↑
| | |
+--------+ +--------+ +--------+
|upperdir| |lowerdir| |lowerdir|
| (RW) | |layer N | |layer 1 |
+--------+ +--------+ +--------+

Reads check upperdir first and fall through to the lower layers in order. Writes go to upperdir. Modifying a file from a lower layer triggers copy-up: the file is copied into upperdir, then modified there. The original in the lower layer is untouched.
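The lookup rule is simple enough to model in a few lines. This is a toy Python sketch of the name-resolution order (upper first, then lowers top to bottom), including how a whiteout entry hides copies in lower layers - not the kernel's actual implementation.

```python
# Toy OverlayFS lookup: each "layer" is a dict of path -> contents.
WHITEOUT = object()  # stand-in for a whiteout entry

def lookup(name, upper, lowers):
    """Return the visible contents of `name`, or None if absent/hidden."""
    for layer in [upper, *lowers]:        # upperdir first, then layer N..1
        if name in layer:
            return None if layer[name] is WHITEOUT else layer[name]
    return None

alpine   = {"/etc/hostname": "a3f9d2b8c1e4", "/bin/sh": "busybox"}
nginx    = {"/usr/sbin/nginx": "nginx-binary"}
upperdir = {"/etc/hostname": "new-name",   # copied up, then modified
            "/bin/sh": WHITEOUT}           # deleted inside the container

print(lookup("/etc/hostname", upperdir, [nginx, alpine]))    # new-name
print(lookup("/usr/sbin/nginx", upperdir, [nginx, alpine]))  # nginx-binary
print(lookup("/bin/sh", upperdir, [nginx, alpine]))          # None (whited out)
```

Note how /etc/hostname resolves to the upperdir copy while the alpine layer's original is untouched - that is the read side of copy-on-write.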
Copy-on-write in action
# Inside a container based on alpine:
/ # cat /etc/hostname
a3f9d2b8c1e4
# That file is in the alpine layer (lower, RO)
/ # echo new-name > /etc/hostname
# Now /etc/hostname is in the writable layer (upper).
# The alpine layer's copy is unchanged - other containers from
# the same image still see the original.

Copy-up is per-file. Modifying one byte of a 100 MB file copies the entire 100 MB into the writable layer first - this is why running a database inside the writable layer performs poorly. Use a volume.
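The per-file cost of copy-up can be sketched in toy form: reads fall through to the lower layer until the first write, which copies the whole file up, however small the write. (Illustrative model only; the kernel does this at the filesystem level.)

```python
# Toy copy-on-write between one lower (read-only) and one upper layer.
lower = {"big.db": b"x" * 100}   # pretend this is a 100 MB database file
upper = {}                        # the container's writable layer

def read(name):
    return upper.get(name, lower.get(name))  # upper wins, else fall through

def write(name, offset, data):
    if name not in upper:                          # first write: copy-up
        upper[name] = bytearray(lower[name])       # copies the WHOLE file
    upper[name][offset:offset + len(data)] = data  # then modify in place

write("big.db", 0, b"y")          # a one-byte write...
print(len(upper["big.db"]))       # ...copied all 100 bytes up first
print(lower["big.db"][:2])        # lower layer untouched: b'xx'
```

A volume bypasses this entirely: writes go straight to the host filesystem, with no union layers in the path.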
Build cache and layer reuse
When docker build runs an instruction, it computes a cache key from:
- The previous layer's digest (so the chain is deterministic).
- The instruction itself (the RUN/COPY text).
- For COPY and ADD: the digest of the files being copied.
- For RUN: just the command string. Docker does NOT inspect what the command does; the same string is a cache hit even if apt-get would download newer versions.
If the cache key matches a prior build, Docker reuses the existing layer. If not, it runs the instruction and creates a new layer. Once a step misses the cache, every step after it is also a miss.
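The chaining is the important part, and a toy Python sketch makes it visible (this is an illustrative hash scheme, not Docker's real cache-key algorithm): each key folds in the parent key, so one miss changes every key below it.

```python
import hashlib

def cache_key(parent: str, instruction: str, file_digest: str = "") -> str:
    """Toy build-cache key: parent key + instruction text
    (+ content digest of copied files for COPY/ADD)."""
    raw = f"{parent}|{instruction}|{file_digest}".encode()
    return hashlib.sha256(raw).hexdigest()[:12]

base = cache_key("", "FROM node:22-alpine")
deps = cache_key(base, "COPY package*.json ./", file_digest="sha256:aaa")
run1 = cache_key(deps, "RUN npm ci --omit=dev")

# Change the copied files: COPY misses, and the miss cascades downward.
deps2 = cache_key(base, "COPY package*.json ./", file_digest="sha256:bbb")
run2  = cache_key(deps2, "RUN npm ci --omit=dev")

print(deps == deps2)  # False - file content changed
print(run1 == run2)   # False - same RUN text, but the parent key differs
```

This is why "once a step misses, everything after it misses": the parent key is an input to every child key.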
Optimizing Dockerfile order for cache hits
# WRONG: source copied before deps installed
FROM node:22-alpine
WORKDIR /app
COPY . . # any code change invalidates everything below
RUN npm ci --omit=dev # re-runs every code change
# RIGHT: deps installed before source copied
FROM node:22-alpine
WORKDIR /app
COPY package*.json ./ # changes only when deps change
RUN npm ci --omit=dev # cached unless package.json changed
COPY . .               # changes when source changes - only this re-runs

For a typical Node app with stable dependencies, this turns a 60-second rebuild into a 2-second one. The cached npm ci layer is reused as long as package.json and package-lock.json are unchanged.
Image size implications
# WRONG: cleanup in a separate RUN doesn't help
RUN apt-get update
RUN apt-get install -y curl
RUN rm -rf /var/lib/apt/lists/*
# Layer 2 still contains the apt cache.
# Layer 3 only adds whiteouts - the cache is on disk.
# RIGHT: cleanup in same RUN as the install
RUN apt-get update && \
apt-get install -y --no-install-recommends curl && \
rm -rf /var/lib/apt/lists/*
# One layer, no cache files inside it.

Whiteouts hide files from the union view; they do not remove them from earlier layers. Cleanup must happen in the same RUN instruction that created the mess.
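The size accounting follows directly from the layer model, and a toy sketch shows it: image size is the sum of all layer tarballs, so a later whiteout frees nothing. (Illustrative model; sizes are made up.)

```python
# Toy size accounting: each layer is a dict of path -> size in bytes.
def image_size(layers):
    """An image ships every byte of every layer, deletions included."""
    return sum(size for layer in layers for size in layer.values())

separate_runs = [
    {"var/lib/apt/lists/index": 40_000_000,   # RUN apt-get update
     "usr/bin/curl": 250_000},                # RUN apt-get install -y curl
    {".wh.var_lib_apt_lists": 0},             # RUN rm -rf ... -> tiny whiteout
]
single_run = [
    {"usr/bin/curl": 250_000},                # install + cleanup in one RUN
]

print(image_size(separate_runs))  # 40250000 - the apt cache still ships
print(image_size(single_run))     # 250000
```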
Common mistakes
Adding files in one layer, deleting in another, and expecting smaller image
It does not work. A 200 MB file added in layer 4 and deleted in layer 5 produces a 200 MB image, not zero. The whiteout only hides; the bytes are still there. Use multi-stage builds when you need to keep something around at build time but not in the final image.
Rebuilding npm install on every code change
Classic symptom: builds take 60 seconds even for a one-line change. Fix: install before copying source. The lock file should land in its own layer, the install runs against it, then source copies on top.
Modifying large files from a lower layer and expecting it to be free
First write to a file from a lower layer triggers copy-up of the whole file. For a small config file, fine. For a multi-GB database file, slow and pointless - use a volume that bypasses the union FS entirely.
Cloning a giant repo in one RUN but deleting it in another
# FINE: clone, copy, and cleanup all inside one RUN
RUN git clone https://github.com/large/repo.git /tmp/r && \
    cp /tmp/r/binary /usr/local/bin && \
    rm -rf /tmp/r
# The layer is a snapshot taken at the END of the command, after the
# rm - only the binary lands in it.

The gotcha: the layer snapshot happens at the end of each RUN. All in one RUN, as above, the pattern is fine. Split the clone, copy, and rm into separate RUN instructions and the repo lives in the clone's layer forever; the later rm only adds whiteouts.
Storing image layers on a slow disk
OverlayFS is fast, but it is bottlenecked by the underlying storage. Container starts with the Docker root on a network mount or a slow USB drive feel terrible. Keep /var/lib/docker on a fast local SSD.
Real-world usage
- Multi-stage builds - the canonical way to keep build tools out of the final image. Stage 1 has compilers, stage 2 has only the binary; 1.5 GB build image becomes a 30 MB runtime image.
- Distroless and scratch images - take the multi-stage idea further. The final stage is FROM scratch or FROM gcr.io/distroless/base, leaving only your binary on disk. No shell, no package manager, near-zero attack surface.
- BuildKit cache mounts - RUN --mount=type=cache,target=/root/.cache/pip ... keeps a build cache between builds without baking it into a layer. The layer stays clean; the cache lives outside the image.
- Layer-aware registries - ECR, GCR, GHCR all dedupe blobs at the registry level. Pushing two images that share 90 percent of their layers only uploads the new 10 percent.
Follow-up questions
Q: Why not use one big layer for everything?
A: You lose cache. With one layer, any change rebuilds the whole image. With many layers, only the layers below the change are reused. The trade-off is overhead per layer (each adds bookkeeping); modern best practice is roughly one logical step per layer.
Q: Are Docker layers and Git commits similar?
A: Conceptually similar (ordered diffs you can stack), mechanically different. Layers are tarballs of a filesystem; Git stores objects deduplicated by content hash. Both are content-addressed and immutable. The OCI image spec actually borrows ideas directly from how Git works.
Q: What is docker history --no-trunc <image>?
A: It shows every layer of an image with the full instruction that produced it. The most useful command for understanding why an image is so big and where the bulk lives. Pair with dive for an interactive layer inspector.
Q: Why does Alpine produce smaller images than Debian or Ubuntu?
A: Alpine uses musl instead of glibc and busybox instead of GNU coreutils. The base image is around 7 MB vs 30+ MB for Debian slim. Caveat: musl can cause subtle compatibility problems with binaries assuming glibc behavior; use Alpine when you control the binaries, Debian-slim when you do not.
Q: (Senior) How does BuildKit make builds faster than the legacy builder?
A: Several ways. Parallel stage execution (multi-stage stages without dependencies build in parallel). Smarter cache keys (COPY only invalidates when the actual files change, not on directory mtimes). Cache mounts that persist across builds without bloating layers. Frontend syntax (# syntax=docker/dockerfile:1.7) lets new Dockerfile features ship without daemon upgrades. Worth enabling everywhere now; it is the default on Docker 23+.
Examples
Inspecting layer sizes with docker history
$ docker history --format '{{.Size}}\t{{.CreatedBy}}' node:22-alpine | head -4
74.4MB  /bin/sh -c addgroup -g 1000 node ...
80.6MB  /bin/sh -c apk add --no-cache python3 ...
5.2MB   /bin/sh -c #(nop) COPY file:abc...
0B      /bin/sh -c #(nop) WORKDIR /home/node

The apk add line is 80 MB. That is the layer to attack first if you need a smaller image (consider a slimmer base image or moving the build tools to an earlier stage).
Demonstrating layer dedup with docker pull
# First pull: downloads everything
$ docker pull node:22-alpine
22-alpine: Pulling from library/node
9824c27679d3: Pull complete # alpine base
f52e5f1a8a45: Pull complete # node binaries
ca7239f1a5a6: Pull complete # node setup
# Second pull: shares the alpine base
$ docker pull python:3.13-alpine
3.13-alpine: Pulling from library/python
9824c27679d3: Already exists # SAME alpine base, not redownloaded
8b1d5c8d2e7f: Pull complete   # python binaries

9824c27679d3 is the Alpine base layer. Two different images share it: downloaded once, stored once on disk. The "Already exists" line is content-addressed layer dedup at work.
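The pull behavior is easy to model: the local store is keyed by layer digest, so a pull only downloads digests it has not seen. A toy Python sketch (digests borrowed from the transcript above, purely illustrative):

```python
# Toy content-addressed blob store: pulls skip already-present digests.
store = set()

def pull(image_layers):
    """Download only the layers missing from the local store."""
    downloaded = [d for d in image_layers if d not in store]
    store.update(image_layers)
    return downloaded

node_22   = ["9824c27679d3", "f52e5f1a8a45", "ca7239f1a5a6"]
python_313 = ["9824c27679d3", "8b1d5c8d2e7f"]

print(pull(node_22))     # all three layers downloaded
print(pull(python_313))  # ['8b1d5c8d2e7f'] - shared base "Already exists"
```

Registries apply the same rule in reverse on push: blobs the registry already holds are skipped.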
Multi-stage to drop a build layer
# Stage 1: full Node toolchain (~600 MB)
FROM node:22-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Stage 2: tiny runtime (~20 MB)
FROM nginx:1.27-alpine
COPY --from=build /app/dist /usr/share/nginx/html

The final image is nginx:1.27-alpine (≈20 MB) plus your built static files. The 600 MB build stage is discarded - it was needed to produce the artifact, not to run it. This is the layered model used to full effect: build heavy, ship light.