How to implement Docker container security hardening?


Container security hardening is the practice of reducing the attack surface and blast radius of containerized workloads. A container is a group of processes that shares the host kernel; if it is compromised, the attacker is one kernel bug away from the host. Hardening lowers that risk through least privilege, immutable filesystems, dropped capabilities, and strong isolation policies. The CIS Docker Benchmark codifies the checklist.

Theory

TL;DR

Build-time: non-root USER, minimal base (distroless/scratch), no secrets in layers, scan for CVEs.
Run-time: --read-only, --cap-drop=ALL, no-new-privileges, seccomp profile, user-namespace remapping.
Host: kernel hardened, Docker daemon TLS-only, audit logs, AppArmor/SELinux.
Secrets: Docker secrets / mounted files / external vaults. Never env vars.
Network: isolated user-defined networks, no --network=host for app workloads.
Image supply: signed images (Docker Content Trust, cosign), private registries with RBAC.

Why containers need hardening

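Container isolation is not a strong security boundary by default. The kernel is shared. Unless the image or the run command says otherwise, a container runs as root inside, and that root keeps Docker's default capability set (CHOWN, DAC_OVERRIDE, SETUID, SETGID, NET_RAW, and others) unless you drop it. A misconfigured container can escape via: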

Privileged mode
Mounting /var/run/docker.sock
Kernel exploits (rare but real)
Vulnerable application code + writable filesystem
Excessive capabilities (CAP_SYS_ADMIN, CAP_NET_ADMIN)

Hardening removes the most common escape paths.

The hardening pyramid

            ┌─────────────────┐
            │  Image signing  │  ← supply chain
            ├─────────────────┤
            │   CVE scanning  │
            ├─────────────────┤
            │  Minimal base   │
            ├─────────────────┤
            │  Non-root USER  │  ← build-time
            ├─────────────────┤
            │ Drop caps, RO   │
            ├─────────────────┤
            │ Seccomp/AppArmor│  ← run-time
            ├─────────────────┤
            │ User-namespaces │
            ├─────────────────┤
            │ TLS daemon, RBAC│  ← host
            └─────────────────┘

Each layer reduces an attack class. Skipping USER and relying only on seccomp leaves easy wins for an attacker.

CIS Docker Benchmark categories

The benchmark groups checks:

Host configuration (Docker installation, audit, partition for /var/lib/docker)
Daemon configuration (TLS, audit, log level, ulimits)
Daemon files (perms on docker.sock, daemon.json)
Image and Build files (USER, no setuid, no dockerd in image)
Container runtime (cap-drop, seccomp, no privileged, ulimits)
Operations (image lifecycle, secrets, registries)

Use docker-bench-security (Docker Inc.'s open-source tool) for automated scanning.

Examples

Build-time: a hardened Dockerfile

dockerfile

FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# Static binary, no glibc dependency
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags='-s -w' -o /out/app

FROM gcr.io/distroless/static:nonroot
COPY --from=build /out/app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]

Why:

distroless/static ships no shell and no package manager, only the binary plus minimal runtime files such as CA certificates and timezone data. An attacker cannot drop into a shell to explore.
USER nonroot:nonroot runs as UID 65532, not root.
Multi-stage: the build toolchain stays out of the final image.
-ldflags='-s -w': strip symbols, smaller binary.
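To confirm that a built image really ends up non-root, the configured user can be read back from the image metadata. The sketch below is a minimal check, not part of any official tooling: the function name image_runs_as_root is hypothetical, and it assumes only that `docker image inspect --format '{{.Config.User}}'` prints the USER recorded in the image.

```shell
# Hypothetical helper: exit 0 when the image's configured user is root
# or unset (Docker then defaults to root), exit 1 otherwise.
image_runs_as_root() {
    image_user=$(docker image inspect --format '{{.Config.User}}' "$1") || return 2
    case "$image_user" in
        ''|0|root|0:*|root:*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (needs a Docker daemon):
#   docker build -t myorg/api:1.0 .
#   if image_runs_as_root myorg/api:1.0; then
#       echo "refusing to ship a root image" >&2
#       exit 1
#   fi
```

A check like this fits well as the last step of an image build job, so a missing USER line fails early instead of surfacing in production.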

Run-time: full hardening flags

bash

docker run -d \
    --name=api \
    --user=10001:10001 \
    --read-only \
    --tmpfs=/tmp:size=64m,mode=1777 \
    --tmpfs=/run:size=4m \
    --cap-drop=ALL \
    --cap-add=NET_BIND_SERVICE \
    --security-opt=no-new-privileges \
    --security-opt=seccomp=/etc/docker/seccomp/default.json \
    --security-opt=apparmor=docker-default \
    --pids-limit=200 \
    --memory=512m --memory-swap=512m \
    --cpus=1.0 \
    --network=app-net \
    -p 8080:8080 \
    -v /etc/myapp/config.yaml:/etc/myapp/config.yaml:ro \
    myorg/api:1.0

Line by line:

--user: explicit UID/GID (overrides image's USER if set, ensures non-root).
--read-only: rootfs is read-only; the app cannot modify itself.
--tmpfs: writable scratch dirs in tmpfs (RAM, lost on stop).
--cap-drop=ALL: removes every Linux capability. App cannot bind low ports, change time, mount, etc.
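--cap-add=NET_BIND_SERVICE: adds back only what is needed (bind to ports < 1024). Drop this if you only bind to ports >= 1024, as the 8080 mapping in this example does. On Docker 20.10 and later, containers can typically bind low ports even without it, because the engine lowers net.ipv4.ip_unprivileged_port_start inside the container's network namespace.
--security-opt=no-new-privileges: process cannot gain new privileges through setuid binaries or file capabilities.
--security-opt=seccomp: restricts syscalls to an allowlist. Docker applies its built-in default profile when the flag is omitted, and that default is fine for most apps.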
--security-opt=apparmor: applies AppArmor profile (default sets sane defaults).
--pids-limit: prevents fork-bomb DoS.
--memory --cpus: cgroup limits prevent resource exhaustion.
--network=app-net: isolated user-defined network (not default bridge, not host).
-v ...:ro: configs mounted read-only.
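Flags are easy to lose between a README and a deployment script, so it helps to read the effective settings back from the running container. The following is a rough sketch under stated assumptions: the function names are hypothetical, the template fields are the ones `docker inspect` exposes under .Config and .HostConfig, and the list of "weak" values is illustrative rather than a complete CIS check.

```shell
# Hypothetical helper: print one key=value line per hardening-related setting.
container_hardening_report() {
    docker inspect --format \
'user={{.Config.User}}
readonly_rootfs={{.HostConfig.ReadonlyRootfs}}
privileged={{.HostConfig.Privileged}}
cap_drop={{.HostConfig.CapDrop}}
cap_add={{.HostConfig.CapAdd}}
security_opt={{.HostConfig.SecurityOpt}}
pids_limit={{.HostConfig.PidsLimit}}
memory_bytes={{.HostConfig.Memory}}
network_mode={{.HostConfig.NetworkMode}}' "$1"
}

# Hypothetical helper: read such a report on stdin and print weak settings.
flag_weak_settings() {
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case "$key=$value" in
            user=|user=0|user=root|user=0:*|user=root:*)
                echo "WEAK: runs as root" ;;
            readonly_rootfs=false)
                echo "WEAK: writable root filesystem" ;;
            privileged=true)
                echo "WEAK: privileged mode" ;;
            'cap_drop=[]')
                echo "WEAK: no capabilities dropped" ;;
            pids_limit=|pids_limit=0|pids_limit=-1|'pids_limit=<nil>')
                echo "WEAK: no PID limit" ;;
            memory_bytes=0)
                echo "WEAK: no memory limit" ;;
            network_mode=host)
                echo "WEAK: host network" ;;
        esac
    done
}

# Usage (needs a Docker daemon and a container named "api"):
#   container_hardening_report api | flag_weak_settings
```

An empty result means none of the listed weaknesses were found; it does not prove the container is hardened in every respect.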

Linux capabilities reference

bash

# See which caps the container's processes hold (requires capsh inside the image; distroless images do not ship it)
docker exec api capsh --print
# Current: =
# Bounding set =
# Ambient set =

Common caps to keep:

NET_BIND_SERVICE — bind ports < 1024.
CHOWN, SETUID, SETGID — only if your app forks workers as different users.
DAC_OVERRIDE — only if you legitimately read files you do not own.

Never add:

SYS_ADMIN (massive scope: mount, namespace setup, and dozens of other privileged operations).
SYS_PTRACE (read other processes' memory).
SYS_MODULE (load kernel modules).
SYS_RAWIO (raw I/O port and device access), NET_RAW (raw sockets, packet spoofing).
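When the image has no capsh (or no shell at all), the capability sets can be read from the host through /proc instead. This is a minimal sketch with hypothetical function names; it assumes only the documented CapEff/CapPrm/CapBnd lines of /proc/<pid>/status, which hold hexadecimal bit masks.

```shell
# Hypothetical helper: print the hex mask of one capability set
# (CapEff, CapPrm, CapBnd, ...) from /proc/<pid>/status text on stdin.
cap_mask() {
    wanted="$1:"
    while read -r key value; do
        if [ "$key" = "$wanted" ]; then
            echo "$value"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: succeed when the chosen set is empty (all zeros).
cap_set_is_empty() {
    mask=$(cap_mask "$1") || return 2
    case "$mask" in
        ''|*[!0]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Usage on the Docker host (container named "api"):
#   pid=$(docker inspect --format '{{.State.Pid}}' api)
#   cap_set_is_empty CapEff < "/proc/$pid/status" && echo "no effective caps"
#   capsh --decode="$(cap_mask CapBnd < "/proc/$pid/status")"
```

For a non-root process, CapEff is normally all zeros even when --cap-add was used; the added capability then shows up in the bounding set (CapBnd), which is why the sketch takes the set name as a parameter.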

Seccomp profile basics

Seccomp filters restrict which syscalls a process may make. The default Docker profile is an allowlist that blocks roughly 44 of the 300+ syscalls (e.g., mount, reboot, kexec_load; ptrace only on kernels older than 4.8).

bash

# Run without seccomp (NOT recommended)
docker run --security-opt=seccomp=unconfined ...

# Run with custom profile
docker run --security-opt=seccomp=/path/to/profile.json ...

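A tighter profile allows fewer syscalls. Derive it from the syscalls your app actually makes (for example by tracing it under load) rather than by guessing: the Go runtime needs clone to create threads, so a Go HTTP server breaks without it. unshare and keyctl need no extra rule, because the default profile already blocks them in a container without extra capabilities.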
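Whether a seccomp filter is actually attached to a running process can be checked from the host. The sketch below uses a hypothetical function name and assumes only the Seccomp field of /proc/<pid>/status, where the kernel reports 0 (disabled), 1 (strict), or 2 (filter).

```shell
# Hypothetical helper: map the Seccomp field of /proc/<pid>/status
# (text supplied on stdin) to a readable label.
seccomp_mode() {
    while read -r key value; do
        if [ "$key" = "Seccomp:" ]; then
            case "$value" in
                0) echo "disabled" ;;
                1) echo "strict" ;;
                2) echo "filter" ;;
                *) echo "unknown" ;;
            esac
            return 0
        fi
    done
    echo "unknown"
    return 1
}

# Usage on the Docker host (container named "api"):
#   pid=$(docker inspect --format '{{.State.Pid}}' api)
#   seccomp_mode < "/proc/$pid/status"
```

A container started with seccomp=unconfined should report "disabled", while one running under the default or a custom profile should report "filter".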

Image scanning

bash

# Trivy: open-source, mature
trivy image --severity HIGH,CRITICAL myorg/api:1.0

# Docker Scout (CLI plugin, bundled with Docker Desktop)
docker scout cves myorg/api:1.0

# Snyk
snyk container test myorg/api:1.0

Integrate into CI: fail the build on any HIGH/CRITICAL with a fix available.
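One way to wire that gate into a pipeline is to scan between build and push, so a failing scan blocks publication. This is a sketch, not a prescribed pipeline: the function names are hypothetical, and it relies on Trivy's --exit-code, --severity, and --ignore-unfixed options to turn fixable findings into a non-zero exit status.

```shell
# Hypothetical helper: fail when the image has HIGH/CRITICAL CVEs
# for which a fixed package version exists.
scan_gate() {
    trivy image --exit-code 1 --severity HIGH,CRITICAL --ignore-unfixed "$1"
}

# Hypothetical helper: build, scan, and push only when the scan passes.
build_scan_push() {
    image="$1"
    docker build -t "$image" . || return 1
    if ! scan_gate "$image"; then
        echo "blocking push: fixable HIGH/CRITICAL findings in $image" >&2
        return 1
    fi
    docker push "$image"
}

# Usage in a CI job (needs docker and trivy on the runner):
#   build_scan_push myorg/api:1.0
```

Keeping the gate in a function makes it easy to swap Trivy for another scanner without touching the build and push steps.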

Secrets management

Bad:

dockerfile

ENV DB_PASSWORD=hunter2     # Baked into the image, leaks via docker history

Bad:

bash

docker run -e DB_PASSWORD=hunter2 myorg/app   # Visible in docker inspect, /proc/<pid>/environ, and shell history

Better:

bash

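# Swarm: store the secret, then attach it to a service (it appears at /run/secrets/db-pass)
printf 'hunter2' | docker secret create db-pass -
docker service create --name app --secret db-pass myorg/app

# Plain docker run: bind-mount a restricted host file read-only (host path is illustrative)
docker run -v /etc/myapp/secrets/db-pass:/run/secrets/db-pass:ro myorg/app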

# Or read from a vault at runtime
docker run -e VAULT_ROLE=app-prod myorg/app
# App fetches from HashiCorp Vault on startup

Docker Swarm secrets, Kubernetes secrets, HashiCorp Vault, AWS Secrets Manager — all let the app pull credentials at runtime, never persist them to the image.
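On the consuming side, an entrypoint script (in an image that has a shell) can read the mounted file instead of expecting an environment variable. The helper below is a minimal sketch; the function name read_secret_file is hypothetical, and /run/secrets/db-pass is simply the mount path used above.

```shell
# Hypothetical helper: print the contents of a mounted secret file,
# failing loudly when the file is missing or unreadable.
read_secret_file() {
    secret_path="$1"
    if [ ! -r "$secret_path" ]; then
        echo "secret file not readable: $secret_path" >&2
        return 1
    fi
    cat "$secret_path"
}

# Usage inside an entrypoint script:
#   db_password=$(read_secret_file /run/secrets/db-pass) || exit 1
#   # Keep it in a shell variable; do not `export` it, or it is back in the environment.
```

Command substitution strips a trailing newline, so a secret created with echo and one created with printf read back identically.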

Daemon hardening

In /etc/docker/daemon.json:

json

{
  "icc": false,
  "userns-remap": "default",
  "no-new-privileges": true,
  "log-driver": "json-file",
  "log-opts": {"max-size": "100m", "max-file": "3"},
  "live-restore": true,
  "userland-proxy": false
}

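icc: false: disables inter-container communication on the default bridge; containers that must talk to each other join a user-defined network.
userns-remap: maps container UID 0 to a high host UID, so root inside is unprivileged outside. The value "default" makes Docker create and use a dockremap user.
no-new-privileges: applies the no-new-privileges option to every container by default.
log-opts: caps log size so a noisy container cannot fill the host disk.
live-restore: containers keep running across daemon restarts, so the daemon can be patched without downtime.
userland-proxy: false: published ports are handled by iptables rules instead of a separate docker-proxy process per port.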
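After editing daemon.json and restarting the daemon, it is worth confirming that the settings took effect. The sketch below uses a hypothetical function name and assumes the SecurityOptions and LiveRestoreEnabled fields reported by `docker info`; with user-namespace remapping active, the security options typically include an entry named userns.

```shell
# Hypothetical helper: succeed when `docker info` lists the named
# security option (for example: userns, seccomp, apparmor, rootless).
daemon_has_security_option() {
    options=$(docker info --format '{{.SecurityOptions}}') || return 2
    case "$options" in
        *"name=$1"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (needs a Docker daemon):
#   daemon_has_security_option userns || echo "userns-remap is NOT active" >&2
#   docker info --format 'live_restore={{.LiveRestoreEnabled}}'
```

Checking the running daemon rather than the file catches the common case where daemon.json was edited but the daemon was never restarted.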

`--privileged` and Docker socket: avoid them

bash

docker run --privileged ...
# Grants all capabilities and all host devices, and disables seccomp/AppArmor confinement. Equal to root on host.

docker run -v /var/run/docker.sock:/var/run/docker.sock ...
# The container can spawn other containers, including privileged ones. Equal to root on host.

If you need Docker-in-Docker for CI, use rootless DinD with a dedicated daemon per pipeline, never the host's socket.
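Both patterns tend to creep in through copy-pasted compose files, so a periodic sweep of running containers is a cheap safeguard. The following is an illustrative sketch: the function name is hypothetical, and it assumes the .HostConfig.Privileged and .Mounts fields of `docker inspect`.

```shell
# Hypothetical helper: list running containers that are privileged
# or that mount the Docker socket.
list_risky_containers() {
    for id in $(docker ps -q); do
        docker inspect --format \
            '{{.Name}} privileged={{.HostConfig.Privileged}} mounts={{range .Mounts}}{{.Source}},{{end}}' \
            "$id"
    done | while read -r name privileged mounts; do
        case "$privileged $mounts" in
            privileged=true*) echo "$name: privileged" ;;
            *docker.sock*) echo "$name: mounts the Docker socket" ;;
        esac
    done
}

# Usage (needs a Docker daemon):
#   list_risky_containers
```

The container name is printed as Docker reports it, with a leading slash; an empty result means no running container matched either pattern.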

Real-world usage

PCI/HIPAA workloads: full CIS Docker Benchmark; auditors expect it.
Multi-tenant clusters: user-namespace remap is non-negotiable.
Public-facing APIs: read-only FS + drop caps + seccomp default.
Internal services: at minimum non-root + drop caps + memory limits.
CI runners: ephemeral, but still: limited caps, no privileged unless explicitly Docker-in-Docker.

Common mistakes

Running as root inside the container

The nginx:alpine default user is root. Apps that don't override USER run as root. Always set USER in your Dockerfile or --user at run.

Mounting /var/run/docker.sock

Giving a container the socket gives it root on the host. Use rootless DinD or sysbox if you need container-in-container.

Putting secrets in env vars

Visible to anyone with docker inspect permission, leaks to logs and crash dumps. Mount as files.

Skipping image scanning

A node:18 image pulled 6 months ago almost certainly has known CVEs by now. Pin versions, rebuild on base-image updates, and scan on every build.

Using --privileged for convenience

It is the equivalent of sudo -i. Almost never needed; use specific capability adds instead.

Follow-up questions

Q: What does --security-opt=no-new-privileges do?

A: Sets the kernel no_new_privs flag on the container's processes. They cannot gain new privileges through setuid binaries or file capabilities, and the flag is inherited by every child process. Combine with non-root USER for strong privilege containment.

Q: Should I use AppArmor or SELinux?

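A: Use whatever your distro defaults to. Ubuntu and Debian use AppArmor; RHEL/CentOS/Rocky use SELinux. Both add MAC (Mandatory Access Control). Docker loads its docker-default AppArmor profile automatically; SELinux confinement comes from the distro's container policy and requires selinux-enabled in the daemon configuration. Custom profiles are powerful but maintenance-heavy.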

Q: Is rootless Docker more secure than regular?

A: Yes for the host (compromise gives only the unprivileged user). Tradeoffs: limited port binding (< 1024 needs caps or proxy), no --network=host semantics, slightly slower (fuse-overlayfs on old kernels).

Q: (Senior) How do you reason about supply-chain security for container images?

A: Three layers: (1) provenance (signed images via cosign or Docker Content Trust, attestations of build), (2) SBOM (software bill of materials per image, scanned for CVEs), (3) policy (admission controllers like Kyverno or Gatekeeper that enforce signing/scanning before deploy). For high-stakes infra, build images yourself from trusted bases instead of pulling random Docker Hub images.
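The provenance layer can be enforced even without an admission controller by refusing to pull anything whose signature does not verify. This is a minimal sketch, assuming a key-based cosign setup: the function name and the cosign.pub key path are hypothetical, and it uses `cosign verify --key`, which exits non-zero when verification fails.

```shell
# Hypothetical helper: pull an image only after its signature verifies
# against the given public key.
verify_then_pull() {
    image="$1"
    public_key="$2"
    if ! cosign verify --key "$public_key" "$image" >/dev/null; then
        echo "signature check failed for $image" >&2
        return 1
    fi
    docker pull "$image"
}

# Usage (needs cosign and a Docker daemon; key path is illustrative):
#   verify_then_pull myorg/api:1.0 cosign.pub
```

Referencing the image by digest rather than by tag keeps the verified artifact and the pulled artifact identical even if the tag is moved in between.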

Q: (Senior) How do user namespaces change the security model?

A: With userns-remap, container UID 0 maps to a high host UID (e.g., 100000). A container that escapes still has unprivileged host access. Tradeoffs: image layers under remapped UIDs cannot be shared with non-remapped daemons; bind mounts need correct ownership; some software detects "not real root" and breaks. Worth the cost on multi-tenant hosts.
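The mapping can be observed directly on the host: each line of /proc/<pid>/uid_map lists an ID inside the namespace, the host ID it maps to, and the length of the range. The helper below is a sketch with a hypothetical name that reads that file and reports where container root lands.

```shell
# Hypothetical helper: read /proc/<pid>/uid_map on stdin and print the
# host UID that UID 0 inside the container maps to.
host_uid_for_container_root() {
    while read -r inside outside length; do
        if [ "$inside" = "0" ]; then
            echo "$outside"
            return 0
        fi
    done
    return 1
}

# Usage on the Docker host (container named "api"):
#   pid=$(docker inspect --format '{{.State.Pid}}' api)
#   host_uid_for_container_root < "/proc/$pid/uid_map"
```

A result of 0 means the container shares the host's user namespace, so root inside is root outside; a high value such as 100000 indicates that remapping is in effect.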
