Docker images are built in layers, and each Dockerfile instruction that changes the filesystem (RUN, COPY, ADD) creates a new layer. If you’re not careful, these layers can quickly bloat your image, leading to slower builds, higher storage costs, and longer deployment times.
Let’s see this in action. Imagine a simple Dockerfile that installs nginx and then copies a custom configuration file:
FROM ubuntu:latest
RUN apt-get update && apt-get install -y nginx
COPY nginx.conf /etc/nginx/nginx.conf
CMD ["nginx", "-g", "daemon off;"]
When you build this, FROM brings in the layers of the ubuntu base image, RUN apt-get update && apt-get install -y nginx adds a layer on top, and COPY nginx.conf adds another. CMD only records metadata and adds no filesystem content. Each layer is stored separately and can be pulled independently. If you later change nginx.conf, only that last layer needs to be rebuilt and re-pulled, which is efficient. But if you change the RUN command, every layer after it is invalidated and must be rebuilt, even if its content hasn’t changed.
The core challenge here is managing the cumulative effect of discrete build steps. Each instruction in a Dockerfile tells the builder to perform an action, and if that action modifies the filesystem, the result is a new layer. If you perform several operations in a single RUN command, their changes are bundled into one layer. If you perform them in separate RUN commands, you get multiple layers, each containing only the changes from that specific command.
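To make the difference concrete, here is a minimal sketch that puts both approaches side by side as two named stages in one Dockerfile. The stage names separate and combined are made up for this illustration and are not part of the original example.

```dockerfile
# Variant A: two RUN instructions, so two layers on top of the base image.
# (Hypothetical stage name "separate"; build it with: docker build --target separate .)
FROM ubuntu:latest AS separate
RUN apt-get update
RUN apt-get install -y nginx

# Variant B: one RUN instruction, so both steps land in a single layer.
# (Hypothetical stage name "combined"; build it with: docker build --target combined .)
FROM ubuntu:latest AS combined
RUN apt-get update && apt-get install -y nginx
```

Running docker history on each resulting image should list one more layer for the first variant than for the second.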
Here’s how the system works internally: Docker uses a layered filesystem (like OverlayFS). When you build an image, Docker starts with a base image (its own layers) and then applies each instruction sequentially. Each instruction that modifies the filesystem creates a new read-only layer on top of the previous one. The final image is a stack of these read-only layers, with a writable layer on top for running containers. When you pull an image, Docker pulls only the layers it doesn’t already have.
You control image size and efficiency primarily through the Dockerfile. Key levers include:
- Combining RUN commands: Bundle related commands together to reduce the number of layers.
- Cleaning up after installations: Remove package manager caches and temporary files within the same RUN command that installed them.
- Using smaller base images: Opt for images like alpine or debian-slim instead of full-blown OS images.
- Multi-stage builds: Use intermediate build stages to compile code or build artifacts, then copy only the necessary output to a final, minimal runtime image.
Let’s refine our nginx example using some of these principles. The original already chains apt-get update and apt-get install in a single RUN command, so the remaining step is to clean up inside that same command:
FROM ubuntu:latest
RUN apt-get update && \
    apt-get install -y nginx && \
    rm -rf /var/lib/apt/lists/*
COPY nginx.conf /etc/nginx/nginx.conf
CMD ["nginx", "-g", "daemon off;"]
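This is better. The rm -rf /var/lib/apt/lists/* removes the package index files that apt-get update downloaded, and it does so within the same layer that installed nginx. If this cleanup happened in a separate RUN command, those files would still exist in the earlier layer and still count toward the overall image size, because a later layer can only hide files from earlier layers, not remove them.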
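For contrast, this is roughly what the separate-cleanup version would look like. It is shown only as an anti-pattern sketch; the "Layer A" and "Layer B" labels in the comments are just names for this illustration.

```dockerfile
# Anti-pattern sketch: cleanup in its own RUN does not shrink the image.
FROM ubuntu:latest
# Layer A: contains nginx AND the apt index files under /var/lib/apt/lists.
RUN apt-get update && apt-get install -y nginx
# Layer B: only marks those files as deleted; layer A still ships them.
RUN rm -rf /var/lib/apt/lists/*
COPY nginx.conf /etc/nginx/nginx.conf
CMD ["nginx", "-g", "daemon off;"]
```

If you build both versions, docker history on each image is one way to compare the per-layer sizes and confirm where the index files ended up.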
Now, consider a multi-stage build. If you were building a Go application, a typical Dockerfile might look like this:
# Stage 1: Builder
FROM golang:1.20-alpine AS builder
WORKDIR /app
COPY . .
RUN go build -o myapp
# Stage 2: Runner
FROM alpine:latest
COPY --from=builder /app/myapp /usr/local/bin/myapp
CMD ["myapp"]
Here, the first stage (builder) uses a Go image to compile your application. This stage can be quite large because it contains the Go SDK and all its dependencies. The second stage (runner) starts from a tiny alpine image and only copies the compiled executable (myapp) from the builder stage. The entire Go toolchain and source code are discarded, resulting in a dramatically smaller final image.
The one thing most people don’t realize is that Docker’s layer caching works on a per-instruction basis, in order. For each instruction, Docker checks whether the instruction itself (and, for COPY and ADD, the contents of the files involved) matches a cached layer. As soon as one instruction misses the cache, that instruction and every instruction after it are rebuilt. So if a COPY of your source code sits above the RUN command that installs dependencies, any code change forces the dependencies to be reinstalled. This is why you often see Dockerfiles that copy application code after installing dependencies: if you change your code, only that COPY instruction and the layers after it are rebuilt, not the entire dependency installation.
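Applied to the Go example from earlier, the ordering might look like the sketch below. It assumes the project uses Go modules, with go.mod and go.sum at the root of the build context; adjust the file names if your project is laid out differently.

```dockerfile
# Stage 1: Builder
FROM golang:1.20-alpine AS builder
WORKDIR /app
# Copy only the module manifests first (assumed to exist in the build context).
# This layer's cache is invalidated only when go.mod or go.sum changes.
COPY go.mod go.sum ./
# Download dependencies; reused from cache as long as the manifests are unchanged.
RUN go mod download
# Copy the rest of the source; a code change invalidates the cache from here on.
COPY . .
RUN go build -o myapp

# Stage 2: Runner
FROM alpine:latest
COPY --from=builder /app/myapp /usr/local/bin/myapp
CMD ["myapp"]
```

With this ordering, editing a source file should trigger only the second COPY and the go build step, while the dependency download stays cached.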
The next concept you’ll likely encounter is optimizing the order of operations within your Dockerfile to maximize layer cache utilization.