BuildKit, Docker’s next-generation builder, can drastically reduce your build times, and help keep images lean, by intelligently handling layer caching and deduplication.
Here’s how it works in practice:
Imagine you have a Dockerfile like this:
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 \
    python3-pip \
    curl \
    jq \
    && rm -rf /var/lib/apt/lists/*
COPY app.py /app/app.py
COPY requirements.txt /app/requirements.txt
RUN pip3 install --no-cache-dir -r /app/requirements.txt
CMD ["python3", "/app/app.py"]
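When you build this with the legacy docker build, each RUN and COPY instruction creates a new layer, and a change to the inputs of any instruction invalidates the cache for that instruction and every one after it. Because COPY app.py comes before the pip install step, changing app.py but not requirements.txt re-executes pip install and creates a new layer, even though the installed packages haven’t changed. This wastes build time and leaves you with another copy of a large layer to store and push.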
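With BuildKit, the build process is analyzed differently. It doesn’t just execute commands sequentially; it converts the Dockerfile into a dependency graph (LLB), which lets it run independent stages in parallel, skip stages the target doesn’t need, and key its cache on the content of each step’s inputs.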
Let’s see BuildKit in action. First, ensure you have BuildKit enabled. It is the default builder as of Docker Engine 23.0; on older versions you can enable it by setting the DOCKER_BUILDKIT=1 environment variable:
export DOCKER_BUILDKIT=1
docker build -t my-app .
Now, consider this scenario:
- Initial Build: Build the image. BuildKit executes every step and stores each result in its cache. The cache key for a RUN step is derived from the command text and the state it runs on top of; the key for a COPY step is derived from a checksum of the files copied. BuildKit does not record which package versions apt-get install or pip install resolved.
- Change app.py only: The Dockerfile itself is unchanged. When you run docker build again, BuildKit detects from the file checksum that app.py has changed, so COPY app.py misses the cache. Every later step builds on that layer, so COPY requirements.txt and RUN pip install are re-executed as well, just as with the legacy builder; only the apt-get install layer is reused. To have the large pip install layer reused in this case, move COPY requirements.txt and RUN pip install above COPY app.py, so that editing app.py only rebuilds the final COPY.
- Change requirements.txt: Again the Dockerfile is unchanged. BuildKit identifies that the checksum of requirements.txt differs, so COPY requirements.txt misses the cache. It re-executes pip install, creating a new layer for it, and all subsequent layers are also rebuilt. The steps before COPY requirements.txt are still served from the cache.
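One way to get the pip install layer reused when only app.py changes is to order the Dockerfile so that the dependency install comes before the application code. The following is a sketch rather than a drop-in file: it keeps the paths used above and assumes python3 and python3-pip have to be installed from apt, since ubuntu:22.04 ships without them.

```dockerfile
# syntax=docker/dockerfile:1
FROM ubuntu:22.04
# Assumption: Python and pip come from apt; the base image has neither.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 \
        python3-pip \
        curl \
        jq \
    && rm -rf /var/lib/apt/lists/*
# Dependency manifest first: this COPY and the install below stay cached
# for as long as requirements.txt is unchanged.
COPY requirements.txt /app/requirements.txt
RUN pip3 install --no-cache-dir -r /app/requirements.txt
# Application code last: editing app.py only rebuilds from here down.
COPY app.py /app/app.py
CMD ["python3", "/app/app.py"]
```

With this ordering, a change to requirements.txt still re-runs pip install, while a change to app.py leaves it cached.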
This granular caching is a core part of BuildKit. It also introduces several other optimizations:
- Parallelism: BuildKit can execute independent build steps concurrently, such as separate stages of a multi-stage build, speeding up builds significantly.
- Deduplication: BuildKit stores layers by content, so identical layers across different builds or even within the same build are stored and transferred once, further reducing disk space and network transfer.
- Remote Caching: You can export your build cache to a registry (like Docker Hub or a private registry) with the --cache-to option of docker buildx build, and import it on other machines or CI/CD agents with --cache-from, enabling distributed caching.
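To make the parallelism point concrete, here is a hedged multi-stage sketch. The stage names deps and config, the python:3.12-slim base, the /install prefix, and the generated config.json are all illustrative assumptions, not part of the example above. The two named stages do not depend on each other, so BuildKit is free to build them concurrently before assembling the final stage.

```dockerfile
# syntax=docker/dockerfile:1

# Stage 1 (hypothetical name "deps"): install Python packages into /install.
FROM python:3.12-slim AS deps
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir --prefix=/install -r /tmp/requirements.txt

# Stage 2 (hypothetical name "config"): independent of "deps", so it can
# run at the same time. It uses jq to write a small JSON file.
FROM ubuntu:22.04 AS config
RUN apt-get update && apt-get install -y --no-install-recommends jq \
    && rm -rf /var/lib/apt/lists/*
RUN mkdir /out && echo '{"debug": false}' | jq -c . > /out/config.json

# Final stage: depends on both stages above, so it waits for both.
FROM python:3.12-slim
COPY --from=deps /install /usr/local
COPY --from=config /out/config.json /app/config.json
COPY app.py /app/app.py
CMD ["python", "/app/app.py"]
```

A stage that the final target never references would not be built at all, which is another saving over a purely sequential build.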
The most surprising thing about BuildKit is how it treats RUN commands: it does see them as opaque shell scripts. It does not trace the files a command reads or writes. The cache key for a RUN step is derived from the command text and the state it runs on top of (the preceding layers, plus anything explicitly mounted into the step). For example, in RUN apt-get update && apt-get install -y --no-install-recommends curl jq && rm -rf /var/lib/apt/lists/*, BuildKit has no idea which versions of curl and jq were installed. If the command text and the parent layers are unchanged, the cached layer is reused, even if newer packages have been published since. The rm -rf /var/lib/apt/lists/* matters for a different reason: because it runs in the same RUN as apt-get update, the package index files are removed before the layer is committed, making the layer smaller.
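BuildKit does let a RUN step declare a file input explicitly through a bind mount, which ties the step to that file without a separate COPY layer. The sketch below assumes requirements.txt sits at the root of the build context and that Python is installed from apt; the /tmp/requirements.txt target path is an arbitrary choice for illustration.

```dockerfile
# syntax=docker/dockerfile:1
FROM ubuntu:22.04
# Assumption: Python and pip come from apt; the base image has neither.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 \
        python3-pip \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
# requirements.txt is mounted read-only from the build context for this
# step only. It is an input to the step but is not copied into the image.
RUN --mount=type=bind,source=requirements.txt,target=/tmp/requirements.txt \
    pip3 install --no-cache-dir -r /tmp/requirements.txt
COPY app.py /app/app.py
CMD ["python3", "/app/app.py"]
```

Here the install step should be rebuilt when the mounted file's content changes, and the image carries no leftover copy of requirements.txt.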
To leverage BuildKit’s full potential, you can use a cache mount, written as RUN --mount=type=cache. This gives a step a directory that persists across builds on the same builder and is not stored in the image layer.
Consider this pattern for caching pip dependencies:
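# syntax=docker/dockerfile:1
FROM ubuntu:22.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
# requirements.txt must be in the image before pip can read it
COPY requirements.txt .
# Use a cache mount for pip downloads; --no-cache-dir is omitted on purpose,
# because it would tell pip to ignore the mounted cache
RUN --mount=type=cache,target=/root/.cache/pip \
    pip3 install -r requirements.txt
COPY . .
# This uses the installed dependencies
RUN python3 app.py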
When you build this, pip install populates the cache at /root/.cache/pip. If requirements.txt hasn’t changed on the next build, the RUN step hits the layer cache and is skipped entirely. If requirements.txt changes, the layer cache is invalidated for that step and pip install re-runs, but the cache mount is still there, so pip reuses the packages it downloaded earlier and only fetches what is new. Avoid passing --no-cache-dir together with a cache mount, since that flag tells pip not to use the cache directory at all.
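The same idea can be applied to apt. The sketch below follows the pattern shown in Docker’s documentation for cache mounts; the package list (curl, jq) is carried over from the earlier example. The first RUN is needed because the official Ubuntu image is configured to delete downloaded packages after install, which would leave the cache mount empty.

```dockerfile
# syntax=docker/dockerfile:1
FROM ubuntu:22.04
# Stop the base image from cleaning apt's download cache after each install.
RUN rm -f /etc/apt/apt.conf.d/docker-clean; \
    echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' \
        > /etc/apt/apt.conf.d/keep-cache
# Both apt directories live in cache mounts, so they persist across builds
# but never end up in the image. sharing=locked serializes concurrent builds.
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends curl jq
```

Because /var/lib/apt is a cache mount here, there is no need for rm -rf /var/lib/apt/lists/* to keep the layer small.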
The most impactful optimization BuildKit offers is that it models the build as a dependency graph of content-addressed steps, not as a linear script. It does not interpret what a command means, but it tracks how steps and stages depend on one another more precisely than the legacy builder, leading to more effective cache hits and faster, more efficient builds. For instance, in a multi-stage build BuildKit only builds the stages that the final target depends on and runs independent stages in parallel, and a COPY is cached by the checksum of the files it copies, so only the steps downstream of a real content change are rebuilt. This is achieved through content-addressable storage and dependency graph analysis.
The next hurdle you’ll likely encounter is managing remote build cache configurations for team collaboration.