Linux capabilities take a surprising approach to managing container privilege: not by removing root, but by surgically removing specific root powers.
Let’s see them in action. Imagine a simple Go program that needs to bind to a privileged port (<1024).
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	port := "80"
	if os.Getuid() == 0 {
		fmt.Printf("Running as root, binding to port %s\n", port)
	} else {
		fmt.Printf("Running as non-root, attempting to bind to port %s\n", port)
	}
	listener, err := net.Listen("tcp", ":"+port)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error listening on port %s: %v\n", port, err)
		os.Exit(1)
	}
	defer listener.Close()
	fmt.Printf("Successfully listening on port %s\n", port)
}
If we build this and run it directly as a regular user:
$ go build -o server
$ ./server
Running as non-root, attempting to bind to port 80
Error listening on port 80: listen tcp :80: bind: permission denied
It fails, as expected. Now, if we run it as root in a container:
$ docker run --rm -v $(pwd):/app -w /app --user 0 --entrypoint ./server my-go-app
Running as root, binding to port 80
Successfully listening on port 80
This works, but gives the container full root privileges. We can do better.
Linux Capabilities break down the monolithic "root" user into distinct privileges. Instead of granting all root powers, you grant only the specific powers a process needs. For our net.Listen example, the required capability is CAP_NET_BIND_SERVICE.
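We can restrict the container to just this capability by dropping everything else with --cap-drop ALL and adding it back with --cap-add:
$ docker run --rm -v $(pwd):/app -w /app --user 0 --cap-drop ALL --cap-add NET_BIND_SERVICE --entrypoint ./server my-go-app
Running as root, binding to port 80
Successfully listening on port 80
This still runs as user 0 inside the container, but NET_BIND_SERVICE is now the only capability the process holds. The --cap-drop ALL matters: Docker's default capability set already includes NET_BIND_SERVICE along with several others, so --cap-add NET_BIND_SERVICE on its own would change nothing. If the process tried another privileged operation, like changing a file's owner to a different user (CAP_CHOWN) or mounting a filesystem (CAP_SYS_ADMIN), it would fail.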
The mental model here is a spectrum of privileges. Traditionally, it was root (all powers) or non-root (no special powers). Capabilities introduce granular control, allowing us to give a process some root powers without giving it all of them. Each process carries its own capability sets. When it attempts an operation that requires a capability, the kernel checks whether that capability is in the process's effective set.
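To make the per-process check concrete, here is a minimal sketch that reads the effective set from the CapEff line of /proc/self/status and decodes the bitmask. The helper names parseCapEff and decodeCaps are made up for this example, and the name table is deliberately partial; the bit numbers follow linux/capability.h.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capNames maps bit positions to capability names (partial list,
// numbering taken from <linux/capability.h>).
var capNames = map[uint]string{
	0:  "CAP_CHOWN",
	1:  "CAP_DAC_OVERRIDE",
	10: "CAP_NET_BIND_SERVICE",
	12: "CAP_NET_ADMIN",
	13: "CAP_NET_RAW",
	21: "CAP_SYS_ADMIN",
	34: "CAP_SYSLOG",
}

// parseCapEff extracts the hexadecimal CapEff mask from the text of
// /proc/<pid>/status. (Hypothetical helper for this sketch.)
func parseCapEff(status string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "CapEff:") {
			hex := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
			return strconv.ParseUint(hex, 16, 64)
		}
	}
	return 0, fmt.Errorf("CapEff line not found")
}

// decodeCaps returns the known capability names set in mask, ordered by
// bit position. Bits missing from capNames are skipped.
func decodeCaps(mask uint64) []string {
	var names []string
	for bit := uint(0); bit < 64; bit++ {
		if mask&(1<<bit) == 0 {
			continue
		}
		if name, ok := capNames[bit]; ok {
			names = append(names, name)
		}
	}
	return names
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Fprintf(os.Stderr, "cannot read process status: %v\n", err)
		os.Exit(1)
	}
	mask, err := parseCapEff(string(data))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("effective mask: %#x\n", mask)
	fmt.Println("known capabilities:", decodeCaps(mask))
}
```

Running this inside a container under different --cap-add and --cap-drop settings is a quick way to see which capabilities the process really holds.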
The key is to identify the minimum set of capabilities your application needs. For network services, CAP_NET_BIND_SERVICE is common. For manipulating network interfaces, CAP_NET_ADMIN is needed. For privileged operations on the kernel log, CAP_SYSLOG might be required. The strace command is your best friend here. Run your application (initially as root, if necessary) under strace -f, or narrow the output with strace -f -e trace=network, to see what system calls are being made. When the same run is repeated without privileges, the calls that fail with EPERM or EACCES point at the missing capability. strace does not name the capability itself, so consult the capabilities(7) man page to map each failing call to the capability it requires.
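The application can also help with this hunt by reporting permission failures distinctly from other errors. The sketch below assumes the standard library's error wrapping: errors.Is with os.ErrPermission matches both EACCES and EPERM underneath a net.Listen error. The helpers isPermissionError and syntheticBindError are hypothetical names introduced for illustration; the second one only builds a sample error so the check can be exercised without touching a real port.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// isPermissionError reports whether err was ultimately caused by EACCES
// or EPERM, the two errno values a missing capability usually produces.
func isPermissionError(err error) bool {
	return errors.Is(err, os.ErrPermission)
}

// syntheticBindError builds an error with the same wrapping net.Listen
// uses when bind(2) fails with EACCES. It exists only for demonstration.
func syntheticBindError() error {
	return &net.OpError{
		Op:  "listen",
		Net: "tcp",
		Err: os.NewSyscallError("bind", syscall.EACCES),
	}
}

func main() {
	ln, err := net.Listen("tcp", ":80")
	switch {
	case err == nil:
		fmt.Println("bound to port 80")
		ln.Close()
	case isPermissionError(err):
		// A hint, not a diagnosis: other security layers can also deny bind.
		fmt.Fprintf(os.Stderr, "%v (is CAP_NET_BIND_SERVICE missing?)\n", err)
		os.Exit(1)
	default:
		fmt.Fprintf(os.Stderr, "listen failed for another reason: %v\n", err)
		os.Exit(1)
	}
}
```

A permission error is only a hint. The same errno can come from seccomp, AppArmor, or SELinux policy, so the capabilities(7) lookup is still needed.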
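For example, a container that needs to modify network routes might require CAP_NET_ADMIN. Instead of running it as root, you’d run it as a non-root user (say, 1000) and add the capability. With a non-root user, --cap-add by itself only permits the capability for the container; it does not place it in the process's effective set, so the binary also needs a matching file capability (for example, set with setcap cap_net_admin+ep when the image is built). The run command then looks like this:
$ docker run --rm --user 1000 --cap-add NET_ADMIN my-network-tool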
This is powerful because it minimizes the attack surface. If an attacker compromises a process running with only CAP_NET_BIND_SERVICE, they can’t mount filesystems or load kernel modules. They are confined to the specific privilege granted.
A common pitfall is to assume that --cap-add replaces the container's capabilities with only the ones you list. It doesn't. The default set of capabilities for a process launched by Docker is already reduced compared to full root, but it is not empty: you start with that baseline and --cap-add adds to it. To explicitly remove all capabilities and start from scratch, you would use --cap-drop ALL --cap-add SPECIFIC_CAP.
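The difference between adding to the baseline and starting from scratch can be modeled in a few lines. This is a simplified sketch, not Docker's implementation: the resolve function is a hypothetical helper, it ignores --cap-add ALL, and the default list reflects Docker's documented defaults at the time of writing, which may change between versions.

```go
package main

import (
	"fmt"
	"sort"
)

// dockerDefaults is the default capability list from Docker's docs
// (assumed current; check your Docker version).
var dockerDefaults = []string{
	"AUDIT_WRITE", "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID",
	"KILL", "MKNOD", "NET_BIND_SERVICE", "NET_RAW", "SETFCAP",
	"SETGID", "SETPCAP", "SETUID", "SYS_CHROOT",
}

// contains reports whether list includes the given capability name.
func contains(list []string, name string) bool {
	for _, c := range list {
		if c == name {
			return true
		}
	}
	return false
}

// resolve models how --cap-drop and --cap-add combine: drops are applied
// to the defaults first (ALL clears them), then adds are merged in.
// The result is sorted so it is easy to compare.
func resolve(defaults, drop, add []string) []string {
	set := map[string]bool{}
	if !contains(drop, "ALL") {
		for _, c := range defaults {
			if !contains(drop, c) {
				set[c] = true
			}
		}
	}
	for _, c := range add {
		set[c] = true
	}
	out := make([]string, 0, len(set))
	for c := range set {
		out = append(out, c)
	}
	sort.Strings(out)
	return out
}

func main() {
	// --cap-add NET_ADMIN: the baseline stays, NET_ADMIN joins it.
	fmt.Println(resolve(dockerDefaults, nil, []string{"NET_ADMIN"}))
	// --cap-drop ALL --cap-add NET_BIND_SERVICE: only one capability remains.
	fmt.Println(resolve(dockerDefaults, []string{"ALL"}, []string{"NET_BIND_SERVICE"}))
}
```

In this model the first call returns the whole default list plus NET_ADMIN, while the second returns NET_BIND_SERVICE alone.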
The next concept you’ll run into is managing these capabilities in Kubernetes, where they are configured via securityContext.capabilities.