Docker containers don’t actually isolate processes from the host kernel; they just give them their own view of it.

Let’s see this in action. Imagine you have a simple container running sleep 3600.

docker run -d --name sleepy ubuntu sleep 3600

Now, from your host machine, you can see this process. Not just as a Docker-managed PID, but as a regular Linux process.

ps aux | grep sleep

You’ll see something like:

root      12345  0.0  0.0   4320   740 pts/0    S+   10:30   0:00 sleep 3600

The 12345 is the PID on your host. This is where security starts to matter. If your container process can interact with the host kernel in unexpected ways, it’s a problem. That’s where Docker’s security features come in: capabilities, namespaces, and seccomp.
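The two views of the same process can be read straight from the kernel. As a minimal sketch, assuming a Linux host with kernel 4.1 or later (which added the NSpid field to /proc/<pid>/status), the hypothetical helper ns_pids prints a process's PID in each nested PID namespace, host first. The PID 12345 is the illustrative value from the ps output above, not a real one.

```shell
# ns_pids STATUS_FILE
# Print the PID a process has in each nested PID namespace, outermost (host)
# first, taken from the NSpid field of a /proc/<pid>/status file.
ns_pids() {
  awk '/^NSpid:/ { $1 = ""; sub(/^ +/, ""); print }' "$1"
}

# Usage on the host (12345 is the illustrative host PID from the ps output):
#   ns_pids /proc/12345/status
# For a containerized process the line holds two numbers: the host PID, then
# the PID inside the container.
```

For the sleepy container the second number should be 1, because sleep is the first process in the container's PID namespace.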

Capabilities: Granular Root Privileges

Traditionally, Linux processes are either root (UID 0) or non-root. Root can do anything. This is too much power for most containerized applications. Capabilities break down root’s power into smaller, distinct privileges. For example, CAP_NET_BIND_SERVICE allows binding to privileged ports below 1024 (like 80 or 443), while CAP_NET_ADMIN covers network administration such as configuring interfaces and routing tables. Binding to ports 1024 and above requires no capability at all.

By default, Docker drops most capabilities from containers. It only grants a minimal set that most applications need.

Common Cause: An application inside a container needs a specific privilege that Docker has dropped by default.

Diagnosis:

  1. Check the capabilities of a running process within the container:

    docker exec sleepy getpcaps 1
    

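    (PID 1 is the sleep process as seen from inside the container. docker top sleepy lists host PIDs, which are not valid inside the container’s PID namespace. getpcaps is part of the libcap tools and may need to be installed in minimal images.) You’ll likely see a limited set of capabilities.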

  2. If your application fails with a permission denied error related to a specific operation (e.g., binding to port 80), that’s a clue.
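The capability check in step 1 also works without extra tools: every process exposes its effective capability set as a hexadecimal bitmask on the CapEff line of /proc/<pid>/status. The sketch below decodes such a mask into names using the bit numbers from linux/capability.h. decode_caps is a hypothetical helper written for illustration, not a Docker or libcap command.

```shell
# Capability names in bit order (bit 0 first), from linux/capability.h.
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID
SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN NET_RAW
IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT
SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD LEASE
AUDIT_WRITE AUDIT_CONTROL SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM
BLOCK_SUSPEND AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE"

# decode_caps HEXMASK
# Print the names of the capabilities whose bits are set in HEXMASK
# (hex digits without a 0x prefix, as shown in /proc/<pid>/status).
decode_caps() {
  _mask=$(( 0x$1 ))
  _bit=0
  _out=""
  for _name in $CAP_NAMES; do
    if [ $(( (_mask >> _bit) & 1 )) -eq 1 ]; then
      _out="${_out:+$_out }$_name"
    fi
    _bit=$(( _bit + 1 ))
  done
  printf '%s\n' "$_out"
}

# Usage: read the mask from the container, decode it on the host.
#   decode_caps "$(docker exec sleepy grep CapEff /proc/1/status | awk '{ print $2 }')"
```

A root process started with Docker's default capability set typically reports the mask 00000000a80425fb, which decodes to fourteen capabilities including NET_BIND_SERVICE, NET_RAW and SYS_CHROOT.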

Fix: Add the specific capability using the --cap-add flag. NET_BIND_SERVICE is already part of Docker’s default set, so adding it only matters once the defaults have been dropped, for example with --cap-drop=ALL. To allow binding to privileged ports and nothing else:

docker run -d --name privileged_app --cap-drop=ALL --cap-add=NET_BIND_SERVICE your_image

This drops every capability and grants back only NET_BIND_SERVICE, allowing a root process inside the container to bind to ports below 1024 without holding any other root privilege.

Why it works: Instead of giving the container full root access, you’re granting it only the specific, necessary privilege (NET_BIND_SERVICE) to perform the required network operation, minimizing the attack surface.

Common Cause: An application requires a capability that is not part of Docker’s default set.

Diagnosis: Similar to above, use docker exec <container_name> getpcaps <pid_in_container> to see what’s available.

Fix: Add the missing capability. For instance, if your app needs to manipulate network interfaces:

docker run -d --name net_admin_app --cap-add=NET_ADMIN your_image

This adds the NET_ADMIN capability, allowing the container to perform various network-related administrative tasks.

Why it works: NET_ADMIN is not a default capability because it’s powerful. Adding it explicitly allows the containerized application to manage network interfaces, routing tables, etc., without granting it every other root privilege.

Common Cause: You’ve accidentally granted too many capabilities.

Diagnosis: Review the output of docker inspect <container_name> and look at the CapAdd and CapDrop fields under HostConfig.

Fix: Explicitly drop unnecessary capabilities using --cap-drop.

docker run -d --name minimal_app --cap-drop=NET_RAW --cap-drop=SYS_CHROOT your_image

This removes the NET_RAW and SYS_CHROOT capabilities, both of which are part of Docker’s default set, from the container.

Why it works: By default, Docker drops all capabilities except a safe subset. Explicitly dropping more capabilities further restricts what the container can do, adhering to the principle of least privilege.
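Least privilege can be taken one step further by starting from zero and listing only what the application needs. The sketch below uses a hypothetical helper, cap_flags, that turns a list of required capabilities into docker run flags. --cap-drop=ALL and --cap-add are real Docker options; the helper itself and the image name are illustrative.

```shell
# cap_flags CAP...
# Print docker run flags that drop every capability and add back only the
# ones named as arguments.
cap_flags() {
  _flags="--cap-drop=ALL"
  for _cap in "$@"; do
    _flags="$_flags --cap-add=$_cap"
  done
  printf '%s\n' "$_flags"
}

# Usage (your_image is a placeholder). The substitution is left unquoted on
# purpose so that each flag becomes a separate argument:
#   docker run -d --name minimal_app $(cap_flags NET_BIND_SERVICE) your_image
```

Keeping the required capabilities in one explicit list makes it easier to review what a container is allowed to do.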

Namespaces: Process Isolation

Namespaces are the fundamental Linux mechanism that gives containers their isolated view. They make a process think it’s the only one running, or that it has its own network stack, or its own filesystem root. Docker uses several types:

  • PID Namespace: Isolates process IDs. A process inside a container will see PID 1 as its init process, not the host’s PID 1 (init/systemd).
  • Network Namespace: Gives the container its own network interfaces, IP addresses, routing tables, etc.
  • Mount Namespace: Isolates filesystem mount points.
  • UTS Namespace: Isolates hostname and domain name.
  • IPC Namespace: Isolates inter-process communication resources.
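  • User Namespace: Isolates user and group IDs. Unlike the others, Docker does not enable it by default; it requires userns-remap or rootless mode.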
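Each namespace a process belongs to is visible as a symbolic link under /proc/<pid>/ns, and two processes in the same namespace have links with the same target. As a rough sketch, the hypothetical helper same_ns compares those links for a given namespace type. Reading the links of another user's process generally requires root, and 12345 again stands in for the container process's host PID.

```shell
# same_ns PID_A PID_B TYPE
# Succeed (exit 0) if both processes are in the same namespace of the given
# type (pid, net, mnt, uts, ipc, user); exit 1 if they differ; exit 2 if a
# link cannot be read. A PID may also be the literal word "self".
same_ns() {
  _a=$(readlink "/proc/$1/ns/$3") || return 2
  _b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$_a" = "$_b" ]
}

# Usage on the host, comparing the host's init with a container process:
#   same_ns 1 12345 net && echo "shared" || echo "isolated"
```

Run against a default container this should report the network namespace as isolated; for a container started with the host's network namespace it should report shared.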

Common Cause: A container process needs to communicate with services on the host network or see host processes.

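Diagnosis: If your containerized application gets "connection refused" when it connects to a host service listening on 127.0.0.1, or it needs the host’s network interfaces and only finds its own, it is running into network namespace isolation: the container’s loopback device and interfaces are separate from the host’s.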

Fix: Run the container with --network host.

docker run -d --name host_network_app --network host your_image

This tells Docker not to create a separate network namespace for the container. The container will share the host’s network stack.

Why it works: By using the host’s network namespace, the container process sees the host’s network interfaces, IP addresses, and can bind to any available port on the host directly, effectively bypassing the isolation provided by the network namespace.

Common Cause: A container needs to see the host’s process tree, or modify processes on the host.

Diagnosis: If a process inside the container cannot see host processes (e.g., ps lists only the container’s own processes), or signals sent to host PIDs fail with "No such process," it’s due to PID namespace isolation.

Fix: Run the container with --pid host.

docker run -d --name host_pid_app --pid host your_image

This removes the PID namespace isolation, allowing the container to see and interact with all processes on the host.

Why it works: The container now shares the host’s PID namespace. Processes inside the container will see the actual PIDs of processes running on the host, and can potentially interact with them (subject to Linux user permissions).

Seccomp: Restricting System Calls

Seccomp (Secure Computing mode) is a Linux kernel feature that lets a process install a filter restricting the system calls it can make. Docker uses seccomp profiles to restrict what syscalls a container can execute. The default Docker profile is an allowlist: it permits the syscalls most applications need and rejects the rest, including dangerous ones such as reboot and kexec_load.

Common Cause: An application inside a container is trying to make a system call that is blocked by Docker’s default seccomp profile.

Diagnosis:

  1. Check container logs for "Operation not permitted" errors, especially if they occur during application startup or when performing specific operations.
  2. If you suspect seccomp is the culprit, you can temporarily disable it to confirm:
    docker run -d --name debug_app --security-opt seccomp=unconfined your_image
    
    If the application works with seccomp=unconfined, then seccomp was indeed blocking a syscall.
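Before and after switching to seccomp=unconfined, it can help to confirm which seccomp mode the container process is actually running under. The kernel reports it in the Seccomp field of /proc/<pid>/status: 0 means disabled, 1 strict mode, 2 filter mode. The helper name seccomp_mode and the temporary file path below are hypothetical.

```shell
# seccomp_mode STATUS_FILE
# Print the seccomp mode recorded in a /proc/<pid>/status file as a word.
seccomp_mode() {
  case "$(awk '/^Seccomp:/ { print $2 }' "$1")" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# Usage: copy the status out of the container, then decode it on the host.
#   docker exec debug_app cat /proc/1/status > /tmp/debug_app.status
#   seccomp_mode /tmp/debug_app.status
```

A container started with the default profile should report filter, and one started with seccomp=unconfined should report disabled.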

Fix: Create a custom seccomp profile that explicitly allows the required syscall(s).

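  1. Get the default profile. The kernel only reports whether a filter is active (Seccomp: 2 means filter mode), not the profile itself:
    docker run --rm alpine cat /proc/1/status | grep Seccomp
    # The profile cannot be read back from a running container.
    # Docker's default profile is published in the moby/moby repository as
    # profiles/seccomp/default.json. Save a copy as default.json.
    
  2. Copy default.json to custom.json and add the necessary syscall. For example, if clone3 is blocked and needed (the // lines are placeholders for the existing entries and must not appear in the real file, since JSON has no comments):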
    {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [
            "SCMP_ARCH_X86_64",
            "SCMP_ARCH_X86",
            "SCMP_ARCH_AARCH64"
        ],
        "syscalls": [
            // ... existing syscalls ...
            {
                "names": ["clone3"],
                "action": "SCMP_ACT_ALLOW"
            }
            // ...
        ]
    }
    
  3. Apply the custom profile:
    docker run -d --name custom_seccomp_app --security-opt seccomp=/path/to/your/custom.json your_image
    

Why it works: Seccomp profiles define an allowlist or blocklist of system calls. By adding a specific syscall to an allowlist in a custom profile, you permit that single syscall to be executed by the container process, while all other potentially dangerous syscalls remain blocked by default or by the explicit rules in your profile.

Common Cause: You want to run an application that requires a very broad set of syscalls, and managing a custom profile is too complex.

Diagnosis: Application fails with various "Operation not permitted" errors across different functionalities.

Fix: Disable seccomp filtering for the container.

docker run -d --name no_seccomp_app --security-opt seccomp=unconfined your_image

This tells Docker not to apply any seccomp filtering to this container.

Why it works: Seccomp filtering is completely bypassed. The container’s processes can make any system call that the host kernel permits based on standard Linux user permissions, effectively removing this layer of security.

These three mechanisms (capabilities, namespaces, and seccomp) are the bedrock of Docker container security, working together to provide isolation and restrict the attack surface. The next logical step in hardening is often exploring AppArmor or SELinux profiles and runtime security tools like Falco.
