Seccomp filters are a surprisingly effective way to lock down Linux containers, but they don’t stop your container from attempting disallowed system calls; the kernel intercepts each attempt and, depending on the action the filter specifies, either fails the call with an error or kills the process.

Let’s see what this looks like in practice. Imagine a simple container that’s not supposed to be able to make any network calls. We’ll start with a basic Dockerfile:

FROM ubuntu:latest
RUN apt-get update && apt-get install -y --no-install-recommends netcat-traditional && rm -rf /var/lib/apt/lists/*
CMD ["nc", "-l", "-p", "8080"]

If we build and run this normally:

docker build -t netcat-test .
docker run -d --name netcat-container netcat-test

The container starts up, and nc begins listening on port 8080. The system call we want to block is socket, which creates the network endpoint nc listens on. We can use strace to see the system calls a process makes. First, we need the host PID of the container’s main process:


docker inspect --format '{{.State.Pid}}' netcat-container

Let’s say the PID is 12345. Now, we can attach strace to it:

sudo strace -p 12345

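Because nc created its socket at startup, attaching at this point will most likely just show it blocked, waiting for a connection. If you trace nc from the moment it starts instead (for example, by running strace nc -l -p 8080 on a machine with both tools installed), the output includes a line like this where nc sets up its listening socket: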

...
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
...

Now, let’s apply a seccomp filter that denies the socket syscall. We’ll use the docker run command with the --security-opt flag. We need to create a JSON file for the seccomp profile. Let’s call it no-socket.json:

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_ARM",
    "SCMP_ARCH_AARCH64"
  ],
  "syscalls": [
    {
      "names": [
        "socket",
        "socketpair",
        "bind",
        "connect",
        "accept",
        "accept4",
        "listen"
      ],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1
    }
  ]
}

This profile explicitly denies a common set of network-related syscalls by returning errno 1 (Operation not permitted). Now, let’s run our container with this profile:
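If you maintain more than one profile, generating the JSON can be less error-prone than editing it by hand. The following is a minimal sketch, not part of Docker itself: the helper name build_denylist_profile and the NETWORK_SYSCALLS list are made up for this example, and the set of action names is only a subset of what the profile format accepts.

```python
import json

# A subset of the action names accepted in Docker's seccomp profile format.
VALID_ACTIONS = {
    "SCMP_ACT_ALLOW",
    "SCMP_ACT_ERRNO",
    "SCMP_ACT_KILL",
    "SCMP_ACT_TRAP",
    "SCMP_ACT_LOG",
}

# Hypothetical list for this example: the network syscalls we want to deny.
NETWORK_SYSCALLS = [
    "socket", "socketpair", "bind", "connect", "accept", "accept4", "listen",
]


def build_denylist_profile(denied, action="SCMP_ACT_ERRNO", errno_ret=1):
    """Return a profile dict that allows everything except the `denied` syscalls."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown seccomp action: {action}")
    rule = {"names": sorted(set(denied)), "action": action}
    # errnoRet only makes sense when the action returns an error code.
    if action == "SCMP_ACT_ERRNO":
        rule["errnoRet"] = errno_ret
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "architectures": [
            "SCMP_ARCH_X86_64",
            "SCMP_ARCH_X86",
            "SCMP_ARCH_ARM",
            "SCMP_ARCH_AARCH64",
        ],
        "syscalls": [rule],
    }


profile = build_denylist_profile(NETWORK_SYSCALLS)
profile_json = json.dumps(profile, indent=2)
print(profile_json)
```

Redirecting the printed output to a file gives you something you can pass to --security-opt seccomp=. Validating the action name up front catches typos before Docker rejects the profile at container start.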

docker run -d --name netcat-secure --security-opt seccomp=./no-socket.json netcat-test

Almost immediately, the container will stop. You can check its status:

docker ps -a

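You’ll see netcat-container (the one without the filter) still running, but netcat-secure will have exited with a non-zero status, typically Exited (1). With SCMP_ACT_ERRNO the kernel does not kill nc; the socket call fails with EPERM and nc gives up on its own. Had the profile used SCMP_ACT_KILL instead, the process would be terminated with SIGSYS and Docker would typically report Exited (159), that is, 128 + 31. Exited (137) is 128 + 9, meaning SIGKILL, which usually points to docker kill, a stop timeout, or the OOM killer rather than seccomp.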
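Exit statuses above 128 conventionally mean “128 plus the number of the signal that terminated the process”. A small helper can make the status column easier to read. This is a sketch under that convention only; describe_exit_status is a name invented for this example, and a program is free to return a status above 128 on its own, so treat the result as a hint.

```python
import signal


def describe_exit_status(status):
    """Translate a container exit status into a human-readable hint."""
    if status == 0:
        return "exited cleanly"
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with error status {status}"


# 137 = 128 + 9, and SIGKILL is signal 9.
print(describe_exit_status(137))
# A seccomp kill action shows up as 128 + SIGSYS; the number is platform-dependent.
print(describe_exit_status(128 + signal.SIGSYS))
# An ordinary failure, such as nc giving up after socket() returned an error.
print(describe_exit_status(1))
```

You can feed it the value from docker inspect --format '{{.State.ExitCode}}' for any stopped container.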

Let’s look at the container logs to see what happened:

docker logs netcat-secure

Because the call failed with an error instead of the process being killed, nc had the chance to report the problem, so you should see its own error message saying the operation is not permitted. Kill actions are different: the process dies before it can log anything, and the evidence lands in the kernel audit log, not in the container or Docker daemon logs. On systems using systemd, you can often find it with:

sudo journalctl -k | grep 'type=1326'

With a kill action, you’ll see an audit record of type 1326 (SECCOMP) naming the signal and the syscall number; on hosts running auditd, the same record usually goes to the audit log file instead. With our SCMP_ACT_ERRNO profile there is normally no such record: if we traced nc inside the container, we’d see the socket() call return -1 EPERM (Operation not permitted), after which nc exits by itself. Which of the two outcomes you get is decided entirely by the action in the profile.

The core idea here is that seccomp works by defining a whitelist or blacklist of system calls. When a process makes a syscall, the kernel checks it against the active seccomp filter before executing it. If the syscall is on the blacklist (or not on the whitelist, depending on the profile), the kernel applies the action the profile assigns to it. SCMP_ACT_ERRNO is one of the most common actions, causing the syscall to return a specified error code without ever running. For syscalls whose mere attempt suggests a compromise (like socket in a container that should never touch the network, or execve in one that should never start new programs), a profile can use SCMP_ACT_KILL instead, which terminates the offender with SIGSYS. The kernel never escalates on its own; it does exactly what the filter says.
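The lookup described above can be modelled in a few lines. This is a deliberately simplified toy, not how the kernel works internally (the real filter is a BPF program, and real profiles can also match on syscall arguments, which this sketch ignores). The function name resolve_action and the sample profile are assumptions made for illustration.

```python
def resolve_action(profile, syscall_name):
    """Return (action, errno) for a syscall under a Docker-style profile dict.

    Simplified model: the first rule that names the syscall wins, and
    argument conditions are not considered.
    """
    for rule in profile.get("syscalls", []):
        if syscall_name in rule["names"]:
            return rule["action"], rule.get("errnoRet")
    # No rule matched, so the profile's default applies.
    return profile["defaultAction"], None


# Hypothetical profile: deny socket with EPERM, kill on ptrace, allow the rest.
sample_profile = {
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
        {"names": ["socket"], "action": "SCMP_ACT_ERRNO", "errnoRet": 1},
        {"names": ["ptrace"], "action": "SCMP_ACT_KILL"},
    ],
}

for name in ("read", "socket", "ptrace"):
    action, errno_ret = resolve_action(sample_profile, name)
    print(f"{name}: {action} (errno={errno_ret})")
```

The point of the model is that the outcome is a property of the rule, not of the syscall: the same socket call could be allowed, failed, or fatal depending on what the profile says.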

With a kill action, this isn’t about preventing the attempt, but about punishing the attempt by termination; with an errno action, the attempt simply fails. Either way the disallowed syscall never executes. That makes seccomp a powerful defense-in-depth mechanism: even if an application within the container has a vulnerability that could be exploited to trigger a disallowed syscall, the call is stopped before it can do any damage.

The default Docker seccomp profile is quite permissive: it is a whitelist, but a broad one that allows the large majority of syscalls and blocks only a few dozen. For hardened environments, you’ll want to create custom profiles that only permit what’s absolutely necessary. This involves understanding what your application actually needs to do at the syscall level. Tools like strace are invaluable for this, but remember that strace itself depends on ptrace, which your profile might block if you’re not careful.
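One way to get a first draft of such a profile is to collect the syscall names from an strace log and turn them into a whitelist. The sketch below assumes the usual strace line format (an optional PID prefix, then name(args) = result); the helper names and the sample log are invented for this example. Treat the result as a starting point only, since a single trace rarely exercises every code path and the container runtime needs some syscalls of its own.

```python
import json
import re

# Matches "name(" at the start of a line, after an optional "[pid N]" or bare PID prefix.
SYSCALL_LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(?:\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(text):
    """Return the sorted set of syscall names found in strace output."""
    names = set()
    for line in text.splitlines():
        match = SYSCALL_LINE.match(line.strip())
        if match:
            names.add(match.group(1))
    return sorted(names)


def build_allowlist_profile(allowed, errno_ret=1):
    """Return a profile that fails everything except the `allowed` syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "defaultErrnoRet": errno_ret,
        "syscalls": [{"names": list(allowed), "action": "SCMP_ACT_ALLOW"}],
    }


# Illustrative log only, not captured from a real run.
SAMPLE_LOG = """\
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
bind(3, {sa_family=AF_INET}, 16) = 0
[pid  1234] listen(3, 1) = 0
--- SIGCHLD {si_signo=SIGCHLD} ---
+++ exited with 0 +++
"""

allowed = syscalls_from_strace(SAMPLE_LOG)
print(json.dumps(build_allowlist_profile(allowed), indent=2))
```

Signal and exit markers are skipped because they do not start with a syscall name followed by an opening parenthesis.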

The next challenge you’ll encounter is dealing with applications that require a surprisingly large number of syscalls, forcing you to build very complex seccomp profiles that are hard to maintain.
