containerd’s default security posture is surprisingly permissive, often leaving your containers vulnerable to host compromise through unconstrained syscalls or privileged access.
Let’s see what containerd is actually doing on a running host.
# On the host, list the running containerd processes
ps aux | grep '[c]ontainerd'
# Record the PID of the containerd daemon itself (-o picks the oldest match, -x matches the name exactly)
CONTAINERD_PID=$(pgrep -o -x containerd)
# Find the PID of a specific container's process (e.g., a simple busybox container)
CONTAINER_PID=$(ps -ef | grep '[b]usybox' | awk '{print $2}' | head -n 1)
# Show the container's seccomp mode (0 = disabled, 1 = strict, 2 = filter) and its capability sets
grep -E '^(Seccomp|NoNewPrivs|Cap(Inh|Prm|Eff|Bnd)):' "/proc/$CONTAINER_PID/status"
# Attach strace from the host to watch security-sensitive syscalls as the container makes them
sudo strace -f -p "$CONTAINER_PID" -e trace=execve,execveat,socket,connect,bind,listen,accept,mount,umount2,setuid,setgid,ptrace,kill,reboot,pivot_root,chroot,setns,unshare,clone,capset,capget,seccomp
# If the Seccomp field is 0, no filter is applied and every syscall above is available to the container.
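The Seccomp field in /proc/<pid>/status is the quickest signal of whether any filter is active. As a minimal sketch, the helper below maps that field to a readable label. The function name seccomp_mode is hypothetical (it is not part of containerd), and the sketch assumes the standard kernel encoding of 0, 1, and 2.

```shell
# seccomp_mode FILE: print the seccomp mode recorded in a /proc/<pid>/status file.
# Kernel encoding: 0 = disabled, 1 = strict, 2 = filter.
seccomp_mode() {
    sm_value=$(awk '/^Seccomp:/ {print $2}' "$1")
    case "$sm_value" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}
```

It could be called as seccomp_mode "/proc/$CONTAINER_PID/status"; a result of "disabled" means the container runs without any syscall filter.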
The primary goal of hardening containerd is to drastically reduce the attack surface by limiting what a container process can do on the host. This involves three main mechanisms: Seccomp, AppArmor, and rootless mode.
Seccomp: The Syscall Firewall
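Seccomp (Secure Computing Mode) is a Linux kernel feature that allows a process to restrict the set of system calls it can make. containerd ships a moderately restrictive built-in Seccomp profile, but it’s far from locked down, and whether it is applied at all depends on how the container is launched: under the CRI plugin, a container that requests no profile runs unconfined unless you configure otherwise. For production, you want to apply a much stricter profile.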
The Problem: A compromised container might try to escape its sandbox by making malicious system calls, like mount to access host filesystems, pivot_root to change the root directory, or unshare to create new namespaces and gain more privileges.
Diagnosis: Check which Seccomp profile containerd applies. The containerd configuration file is often at /etc/containerd/config.toml. With the CRI plugin, look for the unset_seccomp_profile setting under plugins."io.containerd.grpc.v1.cri" (the section name used by containerd 1.x), which selects the profile for containers that do not request one. If it is empty, those containers run unconfined; containers that request runtime/default get containerd’s built-in profile.
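When inspecting a profile file, the first thing worth checking is its defaultAction, since that decides what happens to every syscall the profile does not list. The sketch below is a plain text match, not a JSON parser: it assumes a pretty-printed profile where the key and value share one line. The function names are hypothetical; if jq is installed, jq -r .defaultAction on the file is the more robust choice.

```shell
# profile_default_action FILE: print the defaultAction of a seccomp profile.
# Plain text match; assumes "defaultAction": "..." sits on a single line.
profile_default_action() {
    sed -n 's/.*"defaultAction"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p' "$1" | head -n 1
}

# profile_is_allowlist FILE: succeed only if unlisted syscalls are rejected,
# i.e. the profile is an allowlist rather than a denylist.
profile_is_allowlist() {
    case "$(profile_default_action "$1")" in
        SCMP_ACT_ERRNO|SCMP_ACT_KILL|SCMP_ACT_KILL_PROCESS|SCMP_ACT_KILL_THREAD|SCMP_ACT_TRAP) return 0 ;;
        *) return 1 ;;
    esac
}
```

A profile whose default action is SCMP_ACT_ALLOW or SCMP_ACT_LOG fails the second check, which is a sign that it only blocks an explicit list of syscalls.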
Common Causes & Fixes:
- Default Profile Too Permissive: The built-in default profile allows many syscalls that are not necessary for most containerized applications.
  - Diagnosis: Examine the default profile. You can find it in the containerd source code (contrib/seccomp/seccomp_default.go) or by inspecting a running container’s Seccomp filters (though this is complex).
  - Fix: Start from a community-maintained Seccomp profile and trim it to what your workloads need. A good starting point is the one provided by Docker, which containerd’s own default closely follows. For example, you can fetch the default.json profile from moby/moby:
        sudo wget https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json -O /etc/containerd/seccomp_profile.json
    Then, update your containerd configuration. The value uses CRI profile naming (runtime/default, unconfined, or localhost/ followed by the file path); confirm the exact form against the configuration reference for your containerd version:
        # /etc/containerd/config.toml (containerd 1.x CRI plugin section)
        [plugins."io.containerd.grpc.v1.cri"]
          # Profile applied when the CRI request does not name one
          unset_seccomp_profile = "localhost//etc/containerd/seccomp_profile.json"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
          SystemdCgroup = true
    Restart containerd: sudo systemctl restart containerd.
  - Why it Works: This profile denies every syscall by default and allows only a curated list deemed safe and necessary for general container operation (like read, write, openat, close, futex, etc.), drastically reducing the kernel attack surface.
- Application-Specific Syscall Needs: Even strict profiles might block syscalls required by your specific application (e.g., network-intensive apps needing socket or bind, or apps using ptrace for debugging).
  - Diagnosis: Run your application with the strict profile and watch the containerd logs or dmesg for Seccomp denial messages. The kernel does not log syscalls that are rejected with an errno, so it can help to temporarily switch the profile’s default action to SCMP_ACT_LOG, or to use strace within the container (if allowed by the profile) to see which syscalls are being attempted.
  - Fix: Manually edit your seccomp_profile.json to add the specific syscalls your application needs. For instance, to allow socket and bind:
        // Inside the "syscalls" array in your seccomp_profile.json
        { "names": ["socket", "bind"], "action": "SCMP_ACT_ALLOW" },
    Important: Be very judicious here. Only add what’s absolutely necessary. Recreate the affected containers after modifying the profile, since a running container keeps the filter it started with.
  - Why it Works: You’re creating a custom, fine-grained allowlist for your application’s unique requirements, balancing security with functionality.
- Incorrect Profile Path: containerd can’t find or load the specified Seccomp profile.
  - Diagnosis: Check the containerd logs (sudo journalctl -u containerd -f) for errors related to Seccomp loading or file access. Ensure the path specified in config.toml is correct and that the user containerd runs as (root in a standard install) has read permission for the file.
  - Fix: Correct the profile path in /etc/containerd/config.toml and ensure ownership and permissions are set:
        sudo chown root:root /etc/containerd/seccomp_profile.json
        sudo chmod 644 /etc/containerd/seccomp_profile.json
    Then restart containerd.
  - Why it Works: containerd can now correctly locate and parse the Seccomp policy, applying the intended restrictions.
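The denial-hunting step in the list above can be partly automated. When a syscall is logged or killed by a filter, the kernel emits an audit record of type 1326 that carries the command name and the syscall number. The sketch below summarizes those records from text on standard input. The function name seccomp_denials is hypothetical, and the parsing assumes the usual space-separated key=value layout and a command name without spaces.

```shell
# seccomp_denials: read kernel log text on stdin and print one
# "command syscall-number" pair per distinct seccomp audit event (type=1326).
seccomp_denials() {
    awk '/type=1326/ {
        comm = ""; nr = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^comm=/)    { comm = $i; sub(/^comm=/, "", comm); gsub(/"/, "", comm) }
            if ($i ~ /^syscall=/) { nr = $i; sub(/^syscall=/, "", nr) }
        }
        if (nr != "") print comm, nr
    }' | sort -u
}
```

A typical invocation would be sudo dmesg | seccomp_denials. The numbers are architecture-specific syscall numbers, so they still need to be mapped to names before being added to a profile.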
AppArmor: The Process Confinement System
AppArmor is another Linux security module that confines programs to a predetermined set of resources. It works by defining profiles for specific executables, dictating what files they can access, what network operations they can perform, and more. containerd can leverage AppArmor to further restrict the runtime environment.
The Problem: Even with Seccomp limiting syscalls, a container might still have broad access to the host filesystem, network interfaces, or sensitive host processes if its user context or the container runtime itself is compromised.
Diagnosis: Check if AppArmor is enabled on your host (sudo aa-status). Look for container profiles in the output, such as cri-containerd.apparmor.d (the default that containerd’s CRI plugin loads) or docker-default. You can also check the containerd configuration (/etc/containerd/config.toml) for AppArmor-related settings, such as disable_apparmor under plugins."io.containerd.grpc.v1.cri".
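For scripted checks, where parsing aa-status output is awkward, the kernel also exposes the AppArmor state as a single character in /sys/module/apparmor/parameters/enabled. A minimal sketch follows; the function name apparmor_enabled is hypothetical, and the optional argument exists only so the check can be pointed at another file.

```shell
# apparmor_enabled [FILE]: succeed if the kernel reports AppArmor as enabled.
# FILE defaults to the kernel's module parameter, which contains Y or N.
apparmor_enabled() {
    ae_file=${1:-/sys/module/apparmor/parameters/enabled}
    [ -r "$ae_file" ] && [ "$(cat "$ae_file")" = "Y" ]
}
```

It can be used as a guard, for example: apparmor_enabled || echo "AppArmor is not active on this host".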
Common Causes & Fixes:
- AppArmor Not Enabled or Profiles Missing: AppArmor is disabled on the host, or the necessary container profiles aren’t installed or loaded.
  - Diagnosis: Run sudo aa-status. If AppArmor is not running, you’ll need to enable it (OS-dependent, often involves kernel boot parameters like apparmor=1 security=apparmor). Check whether the profiles you expect, such as containerd-runc or docker-default, are listed as loaded or exist in /etc/apparmor.d/.
  - Fix: Ensure AppArmor is enabled. Install the AppArmor utilities (sudo apt install apparmor apparmor-utils on Debian or Ubuntu; RHEL-family distributions ship SELinux rather than AppArmor). Install or ensure profiles are present. Many distributions include them. If not, you might find them in containerd’s or Docker’s source. Load them:
        sudo apparmor_parser -r -W /etc/apparmor.d/containerd-runc
        # Or for Docker's profile if using that
        sudo apparmor_parser -r -W /etc/apparmor.d/docker-default
    Ensure your containerd config does not disable AppArmor. In config.toml (containerd 1.x CRI plugin section):
        [plugins."io.containerd.grpc.v1.cri"]
          # AppArmor support stays on as long as this is false (the default)
          disable_apparmor = false
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
          SystemdCgroup = true
    Restart containerd: sudo systemctl restart containerd.
  - Why it Works: AppArmor is now active on the host, and containerd is configured to apply its security profiles to container runtimes, confining them.
- Application Violating AppArmor Profile: Your application inside the container is trying to perform an action disallowed by the AppArmor profile (e.g., writing to a protected file, accessing a network port it shouldn’t).
  - Diagnosis: Check dmesg or syslog for AppArmor denial messages. These contain apparmor="DENIED" followed by the operation, the profile, and the resource that was refused.
  - Fix: This is the most complex case. You have three main options:
    - Adjust the AppArmor Profile: If the action is legitimate, you can use aa-logprof to learn from the denials and update the AppArmor profile in /etc/apparmor.d/. This requires understanding AppArmor syntax.
    - Use a Different Profile: If containerd is using a generic profile (like docker-default), it might be too restrictive for the workload. For specific workloads, you can create a custom AppArmor profile for your container image that grants only the extra access it needs.
    - Disable AppArmor for Specific Containers (Last Resort): You can disable AppArmor for a specific container using the --security-opt apparmor=unconfined flag when running a container with docker run, or the equivalent configuration in Kubernetes or containerd’s API. For containerd directly, this might involve omitting the AppArmor profile in the container’s runtime options.
  - Why it Works: Either the AppArmor rules are updated to permit the necessary actions, or the restriction is selectively lifted for workloads that cannot be confined.
- Container Runtime Not Using AppArmor: containerd is configured to use AppArmor, but the underlying runtime (like runc) isn’t correctly configured or is bypassed.
  - Diagnosis: Ensure AppArmor support is not disabled in containerd’s config (disable_apparmor = false in the CRI plugin section). Verify that the runc executable is properly installed, then read /proc/<pid>/attr/current for a container process: it should show a profile name and mode rather than unconfined.
  - Fix: This is usually tied to that containerd setting and to the AppArmor profiles being correctly parsed and loaded by the kernel. If runc itself is misconfigured or not correctly communicating with the kernel’s LSM (Linux Security Module) framework, it might require reinstallation or specific kernel module loading.
  - Why it Works: containerd correctly instructs runc (or its chosen runtime) to apply the AppArmor confinement when launching the container process.
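The AppArmor denial messages mentioned in the list above are easier to act on once they are reduced to profile, operation, and target. As a rough sketch, the helper below does that for kernel log text on standard input. The function name apparmor_denials is hypothetical, and the parsing assumes the usual key="value" layout and targets without spaces.

```shell
# apparmor_denials: read kernel log text on stdin and print one
# "profile operation target" line per distinct AppArmor denial.
apparmor_denials() {
    awk '
    # Strip the key and the surrounding quotes from a key="value" field.
    function val(field) { sub(/^[^=]*=/, "", field); gsub(/"/, "", field); return field }
    /apparmor="DENIED"/ {
        profile = op = name = "-"
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^profile=/)   profile = val($i)
            if ($i ~ /^operation=/) op = val($i)
            if ($i ~ /^name=/)      name = val($i)
        }
        print profile, op, name
    }' | sort -u
}
```

Running sudo dmesg | apparmor_denials gives a short list of what each profile refused, which is a useful starting point before reaching for aa-logprof.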
Rootless Mode: Isolating from the Host User
Rootless mode allows containerd and containers to run as a non-root user on the host. This is a significant security enhancement because it means a compromise within the container (or even containerd itself) doesn’t automatically grant root access to the host.
The Problem: By default, containerd runs as a system daemon with root privileges. If the containerd process is compromised, or an attacker escapes from a container it manages, the attacker can end up with root on the host.
Diagnosis: Check if containerd is running as root.
ps aux | grep containerd
If the USER column shows root, it’s running in rootful mode.
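Reading the USER column by eye gets tedious across many hosts. A minimal sketch of the same check as a filter follows; the function name containerd_users is hypothetical, and it expects two-column "user command" input such as the output of ps -eo user=,comm=.

```shell
# containerd_users: read "user command" pairs on stdin and print each distinct
# user that owns a process whose command name is exactly "containerd".
containerd_users() {
    awk '$2 == "containerd" {print $1}' | sort -u
}
```

For example, ps -eo user=,comm= | containerd_users prints root on a rootful host and the unprivileged account name on a rootless one.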
Common Causes & Fixes:
- Rootless containerd Daemon Not Running: You haven’t configured or started the rootless containerd service.
  - Diagnosis: Check if containerd is running as root. If so, the rootless daemon is not active.
  - Fix: Follow the official containerd documentation for setting up rootless mode. In practice the daemon is usually launched inside a user namespace through RootlessKit, for example with the containerd-rootless-setuptool.sh script that ships with nerdctl. This typically involves:
    - Installing containerd as a regular user.
    - Configuring ~/.config/containerd/config.toml for rootless operation (e.g., keeping its root, state, and socket paths in directories the user owns).
    - Setting up appropriate user namespaces (subuid/subgid mappings).
    - Starting the rootless containerd service using systemd --user or similar.
        # Example systemd user service file (e.g., ~/.config/systemd/user/containerd.service)
        [Unit]
        Description=containerd rootless
        After=network.target
        [Service]
        # %h expands to the user's home directory; systemd does not expand ~
        ExecStart=/usr/local/bin/containerd --config %h/.config/containerd/config.toml
        Restart=always
        [Install]
        WantedBy=default.target
    Then enable and start:
        systemctl --user enable containerd.service
        systemctl --user start containerd.service
  - Why it Works: containerd now operates entirely within the unprivileged user’s context, significantly limiting its ability to affect the host system.
- Inadequate User Namespace Configuration: User namespace remapping (essential for rootless containers) is not correctly set up.
  - Diagnosis: Rootless containers will fail to start, often with errors related to user ID mapping or permissions. Check containerd logs for errors like "failed to set up user namespace" or permission denied.
  - Fix: Ensure you have sufficient entries in /etc/subuid and /etc/subgid for the user running containerd. For example, to map 65536 UIDs and GIDs starting from a high number:
        # As the user running containerd
        sudo usermod --add-subuids 100000-165535 $USER
        sudo usermod --add-subgids 100000-165535 $USER
    Then restart the rootless containerd service.
  - Why it Works: User namespaces allow the container to believe it’s running as root (UID 0) within its own isolated environment, while on the host, these UIDs are mapped to unprivileged user IDs, preventing true root escalation.
- Networking Issues in Rootless Mode: Rootless networking often relies on user-mode networking stacks like slirp4netns, which can have performance limitations or compatibility issues compared to rootful networking.
  - Diagnosis: Containers can’t reach external services, or external services can’t reach containers. Network performance is poor.
  - Fix: Ensure slirp4netns (or your chosen user-mode network stack) is installed and correctly configured. For more advanced networking, consider alternatives such as pasta if they are available in your rootless setup. Some advanced configurations might involve setting up VPNs or specific tunnel interfaces.
  - Why it Works: The user-mode network stack correctly bridges the container’s network namespace to the host’s network interface without requiring root privileges.
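The subordinate ID requirement from the user namespace item above is easy to verify before starting the daemon. Each line in /etc/subuid and /etc/subgid has the form name:first_id:count, so the total allocation for a user is the sum of the third field. A minimal sketch, with the hypothetical function name subid_count:

```shell
# subid_count USER [FILE]: print the total number of subordinate IDs assigned
# to USER. FILE defaults to /etc/subuid; pass /etc/subgid to check group IDs.
subid_count() {
    awk -F: -v user="$1" '$1 == user {total += $3} END {print total + 0}' "${2:-/etc/subuid}"
}
```

A result below 65536 from subid_count "$USER" or subid_count "$USER" /etc/subgid suggests the mapping is too small for typical container images.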
By implementing strict Seccomp profiles, leveraging AppArmor, and running containerd in rootless mode, you create a significantly more secure environment for your containerized workloads, minimizing the impact of potential vulnerabilities.
The next hurdle you’ll likely face is managing these security policies effectively across a fleet of containers and ensuring your CI/CD pipelines integrate with these hardening measures without becoming brittle.