Running containers under containerd doesn’t automatically mean they’re strongly isolated from each other or the host. By default, every container shares the host kernel: namespaces and cgroups separate what each one can see and use, but a kernel bug is exposed to all of them. To achieve stronger isolation, you need a sandboxed runtime such as gVisor or Kata Containers.

Let’s see gVisor in action. Imagine you have a simple web server running in a container. Normally, its processes make system calls straight to the host kernel, with only namespaces, cgroups, and the runtime’s default restrictions in between.

# This is a standard container, no special isolation
docker run -d -p 8080:80 nginx

Now, let’s run the same nginx container, but this time, gVisor is intercepting its system calls.

# Using gVisor's runsc runtime to run the nginx container
# (host port 8081, because the first container already holds 8080)
docker run --runtime=runsc -d -p 8081:80 nginx
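
For Docker to start a container under gVisor, the gVisor runtime binary (runsc) has to be registered with the daemon and then selected with --runtime=runsc. The following is a minimal sketch of that registration, not a definitive setup: the install path /usr/local/bin/runsc is an assumption, and the script writes to a scratch file instead of the real /etc/docker/daemon.json so it is safe to run as-is.

```shell
#!/bin/sh
# Sketch: generate a Docker daemon config fragment that registers gVisor's
# runsc binary under the runtime name "runsc".
# ASSUMPTION: runsc lives at /usr/local/bin/runsc; adjust for your install.
RUNSC_PATH="${RUNSC_PATH:-/usr/local/bin/runsc}"

# Write to a scratch file so nothing on the host is modified; the real
# location is normally /etc/docker/daemon.json.
DAEMON_JSON="$(mktemp)"

cat > "$DAEMON_JSON" <<EOF
{
  "runtimes": {
    "runsc": {
      "path": "$RUNSC_PATH"
    }
  }
}
EOF

# Show the fragment that would be merged into the daemon config.
cat "$DAEMON_JSON"
```

After merging a fragment like this into the daemon config and restarting Docker, docker info should list runsc among the available runtimes.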

gVisor acts as a middleman. Instead of the container’s processes making system calls directly to the host kernel, the calls go to gVisor’s user-space kernel, which handles them itself and presents a restricted view of the system to the container. This is why it’s often called a "sandbox." Kata Containers reaches a similar goal by a different route: it runs the workload inside a lightweight virtual machine with its own guest kernel.

The core problem this solves is the shared-kernel vulnerability. If a container exploits a kernel bug, it can potentially gain elevated privileges on the host. By routing the container’s system calls through gVisor or Kata Containers, you shrink the part of the host kernel that the container can reach and put another barrier in front of it. The sandbox effectively becomes a second kernel that an attacker must compromise first, which significantly strengthens the security boundary.

Here’s how it works under the hood with gVisor:

  1. System Call Interception: gVisor ships its own OCI runtime, runsc, which is a drop-in replacement for runc (the default OCI runtime for Docker and containerd) and can be registered with Docker or configured directly in containerd. When a process inside the container makes a system call (e.g., open(), read(), write(), socket()), gVisor intercepts it before it reaches the host kernel.
  2. User-Space Kernel: gVisor has its own implementation of many common Linux system calls within its Go-based "Sentry" process. This Sentry process runs in user space.
  3. Call Emulation/Translation: gVisor’s Sentry services the system call itself (e.g., for file system operations, it maps container paths to host-backed files under its own control) and issues only a narrow set of host system calls to do so. Calls that gVisor does not implement fail with an error inside the container instead of being passed through to the host.
  4. Limited Host Access: The container only sees the world as gVisor presents it. This means it has a virtualized file system, a virtualized network stack, and restricted access to host devices.
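
One practical consequence of the steps above is that a sandboxed container sees the Sentry’s kernel log, not the host’s. The sketch below turns that into a quick check. It is a heuristic under a stated assumption (that the sandbox’s dmesg output mentions "gVisor"), the helper name looks_like_gvisor is made up for illustration, and the sample lines are hand-written stand-ins rather than captured output.

```shell
#!/bin/sh
# Sketch: decide whether kernel log text looks like it came from gVisor's
# Sentry rather than from a host Linux kernel.
# ASSUMPTION: dmesg inside the sandbox mentions "gVisor"; treat this as a
# heuristic, not a guarantee.
looks_like_gvisor() {
  # Reads kernel log text on stdin; succeeds if any line mentions gVisor.
  grep -q 'gVisor'
}

# Real usage (needs Docker with the runsc runtime registered):
#   docker run --rm --runtime=runsc ubuntu dmesg | looks_like_gvisor && echo sandboxed

# Offline demonstration with hand-written sample lines.
if printf 'Starting gVisor...\n' | looks_like_gvisor; then
  echo "sample 1: looks like gVisor"
fi
if ! printf 'sample line from an ordinary kernel log\n' | looks_like_gvisor; then
  echo "sample 2: does not look like gVisor"
fi
```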

The key levers you control are:

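  • Runtime Configuration: You specify which runtime to use. For containerd, this is done in the config.toml file; in version 2 configs the CRI runtimes live under plugins."io.containerd.grpc.v1.cri".containerd.runtimes. For Docker, you register the runtime with the daemon and select it per container with the --runtime flag.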
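  • Seccomp Profiles: While gVisor and Kata Containers provide their own strong isolation, you can layer seccomp (Secure Computing Mode) on top for even finer-grained control over allowed system calls. Check how your chosen runtime handles container seccomp profiles, since a sandboxed runtime does not necessarily apply them the same way runc does.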
  • Network and Storage Isolation: Even with sandboxing, you still configure network policies and storage volumes. The sandbox adds a layer of protection around these configurations.
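
As an illustration of the runtime-configuration lever, here is a minimal sketch of a containerd fragment that adds a gVisor handler named runsc. It assumes a version 2 config with the CRI plugin and gVisor’s containerd shim installed on the PATH; newer containerd releases use a different config version and plugin key, so check the layout your installation expects. The script writes to a scratch file rather than /etc/containerd/config.toml.

```shell
#!/bin/sh
# Sketch: emit a containerd CRI config fragment that adds a "runsc" handler.
# ASSUMPTIONS: containerd config version 2 with the CRI plugin enabled, and
# gVisor's containerd shim (containerd-shim-runsc-v1) installed on the PATH.
CONFIG_FRAGMENT="$(mktemp)"

# Quoted heredoc delimiter: the TOML is written literally, with no expansion.
cat > "$CONFIG_FRAGMENT" <<'EOF'
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
EOF

# Show the fragment that would be merged into config.toml.
cat "$CONFIG_FRAGMENT"
```

After merging a fragment like this into the real config and restarting containerd, workloads that request the runsc handler are started through the gVisor shim, while everything else keeps using the default runtime.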

What most people don’t realize is that gVisor doesn’t emulate the entire Linux kernel. It focuses on the most commonly used system calls and those most relevant to application execution. This selective emulation is what allows it to achieve good performance while still providing substantial security benefits. It means some esoteric or highly privileged system calls might not be supported, and applications relying on them could fail or behave unexpectedly.
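
The effect of selective emulation can be pictured with a toy dispatcher. This is only a mental model, not gVisor code: the real Sentry is written in Go and implements far more calls, and the function name handle_syscall and the call name made_up_call are invented for the illustration.

```shell
#!/bin/sh
# Toy model of selective emulation. This is NOT gVisor code; it only shows
# the shape of "implemented calls succeed, everything else fails".
handle_syscall() {
  case "$1" in
    open|read|write|socket)
      # Calls in the implemented set are serviced by the user-space kernel.
      echo "$1: handled in user space"
      ;;
    *)
      # Anything else fails inside the sandbox instead of reaching the host.
      echo "$1: not implemented"
      return 38   # numeric value of ENOSYS on Linux
      ;;
  esac
}

handle_syscall read
handle_syscall made_up_call || echo "exit status: $?"
```

An application that depends on a call outside the implemented set sees the failure directly, which is why testing a workload under the sandboxed runtime before deploying it is worthwhile.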

The next step is to understand how to integrate these sandboxed runtimes seamlessly into Kubernetes deployments.
