eBPF can trace system calls across processes, but it does not patch or replace the kernel’s system call entry path. Instead, it attaches small programs to hooks the kernel already exposes, and the syscall tracepoints are one set of those hooks.
Let’s see this in action. Imagine we want to see every read system call made by a specific process, say nginx (PID 12345).
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_read /pid == 12345/ { printf("PID %d read %d bytes from fd %d\n", pid, args->count, args->fd); }'
This command uses bpftrace, a high-level tracing language for eBPF.
tracepoint:syscalls:sys_enter_read: This tells bpftrace to attach to the sys_enter_read tracepoint. Tracepoints are static, low-overhead markers that kernel developers place at well-defined points in the kernel; this one fires on entry to the read system call.
/pid == 12345/: This is a bpftrace filter (also called a predicate). It ensures that the subsequent action (the printf statement) only executes if the process ID (PID) of the process making the system call is 12345.
{ printf("PID %d read %d bytes from fd %d\n", pid, args->count, args->fd); }: This is the action to perform when the tracepoint is hit and the filter passes. pid is a built-in bpftrace variable for the current process ID. args->count and args->fd are arguments passed to the sys_enter_read tracepoint, representing the number of bytes the caller asked to read and the file descriptor, respectively. Because this is the entry tracepoint, count is the requested size, not the number of bytes actually returned.
When you run this, you’ll see output like:
PID 12345 read 4096 bytes from fd 3
PID 12345 read 1024 bytes from fd 5
PID 12345 read 8192 bytes from fd 3
This is incredibly powerful because you’re not modifying the kernel or relying on slow, high-overhead debugging tools. eBPF programs run in a sandboxed environment within the kernel, and they are verified for safety before execution.
The core problem eBPF solves here is the need for fine-grained, low-overhead visibility into kernel behavior, specifically system call activity, without requiring kernel recompilation or invasive debugging agents. Traditional tools like strace are built on ptrace: the kernel stops the traced process at every system call entry and exit and switches to the strace process so it can inspect the call. Those extra stops and context switches add significant overhead. eBPF, by running code directly in the kernel at specific probe points (like tracepoints), avoids handing control to a separate process on every call.
The mental model for this is that you’re not adding new tracing logic to the kernel source; you’re using hooks that already exist (tracepoints and kprobes in the kernel, uprobes for user-space code) and injecting small, verified eBPF programs that execute within the kernel at those hooks. These programs can then inspect the data available at the hook (like the arguments to a system call) and emit events or aggregate data.
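As a minimal sketch of the "aggregate data" case, the script below counts read calls and sums the requested byte counts per file descriptor inside the kernel, so only the totals are handed to user space when tracing stops. The file name read_by_fd.bt and the map names @calls and @requested are illustrative choices, and PID 12345 is the example PID used above.

```shell
# Write the bpftrace script to a file (the name is arbitrary).
cat > read_by_fd.bt <<'EOF'
// Aggregate read() calls for one process, keyed by file descriptor.
tracepoint:syscalls:sys_enter_read
/pid == 12345/
{
    @calls[args->fd] = count();              // number of read() calls per fd
    @requested[args->fd] = sum(args->count); // bytes requested per fd
}
EOF

# Then, as root: sudo bpftrace read_by_fd.bt   (Ctrl-C prints both maps)
```

Compared with printing one line per call, this keeps the per-event work to a map update, which matters for processes that issue many reads per second.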
The tracepoint subsystem is a collection of static markers placed by kernel developers at key points in kernel execution paths. When an event occurs at a tracepoint, the kernel can be instructed to run an eBPF program. This is distinct from kprobes, which allow you to dynamically probe arbitrary kernel functions (though they can be more fragile as function signatures might change between kernel versions). syscalls:sys_enter_* tracepoints are specifically designed to signal the entry into a system call.
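Each sys_enter_* tracepoint has a sys_exit_* counterpart that fires when the call returns and exposes the return value as args->ret. The sketch below pairs the two for read, so the requested size can be compared with what the kernel actually returned. The file name read_ret.bt and the map names @fd and @req are illustrative, and the per-thread keying on tid assumes each thread has at most one read in flight.

```shell
# Write the bpftrace script to a file (the name is arbitrary).
cat > read_ret.bt <<'EOF'
// On entry, remember the arguments for this thread.
tracepoint:syscalls:sys_enter_read
/pid == 12345/
{
    @fd[tid] = args->fd;
    @req[tid] = args->count;
}

// On exit, report requested vs. returned bytes.
// The @req[tid] check skips exits whose entry was not seen
// (and, as a side effect, reads that requested 0 bytes).
tracepoint:syscalls:sys_exit_read
/pid == 12345 && @req[tid]/
{
    printf("PID %d fd %d: requested %d, returned %d\n",
           pid, @fd[tid], @req[tid], args->ret);
    delete(@fd[tid]);
    delete(@req[tid]);
}
EOF

# Then, as root: sudo bpftrace read_ret.bt
```

A negative return value is an error code (for example, -11 for EAGAIN on a non-blocking descriptor), and a value smaller than the request is a short read.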
You can also capture arguments for any system call. For instance, to see write calls:
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_write /pid == 12345/ { printf("PID %d writing %d bytes to fd %d\n", pid, args->count, args->fd); }'
And to see all system calls for a process:
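sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* /pid == 12345/ { printf("PID %d (%s) called %s\n", pid, comm, probe); }'
Here, comm is the name of the process making the call, and probe is the full name of the tracepoint that fired (for example, tracepoint:syscalls:sys_enter_read), which identifies the system call. The * is a wildcard matching all sys_enter_* tracepoints. Because this attaches to a few hundred tracepoints, startup can take a few seconds.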
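Printing a line for every system call gets noisy quickly. A common alternative, sketched here, is to count calls per tracepoint and look at the totals afterwards. The file name syscall_counts.bt is illustrative; probe is the bpftrace built-in holding the full name of the tracepoint that fired.

```shell
# Write the bpftrace script to a file (the name is arbitrary).
cat > syscall_counts.bt <<'EOF'
// Count system calls made by one process, keyed by tracepoint name.
tracepoint:syscalls:sys_enter_*
/pid == 12345/
{
    @[probe] = count();
}
EOF

# Then, as root: sudo bpftrace syscall_counts.bt   (Ctrl-C prints the counts)
```

The resulting map gives a quick profile of which system calls dominate, which helps decide which ones are worth tracing in detail.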
The key to understanding how args->count and args->fd work is that the kernel publishes a fixed record layout for each tracepoint, and eBPF programs attached to that tracepoint receive a pointer to a record with that layout. The bpftrace language provides a convenient abstraction over these raw structures: args behaves like a struct whose fields are the arguments of that specific tracepoint. You can list the fields a tracepoint provides with sudo bpftrace -lv tracepoint:syscalls:sys_enter_read.
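To illustrate field access beyond fd and count, the sketch below also reads the buf field of sys_enter_write, which is the user-space pointer to the data being written, and prints it as a string. The file name write_peek.bt is illustrative. This assumes the data is text; bpftrace truncates strings to its configured maximum length (64 bytes by default in many versions), and binary data will print as garbage.

```shell
# Write the bpftrace script to a file (the name is arbitrary).
cat > write_peek.bt <<'EOF'
// Show the start of each buffer passed to write() by one process.
tracepoint:syscalls:sys_enter_write
/pid == 12345/
{
    // str() copies up to args->count bytes from the user-space pointer,
    // capped at bpftrace's maximum string length.
    printf("fd %d, %d bytes: %s\n",
           args->fd, args->count, str(args->buf, args->count));
}
EOF

# Then, as root: sudo bpftrace write_peek.bt
```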
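The most surprising thing is that eBPF is not limited to observation; some program types can change kernel behavior. A program attached to a syscall tracepoint such as sys_enter_sendto is essentially an observer: it can read the call’s arguments, but it cannot block the call or drop packets. Other attachment points go further. XDP and tc programs, for example, can rewrite or drop packets before they reach or leave the network stack, and LSM programs can deny an operation. What a given program may do is determined by its program type, the eBPF helper functions available to that type, and the verifier, which together restrict your program to maintain system stability.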
Once you’ve mastered tracing specific system calls, the next logical step is to start correlating these system call events with network activity or disk I/O, which often go hand-in-hand.