Kernel functions are the bedrock of the operating system, and understanding how they behave is crucial for debugging, performance tuning, and security. eBPF, coupled with kprobes, offers a powerful and safe way to observe and even modify this kernel behavior without recompiling the kernel or loading kernel modules.
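Let’s see this in action. Imagine we want to know every time a new process is created on the system. We can attach an eBPF program to the __x64_sys_clone kernel function, which is the x86-64 entry point for the clone() system call. glibc implements fork() on top of clone(), so ordinary forks are caught as well. execve() is different: it replaces the program image of an existing process rather than creating a new one, so it does not pass through clone(). Note also that clone() creates threads, so some of the events reported here are new threads rather than new processes.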
Here’s a simplified Python script using bcc (BPF Compiler Collection) to achieve this:
from bcc import BPF

# eBPF program, attached as a kretprobe so that the return value is available
bpf_text = """
#include <uapi/linux/ptrace.h>

int trace_clone(struct pt_regs *ctx) {
    // Upper 32 bits of pid_tgid: the PID (tgid) of the calling (parent) process
    u32 pid = bpf_get_current_pid_tgid() >> 32;

    // Return value of clone: the new task's ID on success, negative errno on failure
    int new_pid = PT_REGS_RC(ctx);
    if (new_pid < 0)
        return 0;

    // Write a message to the kernel trace pipe
    bpf_trace_printk("New process created: PID %d (from parent %d)\\n", new_pid, pid);
    return 0;
}
"""

# Compile the eBPF program
b = BPF(text=bpf_text)

# Attach to the return of __x64_sys_clone. The symbol name is
# architecture-specific, e.g. __arm64_sys_clone on arm64.
b.attach_kretprobe(event="__x64_sys_clone", fn_name="trace_clone")

print("Attaching eBPF program to __x64_sys_clone. Press Ctrl+C to stop.")

# Read and print output from the BPF trace pipe
try:
    while True:
        # trace_fields() reads one line from /sys/kernel/debug/tracing/trace_pipe
        # and splits it into fields; msg is the bpf_trace_printk text.
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
        print(f"{ts:.6f} {msg.decode().strip()}")
except KeyboardInterrupt:
    print("Detaching kretprobe and exiting.")
    # Detach the probe when the script exits
    b.detach_kretprobe(event="__x64_sys_clone")
When you run this script (as root), every time a new process is spawned (e.g., by running sleep 1 in another terminal), you’ll see output like:
Attaching eBPF program to __x64_sys_clone. Press Ctrl+C to stop.
0.123456 New process created: PID 12345 (from parent 6789)
0.789012 New process created: PID 12346 (from parent 12345)
This demonstrates the core idea: you write the eBPF program in restricted C, bcc compiles and loads it, and you attach it to a specific kernel function with attach_kprobe (function entry) or attach_kretprobe (function return). The trace_fields() method then lets user space read the output generated by bpf_trace_printk.
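The `>> 32` in the eBPF program is easy to misread, so here is a small, pure-Python sketch of what it does. bpf_get_current_pid_tgid() packs two IDs into one 64-bit value: the upper 32 bits hold the thread group ID (what user space calls the PID) and the lower 32 bits hold the kernel-level thread ID. The function name `split_pid_tgid` and the sample IDs are made up for illustration.

```python
from typing import Tuple


def split_pid_tgid(pid_tgid: int) -> Tuple[int, int]:
    """Split the packed 64-bit value into (tgid, tid)."""
    tgid = pid_tgid >> 32          # thread group ID: the user-space PID
    tid = pid_tgid & 0xFFFFFFFF    # kernel-level thread ID
    return tgid, tid


# Hypothetical example: thread 6791 belonging to process 6789
packed = (6789 << 32) | 6791
print(split_pid_tgid(packed))  # → (6789, 6791)
```

For a single-threaded process the two halves are equal, which is why the distinction often goes unnoticed.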
The Kprobe Mechanism
Kprobes are a dynamic instrumentation facility in the Linux kernel. They allow you to insert probe points at almost any kernel function; the exceptions are a small set of functions the kernel blacklists from probing and functions that the compiler has inlined, which leave no symbol to attach to. When a probed function is entered (a kprobe) or returns (a kretprobe), the kernel executes a registered handler function. This handler is where our eBPF program comes in.
The beauty of eBPF here is that it runs in a sandboxed virtual machine within the kernel. Before an eBPF program can be attached, the kernel’s verifier checks it for safety: it ensures the program will terminate, won’t crash the kernel, and can only access memory it’s allowed to. This safety guarantee is what makes eBPF so revolutionary compared to traditional kernel modules.
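By the time an attach call such as b.attach_kprobe(event="__x64_sys_clone", fn_name="trace_clone") returns, bcc has done several things:
- It has compiled bpf_text into eBPF bytecode (this happens earlier, when BPF(text=bpf_text) is constructed).
- It has loaded this bytecode into the kernel, where the verifier checks it.
- It has registered a kprobe (or, for attach_kretprobe, a kretprobe) on the __x64_sys_clone kernel function.
- It has told the kernel to execute our trace_clone eBPF function whenever the probe is hit.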
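The registration step only succeeds if the named symbol exists on the running kernel, and syscall entry points are named differently per architecture. The sketch below shows one way to pick a likely symbol name and check it against /proc/kallsyms before attaching. The helper names `syscall_symbol` and `symbol_exists` are hypothetical, and the prefix table is an illustrative subset rather than a complete list.

```python
import platform
from typing import Iterable, Optional

# Syscall wrapper prefixes for two common 64-bit architectures.
# Illustrative subset only; other architectures and older kernels differ.
SYSCALL_PREFIXES = {
    "x86_64": "__x64_sys_",
    "aarch64": "__arm64_sys_",
}


def syscall_symbol(name: str, machine: Optional[str] = None) -> str:
    """Return the likely kernel symbol for a syscall such as 'clone'."""
    machine = machine or platform.machine()
    # Fall back to the plain 'sys_' prefix used by kernels without wrappers
    return SYSCALL_PREFIXES.get(machine, "sys_") + name


def symbol_exists(symbol: str, kallsyms_lines: Iterable[str]) -> bool:
    """Check whether a symbol appears in /proc/kallsyms-style lines."""
    for line in kallsyms_lines:
        fields = line.split()
        # Line format: <address> <type> <name> [module]
        if len(fields) >= 3 and fields[2] == symbol:
            return True
    return False


if __name__ == "__main__":
    sym = syscall_symbol("clone")
    try:
        with open("/proc/kallsyms") as f:
            print(sym, "found" if symbol_exists(sym, f) else "not found")
    except OSError:
        print("/proc/kallsyms is not readable on this system")
```

In a real bcc script, the library’s own helper b.get_syscall_fnname("clone") serves the same purpose and is usually the better choice.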
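The struct pt_regs *ctx argument passed to our eBPF function is a pointer to a snapshot of the processor’s registers at the time the probe hit. At function entry this gives access to the function’s arguments; at function return it gives access to the return value. PT_REGS_RC(ctx) is a macro provided by bcc that reads the return-value register, so it is only meaningful in a handler attached with attach_kretprobe. At a function-entry kprobe, that register still holds whatever earlier code left in it.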
What Problem Does This Solve?
Traditionally, observing kernel behavior required one of the following:
- SystemTap: A powerful scripting language for kernel tracing, but it requires kernel debug symbols and can have performance overhead.
- Kernel Modules: Writing and loading custom C code directly into the kernel. This is risky, as a bug can crash the entire system, and it requires recompiling for different kernel versions.
- ftrace: The kernel’s built-in tracing framework, which is powerful but often requires complex configuration and can be difficult to script.
eBPF with kprobes provides a safe, efficient, and programmable way to do this. You can monitor system calls, network events, scheduler activity, and much more, with minimal overhead.
The Internal Levers
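The struct pt_regs is your primary interface to the CPU state at the probe point, and its layout is architecture-specific. On x86-64 the fields are named after registers: following the calling convention, the first six integer arguments are in di, si, dx, cx, r8, and r9, and the return value is in ax once the function has returned (visible from a kretprobe). On arm64 the general-purpose registers are exposed as an array, regs[0] through regs[30], with arguments in regs[0] to regs[7] and the return value in regs[0]. In practice you rarely touch these fields directly: bcc’s PT_REGS_PARM1 through PT_REGS_PARM6 and PT_REGS_RC macros hide the per-architecture layout. One caveat applies to syscall entry points such as __x64_sys_clone: on kernels that use syscall wrappers (x86-64 since 4.17), the function’s only argument is a pointer to the saved user-space struct pt_regs, so the syscall’s own arguments must be read from that inner structure.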
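As a minimal sketch of reading an argument at function entry, the script below probes vfs_read and uses bcc’s PT_REGS_PARM3 macro to fetch its third argument, the requested byte count. The choice of vfs_read as the target, the handler name `trace_vfs_read`, and the `max_events` limit are assumptions made for illustration. It needs root and a working bcc installation, and vfs_read is called very frequently, so the output is capped.

```python
BPF_TEXT = r"""
#include <uapi/linux/ptrace.h>

int trace_vfs_read(struct pt_regs *ctx) {
    // vfs_read(file, buf, count, pos): the third argument is the byte count
    u64 count = PT_REGS_PARM3(ctx);
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    bpf_trace_printk("pid %d requested %llu bytes\n", tgid, count);
    return 0;
}
"""


def main(max_events: int = 10) -> None:
    """Attach to vfs_read and print the first max_events trace lines."""
    # Imported here so the module itself loads on machines without bcc
    from bcc import BPF

    b = BPF(text=BPF_TEXT)
    b.attach_kprobe(event="vfs_read", fn_name="trace_vfs_read")

    seen = 0
    while seen < max_events:
        try:
            task, pid, cpu, flags, ts, msg = b.trace_fields()
        except ValueError:
            continue  # skip trace lines that do not parse
        print(f"{ts:.6f} {task.decode()}: {msg.decode().strip()}")
        seen += 1
```

Call main() from a script run as root. Because vfs_read is an ordinary kernel function rather than a syscall wrapper, its arguments sit directly in the argument registers and no second level of pt_regs is involved.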
The bpf_trace_printk function is a simple helper for debugging, writing directly to /sys/kernel/debug/tracing/trace_pipe. For more sophisticated data collection, you’d use eBPF maps (like hash maps or arrays) to pass data efficiently between your eBPF program and a user-space application.
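To make the map idea concrete, here is a hedged sketch that counts clone() calls per calling PID in a bcc hash map and reads the totals from user space after a fixed interval. The map name `clone_counts`, the handler name `count_clone`, and the five-second default are illustrative choices. The script needs root and bcc.

```python
import time

BPF_TEXT = r"""
#include <uapi/linux/ptrace.h>

// Hash map shared with user space: key = calling PID (tgid), value = call count
BPF_HASH(clone_counts, u32, u64);

int count_clone(struct pt_regs *ctx) {
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    clone_counts.increment(tgid);
    return 0;
}
"""


def main(duration_s: float = 5.0) -> None:
    """Count clone() calls per PID for duration_s seconds, then print them."""
    # Imported here so the module itself loads on machines without bcc
    from bcc import BPF

    b = BPF(text=BPF_TEXT)
    # get_syscall_fnname resolves the architecture-specific symbol name
    b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="count_clone")

    time.sleep(duration_s)

    counts = b["clone_counts"]
    # Keys and values are ctypes objects; .value yields the plain integer
    ranked = sorted(counts.items(), key=lambda kv: kv[1].value, reverse=True)
    for key, value in ranked:
        print(f"PID {key.value}: {value.value} clone() calls")
```

Compared with bpf_trace_printk, nothing crosses into user space per event here; the kernel side only updates a counter, and user space reads the aggregate once at the end.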
You can also attach an eBPF program to a function’s return using attach_kretprobe, which is how you capture a return value. For example, the return value of __x64_sys_clone (the new task’s ID on success, a negative error code on failure) is only visible from a kretprobe handler.
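A return value read from a register arrives as a raw unsigned 64-bit number, so a failed call shows up as a very large value unless it is reinterpreted as signed. The pure-Python sketch below shows that reinterpretation; the function name `decode_syscall_return` and the sample values are hypothetical.

```python
import errno


def decode_syscall_return(raw: int) -> str:
    """Interpret a raw 64-bit return-register value from a syscall."""
    # Reinterpret the unsigned 64-bit value as signed two's complement
    signed = raw - (1 << 64) if raw >= (1 << 63) else raw
    if signed < 0:
        name = errno.errorcode.get(-signed, "unknown errno")
        return f"failed: {name}"
    return f"new task ID {signed}"


print(decode_syscall_return(12345))
print(decode_syscall_return((1 << 64) - errno.ENOMEM))
```

Inside the eBPF program itself, the same effect comes from storing PT_REGS_RC(ctx) in a signed type such as int or long before testing whether it is negative.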
The Counterintuitive Part
Many people assume that attaching probes to kernel functions means you’re directly modifying the kernel’s execution flow in a significant way. While kprobes do involve a small detour in execution, the eBPF verifier and the kernel’s kprobe infrastructure are designed to make this detour as minimal and safe as possible. The eBPF program itself is executed in a highly restricted environment, and a tracing program like the one above only reads kernel state and sends data to user space. Helpers that do alter behavior exist, such as bpf_override_return, but they are tightly restricted: that one works only on functions the kernel has explicitly marked as safe for error injection. In the common case you’re not rewriting the kernel’s logic; you’re simply observing it at specific points.
The next step is often to learn how to use eBPF maps to collect more structured data, rather than just printing messages to a trace pipe.