eBPF lets you run sandboxed programs in the kernel without changing kernel source or loading modules.

Here’s how we can use eBPF to trace kernel functions for performance analysis.

Let’s trace the sys_enter_read and sys_exit_read tracepoints for the read syscall. Pairing them shows how long each read call takes; the same tracepoints also expose the call’s arguments and return value.

First, ensure you have bpftrace installed. On Ubuntu/Debian:

sudo apt-get update
sudo apt-get install bpftrace

Now, let’s run the bpftrace script:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_read /pid == 12345/ { @start[tid] = nsecs; } tracepoint:syscalls:sys_exit_read /pid == 12345 && @start[tid]/ { $duration = nsecs - @start[tid]; printf("PID %d read took %lld ns\n", pid, $duration); delete(@start[tid]); }'

In this command:

  • -e indicates we’re providing the script directly.
  • tracepoint:syscalls:sys_enter_read attaches to the kernel’s sys_enter_read tracepoint, which fires just as a read syscall begins. Note that the probe type is tracepoint (singular); bpftrace rejects tracepoints.
  • /pid == 12345/ is a filter. Replace 12345 with the actual PID of the process you want to inspect. Without it, every read on the system is traced and the output quickly becomes overwhelming.
  • { @start[tid] = nsecs; } records the current nanosecond timestamp (nsecs) in a map keyed by the thread ID (tid). This marks the start time of the read call.
  • tracepoint:syscalls:sys_exit_read attaches to the sys_exit_read tracepoint, which fires when the read syscall finishes. The extra && @start[tid] condition in its filter skips reads that began before tracing started and therefore have no recorded start time.
  • { $duration = nsecs - @start[tid]; printf("PID %d read took %lld ns\n", pid, $duration); delete(@start[tid]); } calculates the duration by subtracting the start time from the current time, prints the PID and the duration in nanoseconds, and then deletes the map entry so stale timestamps don’t accumulate.
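Once a script grows past a one-liner, it is usually easier to keep it in a file. The following is a minimal sketch of the same enter/exit timing logic, under a few assumptions: the file name read_latency.bt is hypothetical, the PID is passed as the positional parameter $1 instead of being hard-coded, and the exit probe additionally prints args->ret (the syscall’s return value, which the sys_exit_read tracepoint exposes). It also checks that a start time exists and deletes the entry after use. Recent bpftrace releases also accept args.ret in place of args->ret.

```shell
# Write the bpftrace script to a file (read_latency.bt is a hypothetical name).
cat > read_latency.bt <<'EOF'
// Record the start time of each read() issued by the target process.
// $1 is the first positional argument: the PID to trace.
tracepoint:syscalls:sys_enter_read
/pid == $1/
{
    @start[tid] = nsecs;
}

// On exit, report the elapsed time and the syscall's return value.
// The @start[tid] check skips reads that began before tracing started.
tracepoint:syscalls:sys_exit_read
/pid == $1 && @start[tid]/
{
    $duration = nsecs - @start[tid];
    printf("PID %d read took %lld ns (ret=%d)\n", pid, $duration, args->ret);
    delete(@start[tid]);
}
EOF

# To run it against PID 12345 (requires root):
#   sudo bpftrace read_latency.bt 12345
```

Passing the PID as an argument means the same file can be reused for different processes without editing it.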

The real power comes from aggregating this data. Instead of just printing, we can create histograms of the durations:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_read /pid == 12345/ { @start[tid] = nsecs; } tracepoint:syscalls:sys_exit_read /pid == 12345 && @start[tid]/ { $duration = nsecs - @start[tid]; @read_durations = hist($duration); delete(@start[tid]); }'

This script builds a histogram of read call durations by assigning hist($duration) to the @read_durations map. hist() groups values into power-of-two buckets. When you press Ctrl+C, bpftrace prints the histogram, showing you the distribution of latencies in nanoseconds.
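As a script file, the histogram idea might look like the sketch below. The file name read_hist.bt and the map name @read_us are made up for illustration, and the PID is again taken from the positional parameter $1. Durations are divided by 1000 so the buckets read in microseconds, and the END probe clears the leftover @start entries so that only the histogram is printed at exit.

```shell
# Write the bpftrace script to a file (read_hist.bt is a hypothetical name).
cat > read_hist.bt <<'EOF'
tracepoint:syscalls:sys_enter_read
/pid == $1/
{
    @start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_read
/pid == $1 && @start[tid]/
{
    // hist() buckets values by powers of two; divide to get microseconds.
    @read_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}

// Drop leftover start times so only the histogram is printed at exit.
END
{
    clear(@start);
}
EOF

# To run it against PID 12345 (requires root), then press Ctrl+C:
#   sudo bpftrace read_hist.bt 12345
```

Each output row covers a range such as [4, 8) or [8, 16), so a long tail of slow reads shows up as counts in the higher buckets.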

Let’s break down the internal workings. bpftrace compiles your script into an eBPF program and asks the kernel to load it. Before accepting the program, the kernel’s eBPF verifier checks that it is safe to run, for example that it terminates and only touches memory it is allowed to, so it can’t crash the system. Once loaded and attached, the program runs in a sandboxed environment whenever the specified tracepoint is hit. Data collected by the eBPF program is passed back to the bpftrace userspace tool, through maps and event buffers, for display.

Tracepoints are static probes placed by kernel developers at specific, well-defined points in kernel execution, which makes them comparatively stable across kernel versions. kprobes (kernel probes) and uprobes (user-space probes) offer more flexibility: they can dynamically instrument almost any kernel function or user-space function, respectively. The trade-off is that they depend on function names and signatures that can change between versions, so they require more care in selecting your probes.

Consider tracing kernel functions related to network packet processing. Instead of just observing read or write, you might want to trace tcp_v4_rcv or ip_rcv. This requires understanding the kernel’s network stack.

sudo bpftrace -e 'kprobe:tcp_v4_rcv { @tcp_rcv_count = count(); }'

This simple example increments a counter (@tcp_rcv_count) every time the tcp_v4_rcv function is entered, giving you a raw count of invocations. Note that there is no /pid == .../ filter here: incoming packets are usually processed in softirq context, where pid refers to whatever task happened to be running on that CPU rather than the process that owns the socket. You can then expand this to measure latency or analyze the arguments passed to these functions.
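One way to extend the counter into a latency measurement is to pair the kprobe with a kretprobe on the same function. The sketch below makes some assumptions: the file name tcp_rcv_latency.bt is hypothetical, and because tcp_v4_rcv usually runs in softirq context (where the current thread is incidental), the start time is keyed by CPU instead of thread ID and no PID filter is applied. It also assumes calls to tcp_v4_rcv do not nest on one CPU.

```shell
# Write the bpftrace script to a file (tcp_rcv_latency.bt is a hypothetical name).
cat > tcp_rcv_latency.bt <<'EOF'
// tcp_v4_rcv usually runs in softirq context, so the timestamp is keyed by
// CPU rather than by thread ID.
kprobe:tcp_v4_rcv
{
    @start[cpu] = nsecs;
}

// On return, add the time spent inside the function to a histogram.
kretprobe:tcp_v4_rcv
/@start[cpu]/
{
    @tcp_v4_rcv_ns = hist(nsecs - @start[cpu]);
    delete(@start[cpu]);
}

// Drop leftover start times so only the histogram is printed at exit.
END
{
    clear(@start);
}
EOF

# To run it (requires root), then press Ctrl+C:
#   sudo bpftrace tcp_rcv_latency.bt
```

Because this is a kprobe, the script depends on tcp_v4_rcv existing under that name in the running kernel.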

The real magic of eBPF lies in its ability to correlate events across different parts of the system. For instance, you could trace a read syscall and, within the same eBPF program, also trace the underlying block I/O operations that read triggers in the kernel’s storage stack. This allows you to see the full path of a request from user space down to the hardware and back, pinpointing bottlenecks that might otherwise be invisible.
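A rough sketch of this kind of correlation is shown below. It marks threads that are inside read() and counts block requests issued while that mark is set, using the block:block_rq_issue tracepoint. The file name read_blockio.bt and the map names are hypothetical. The attribution is approximate by design: it only sees requests issued in the context of the reading thread, so I/O submitted by kernel worker threads is missed, and reads served from the page cache generate no block requests at all.

```shell
# Write the bpftrace script to a file (read_blockio.bt is a hypothetical name).
cat > read_blockio.bt <<'EOF'
// Mark threads of the target process ($1) that are currently inside read().
tracepoint:syscalls:sys_enter_read
/pid == $1/
{
    @in_read[tid] = 1;
}

// Count block requests issued while the current thread is inside read().
tracepoint:block:block_rq_issue
/@in_read[tid]/
{
    @block_rq_during_read[comm] = count();
}

// Remove the mark when the read() returns.
tracepoint:syscalls:sys_exit_read
/@in_read[tid]/
{
    delete(@in_read[tid]);
}

// Drop leftover marks so only the counts are printed at exit.
END
{
    clear(@in_read);
}
EOF

# To run it against PID 12345 (requires root), then press Ctrl+C:
#   sudo bpftrace read_blockio.bt 12345
```

A count of zero for a busy reader suggests its reads are being served from the page cache, not from the device.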

eBPF programs maintain state across probes using maps. These are key-value stores accessible from both the eBPF program in the kernel and the bpftrace tool in userspace. The @start[tid] map used earlier is one example: it carries a timestamp from the enter probe to the exit probe. Maps also let you build more complex aggregations, like tracking the number of active connections for a specific IP address or calculating the average data transferred per socket over time. Note that nsecs is not a map but a bpftrace built-in that returns a timestamp in nanoseconds; maps are where you store such values, along with counters or even small data structures.
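To make the aggregation idea concrete, here is a small sketch that uses three maps keyed by command name to summarize read traffic system-wide. The file name read_bytes.bt and the map names are hypothetical, and keying by comm (not by socket or IP address) is a simplification chosen to keep the example short. It relies on args->ret from sys_exit_read being the number of bytes read when positive.

```shell
# Write the bpftrace script to a file (read_bytes.bt is a hypothetical name).
cat > read_bytes.bt <<'EOF'
// Successful reads only: args->ret is the number of bytes returned.
tracepoint:syscalls:sys_exit_read
/args->ret > 0/
{
    @calls[comm] = count();
    @bytes[comm] = sum(args->ret);
    @avg_bytes[comm] = avg(args->ret);
}

// Every 5 seconds, print the per-command byte totals and start a new window.
interval:s:5
{
    print(@bytes);
    clear(@bytes);
}
EOF

# To run it (requires root), then press Ctrl+C:
#   sudo bpftrace read_bytes.bt
```

count(), sum(), and avg() are computed in the kernel, so only the summarized values cross into userspace. The @calls and @avg_bytes maps are printed once at exit, while @bytes is reported per 5-second window.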

The next step is to explore kprobes and uprobes for deeper instrumentation, and to learn how to use BCC (BPF Compiler Collection) for more complex eBPF applications.
