bpftrace can attach to almost any kernel probe point, making it a powerful tool for live kernel tracing and debugging.
Let’s see bpftrace in action. Imagine you’re troubleshooting a slow application and suspect it’s spending too much time in system calls. A single command shows which processes are issuing the most syscalls.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[comm] = count(); }'
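This command attaches to every syscall-entry tracepoint. The sys_enter_* wildcard typically matches a few hundred probes, with the exact number depending on the kernel. Each time one fires, the command increments a counter keyed by the process name (comm). Let it run for a few seconds, then press Ctrl+C. bpftrace prints the map on exit, sorted by count in ascending order, so the busiest processes appear last. Illustrative output:
Attaching 330 probes...
^C
...
@[bash]: 8765
@[mysqld]: 12345
@[nginx]: 15890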
This gives you an immediate overview of what your system is doing at the kernel level.
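Counting calls shows which processes are busy, not where the time goes. The following is a minimal sketch of timing system calls, assuming the raw_syscalls:sys_enter and raw_syscalls:sys_exit tracepoints are available on your kernel. The map names (@start, @calls, @total_ns) are illustrative choices, not anything bpftrace requires.

```bpftrace
// Record the entry time of each system call, keyed by thread ID.
tracepoint:raw_syscalls:sys_enter
{
    @start[tid] = nsecs;
}

// On exit, charge the elapsed time to the calling process name.
// The filter skips exits whose entry happened before tracing began.
tracepoint:raw_syscalls:sys_exit
/@start[tid]/
{
    $elapsed = nsecs - @start[tid];
    @calls[comm] = count();
    @total_ns[comm] = sum($elapsed);
    delete(@start[tid]);
}

// Drop the bookkeeping map so it is not printed on exit.
END
{
    clear(@start);
}
```

Saved to a file (for example syscall_time.bt, a hypothetical name), this can be run with sudo bpftrace syscall_time.bt and stopped with Ctrl+C. The elapsed time includes time spent blocked inside a call, such as waiting on a socket, so a large total does not necessarily mean CPU time was consumed.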
The core idea behind bpftrace is its ability to leverage eBPF (extended Berkeley Packet Filter) programs, which run safely within the Linux kernel. bpftrace acts as a high-level scripting language that compiles down to eBPF bytecode. This means you’re not just observing, you’re running small, safe programs inside the kernel to gather specific data.
The power comes from the vast array of probe points available:
- Tracepoints: Static kernel markers for specific events (syscall entry/exit, scheduler events, etc.).
- Kprobes/Kretprobes: Dynamic probes that can attach to almost any kernel function.
- Uprobes/Uretprobes: Dynamic probes for user-space functions.
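- Hardware and software events: CPU performance counters (cache misses, branch mispredictions) and kernel software events (page faults, context switches).
- Timed probes (profile, interval): Fire at a fixed frequency or period, for sampling and periodic output.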
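As a rough illustration of how several of these probe types look side by side in one script, here is a sketch. It assumes the sched:sched_process_exec tracepoint and the vfs_write kernel function exist on your kernel, and the map names are made up for the example. A uprobe is left out because it needs the path to a specific binary on your system.

```bpftrace
// Tracepoint: count program executions by process name.
tracepoint:sched:sched_process_exec
{
    @execs[comm] = count();
}

// Kprobe: count entries into the kernel function vfs_write per process.
kprobe:vfs_write
{
    @vfs_writes[comm] = count();
}

// Kretprobe: count vfs_write calls that returned an error (negative value).
kretprobe:vfs_write
/(int64)retval < 0/
{
    @vfs_write_errors[comm] = count();
}

// Timed probe: sample each CPU 99 times per second to see who is running.
profile:hz:99
{
    @on_cpu_samples[comm] = count();
}
```

Each probe type uses the same action syntax; only the probe specifier and the available built-ins (args for tracepoints, retval for return probes) differ.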
Let’s say you want to see how many times a specific function, vfs_read, is called and how long it takes on average.
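sudo bpftrace -e 'kprobe:vfs_read /pid == 1234/ { @start[tid] = nsecs; } kretprobe:vfs_read /@start[tid]/ { @calls = count(); @avg_ns = avg(nsecs - @start[tid]); delete(@start[tid]); } interval:s:5 { print(@calls); print(@avg_ns); }'
This attaches to the entry and return of vfs_read. On entry, for the process with PID 1234, it stores a timestamp keyed by thread ID. On return, it counts the call, feeds the elapsed nanoseconds into a running average, and deletes the stored timestamp. Every 5 seconds it prints the call count and the average latency; both are cumulative since tracing started.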
The syntax is designed for expressiveness and conciseness. You define probes, actions to take when probes fire (often involving aggregations like count(), sum(), avg(), hist()), and optional filters. Variables like comm (command name), pid (process ID), tid (thread ID), and nsecs (nanoseconds since boot) are built-in.
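To make the probe, filter, and action parts concrete, here is a small sketch that uses several aggregations at once. It assumes the syscalls:sys_exit_read tracepoint, whose ret field holds the return value of read(2). The map names are illustrative.

```bpftrace
// Probe: fires when read(2) returns.
// Filter: keep only successful, non-empty reads.
tracepoint:syscalls:sys_exit_read
/args->ret > 0/
{
    @bytes[comm] = sum(args->ret);     // total bytes read per process
    @avg_size[comm] = avg(args->ret);  // mean read size per process
    @size_dist = hist(args->ret);      // power-of-two histogram of read sizes
}

// Stop after ten seconds; bpftrace prints every map on exit.
interval:s:10
{
    exit();
}
```

The filter between the slashes is evaluated in the kernel before the action runs, so events that fail it cost very little.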
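Consider tracing dropped network packets. Tracepoint names and fields vary between kernel versions, so check what your kernel offers with bpftrace -lv before relying on one. On most kernels, skb:kfree_skb fires when the kernel frees a packet buffer without consuming it normally, which usually indicates a drop.
sudo bpftrace -e 'tracepoint:skb:kfree_skb { @[ksym(args->location)] = count(); }'
This command attaches to the kfree_skb tracepoint and aggregates the count of drops by the kernel function that freed the packet. The args->location part accesses the caller address passed by the tracepoint, and ksym() resolves it to a symbol name. The tracepoint does not carry an interface name, so attributing drops to a specific interface takes extra work, such as following the packet's skbaddr field.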
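To see why packets are dropped rather than only how many, one option is to key the count by kernel stack trace. This is a sketch that assumes the skb:kfree_skb tracepoint is available on your kernel; the map name @drops and the stack depth of 6 are arbitrary choices.

```bpftrace
// Count packet-buffer frees on the drop path, grouped by the
// top six frames of the kernel stack that led to the free.
tracepoint:skb:kfree_skb
{
    @drops[kstack(6)] = count();
}

// Stop after ten seconds; the stacks are printed on exit,
// least frequent first.
interval:s:10
{
    exit();
}
```

Not every kfree_skb event is a fault: some protocol paths free buffers this way in normal operation, so treat the most frequent stacks as leads to investigate.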
What often surprises people is how much data you can gather at low cost. eBPF programs are checked by the kernel verifier before they load, compiled to native code, and aggregate data in kernel maps instead of copying every event to user space. Overhead is paid per event, though, so it scales with how often a probe fires: tracing a function called a few thousand times per second is usually negligible, while probing every syscall or a hot scheduler path on a busy machine can be measurable. With sensible probe choices and filters, this allows low-impact tracing even on production systems, without the kernel rebuilds, custom modules, or heavy per-event logging that older tracing approaches often required.
What’s next is exploring how to correlate kernel events with user-space application behavior using uprobes and understanding how to use bpftrace’s scripting language for more complex data analysis and visualization.