eBPF traffic shaping lets you sculpt network packet flow at the kernel level, not just at the application or firewall layer.

Imagine you’ve got a critical microservice that needs guaranteed bandwidth, but it’s constantly being starved by less important background jobs. You could throttle the background jobs at the application level, but that’s a game of whack-a-mole. Or you could prioritize the critical service, but that often means complex firewall rules that are hard to manage. eBPF traffic shaping lets you attach per-packet rules to the kernel’s traffic control (tc) hooks, which run inside the network stack right next to the device queue, giving you fine-grained control.

Let’s see it in action. Suppose we want to limit the egress bandwidth of a specific process, say a bulk data transfer tool running as user datauser with PID 12345, to 10 megabits per second.
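To get a feel for the numbers, the sketch below converts a rate in bits per second into the byte-based quantities a per-packet limiter works with. The function names are hypothetical and the 1500-byte packet size is an assumption (a typical full-size Ethernet payload); nothing here is a kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Convert a rate in bits per second to bytes per second. */
static uint64_t rate_bytes_per_sec(uint64_t bits_per_sec)
{
    return bits_per_sec / 8;
}

/* Nanoseconds of sending budget that one packet of pkt_bytes uses up
 * at the given byte rate. */
static uint64_t packet_cost_ns(uint64_t pkt_bytes, uint64_t bytes_per_sec)
{
    assert(bytes_per_sec > 0); /* a zero rate would divide by zero */
    return pkt_bytes * NSEC_PER_SEC / bytes_per_sec;
}
```

For the 10 megabit target, rate_bytes_per_sec(10000000) gives 1250000 bytes per second, and packet_cost_ns(1500, 1250000) gives 1200000 ns, so sustained throughput tops out at roughly 833 full-size packets per second.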

First, we need to load an eBPF program. This program will attach to the tc (traffic control) ingress or egress hook. For egress shaping, we’ll use the cls_bpf classifier.

Here’s a simplified BPF program written in C, which we’ll compile to BPF bytecode:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define TARGET_PID         12345         // process to limit
#define RATE_BYTES_PER_SEC 1250000ULL    // 10 Mbit/s = 1.25 MB/s
#define BURST_BYTES        1000000ULL    // bucket depth: 1 MB
#define NSEC_PER_SEC       1000000000ULL

// Token bucket state. The kernel has no built-in token bucket map type or
// helper, so this sketch keeps the bucket by hand in a one-element array map.
struct bucket {
    struct bpf_spin_lock lock; // serializes updates across CPUs
    __u64 tokens;              // bytes currently available
    __u64 last_ns;             // time of the last refill (0 = never used)
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct bucket);
} rate_limit_map SEC(".maps");

SEC("classifier")
int tc_egress_shaper(struct __sk_buff *skb) {
    // The upper 32 bits hold the thread group ID (the user-space PID).
    // Caveat: a tc hook can run in softirq context, so this is the task on
    // the CPU when the hook fires, not necessarily the sender. Older kernels
    // may also reject this helper in tc programs.
    __u32 pid = bpf_get_current_pid_tgid() >> 32;

    if (pid != TARGET_PID) {
        return TC_ACT_OK; // Allow packets from other PIDs
    }

    __u32 key = 0;
    struct bucket *b = bpf_map_lookup_elem(&rate_limit_map, &key);
    if (!b) {
        return TC_ACT_OK; // the verifier requires this NULL check
    }

    // Helpers cannot be called while the spin lock is held, so read the
    // clock and the packet length first.
    __u64 now = bpf_ktime_get_ns();
    __u64 len = skb->len;
    int allow = 0;

    bpf_spin_lock(&b->lock);
    if (b->last_ns == 0) {
        // First packet: start with a full bucket.
        b->tokens = BURST_BYTES;
        b->last_ns = now;
    } else if (now > b->last_ns) {
        __u64 elapsed = now - b->last_ns;
        if (elapsed > NSEC_PER_SEC) {
            elapsed = NSEC_PER_SEC; // one second already overfills the bucket
        }
        __u64 refill = elapsed * RATE_BYTES_PER_SEC / NSEC_PER_SEC;
        if (refill > 0) {
            b->tokens += refill;
            if (b->tokens > BURST_BYTES) {
                b->tokens = BURST_BYTES;
            }
            b->last_ns = now;
        }
    }
    // Each packet costs its size in bytes.
    if (b->tokens >= len) {
        b->tokens -= len;
        allow = 1;
    }
    bpf_spin_unlock(&b->lock);

    // Enough tokens: pass. Otherwise drop (or hand the packet to another qdisc).
    return allow ? TC_ACT_OK : TC_ACT_SHOT;
}

char LICENSE[] SEC("license") = "GPL";

To apply this, you’d compile it using clang and then load it using tc:

  1. Compile the BPF program:

    # -g emits BTF, which libbpf-based loaders need for SEC(".maps") map definitions
    clang -O2 -g -target bpf -c tc_shaper.c -o tc_shaper.o
    
  2. Load the classifier onto the network interface: Let’s say your interface is eth0. Egress filters attach through the clsact qdisc (queueing discipline), which exposes an ingress and an egress hook without replacing the interface’s root qdisc.

    First, make sure eth0 has a clsact qdisc, then attach the compiled object to its egress hook.

    # Create the clsact qdisc (skip this if eth0 already has one)
    tc qdisc add dev eth0 clsact
    # Attach the program from the 'classifier' section to the egress hook of eth0.
    # 'direct-action' makes tc treat the program's return value
    # (TC_ACT_OK, TC_ACT_SHOT) as the verdict instead of as a class ID.
    tc filter add dev eth0 egress protocol ip prio 1 bpf direct-action obj tc_shaper.o section classifier

    Note: Matching on the PID from inside a tc program is best-effort, because the hook may run in softirq context on behalf of a different task. Setups that need reliable attribution usually match on the sender’s cgroup or a socket mark instead. This example focuses on the attachment mechanism.

The core idea here is that the eBPF program intercepts packets at the kernel’s tc ingress or egress hook. It inspects packet metadata and kernel context (like the current process ID), checks them against rate-limiting state kept in BPF maps (like a token bucket), and then decides whether to let the packet pass (TC_ACT_OK), drop it (TC_ACT_SHOT), or redirect it (TC_ACT_REDIRECT).

The surprising truth is that eBPF doesn’t just observe network traffic; it actively changes how packets are processed within the kernel’s networking stack, without requiring a custom kernel module or a patched kernel. This allows for dynamic, application-aware network control.

The mental model is that you’re no longer just configuring firewalls or setting global bandwidth limits. You’re writing tiny programs that run directly on network events. These programs can inspect packet contents, query kernel state (like process IDs, socket information), and make immediate decisions. These decisions are then executed by the kernel’s networking subsystem. The tc subsystem acts as the framework for attaching these eBPF programs to specific points in the network path (like ingress or egress on an interface).

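The token bucket is the heart of the limiter. The kernel has no dedicated token bucket map type or helper for BPF programs, so the program keeps the bucket’s state (the current token count and the time of the last refill) in an ordinary BPF map, such as a one-element array. When a packet arrives, the program reads the clock with bpf_ktime_get_ns, adds the tokens earned since the last refill up to the bucket’s capacity, and then tries to take a number of tokens equal to the packet’s size in bytes. If there are enough tokens, it subtracts them and lets the packet proceed (TC_ACT_OK). If not, it drops the packet (TC_ACT_SHOT). Because the hook can run on several CPUs at once, updates to the shared state need a bpf_spin_lock or atomic operations. Dropping excess packets is strictly policing rather than shaping, but TCP senders respond to the loss by slowing down.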
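The refill-and-consume cycle can be tried out in plain userspace C before moving it into a BPF program. The sketch below is a simulation under stated assumptions: the struct and function names are hypothetical, time is passed in explicitly as nanoseconds instead of being read from a kernel clock, and there is no locking because it runs single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Illustrative token bucket; all names here are hypothetical. */
struct token_bucket {
    uint64_t rate_bytes_per_sec; /* refill rate */
    uint64_t burst_bytes;        /* bucket capacity */
    uint64_t tokens;             /* bytes currently available */
    uint64_t last_ns;            /* timestamp of the last refill */
};

static void tb_init(struct token_bucket *tb, uint64_t rate,
                    uint64_t burst, uint64_t now_ns)
{
    assert(rate > 0 && burst > 0);
    tb->rate_bytes_per_sec = rate;
    tb->burst_bytes = burst;
    tb->tokens = burst; /* start full so an initial burst is allowed */
    tb->last_ns = now_ns;
}

/* Returns true if the packet may pass, false if it should be dropped. */
static bool tb_consume(struct token_bucket *tb, uint64_t pkt_bytes,
                       uint64_t now_ns)
{
    if (now_ns > tb->last_ns) {
        uint64_t elapsed = now_ns - tb->last_ns;
        /* Whole seconds and the remainder are scaled separately so the
         * multiplication stays within 64 bits for realistic rates. */
        uint64_t refill =
            elapsed / NSEC_PER_SEC * tb->rate_bytes_per_sec +
            elapsed % NSEC_PER_SEC * tb->rate_bytes_per_sec / NSEC_PER_SEC;
        if (refill > 0) {
            /* Add tokens, clamped to the bucket capacity. */
            if (refill > tb->burst_bytes - tb->tokens)
                tb->tokens = tb->burst_bytes;
            else
                tb->tokens += refill;
            /* Advance the clock only when tokens were added, so very
             * short gaps accumulate instead of rounding to zero. */
            tb->last_ns = now_ns;
        }
    }
    if (tb->tokens < pkt_bytes)
        return false;
    tb->tokens -= pkt_bytes;
    return true;
}
```

With a rate of 1250000 bytes per second and a deliberately small 3000-byte bucket, two 1500-byte packets sent at time zero pass and a third is refused. After 1.2 ms the bucket has earned exactly 1500 bytes again, so one more packet goes through.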

What most people miss is how deeply integrated eBPF becomes with the kernel’s network scheduler. It’s not just an external agent; it’s a programmable component of the scheduler itself. You can, for instance, create multiple eBPF programs, each attached to a different class in a hierarchical queueing system, and have them interact. You could have one eBPF program that identifies high-priority traffic (e.g., SSH) and another that limits low-priority traffic, with both programs making decisions that the kernel’s tc scheduler then orchestrates.
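As a rough illustration of the "identify high-priority traffic" half, the sketch below walks an Ethernet, IPv4, and TCP header and flags SSH by port number. It is ordinary userspace C operating on a byte buffer, not a loadable BPF program; in a tc program the same bounds checks would be written against skb->data and skb->data_end. The names are hypothetical, and treating port 22 as SSH is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN     14
#define ETHERTYPE_IPV4  0x0800
#define IPPROTO_TCP_NUM 6
#define SSH_PORT        22

enum traffic_class { CLASS_LOW = 0, CLASS_HIGH = 1 };

/* Read a big-endian 16-bit value from a byte buffer. */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* IPv4 TCP traffic to or from port 22 is high priority; everything else
 * is low. Every header is bounds-checked before it is read, mirroring
 * the checks the BPF verifier insists on. */
static enum traffic_class classify(const uint8_t *data, size_t len)
{
    assert(data != NULL);

    /* Need a full Ethernet header plus a minimal 20-byte IPv4 header. */
    if (len < ETH_HDR_LEN + 20)
        return CLASS_LOW;
    if (read_be16(data + 12) != ETHERTYPE_IPV4)
        return CLASS_LOW;

    const uint8_t *ip = data + ETH_HDR_LEN;
    if ((ip[0] >> 4) != 4)
        return CLASS_LOW;
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4; /* IPv4 header length in bytes */
    if (ihl < 20 || ip[9] != IPPROTO_TCP_NUM)
        return CLASS_LOW;
    /* Non-first fragments carry no TCP header. */
    if (read_be16(ip + 6) & 0x1fff)
        return CLASS_LOW;

    /* Need at least the TCP source and destination ports (4 bytes). */
    if (len < ETH_HDR_LEN + ihl + 4)
        return CLASS_LOW;
    const uint8_t *tcp = ip + ihl;
    uint16_t sport = read_be16(tcp);
    uint16_t dport = read_be16(tcp + 2);
    return (sport == SSH_PORT || dport == SSH_PORT) ? CLASS_HIGH : CLASS_LOW;
}
```

A limiter could consult a classifier like this first and only charge the token bucket for packets that come back as CLASS_LOW, leaving SSH untouched.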

The next frontier is using eBPF for more sophisticated network security policies, like detecting and mitigating specific types of application-layer attacks by analyzing packet flows in real-time.
