The most surprising thing about eBPF map-in-map is that it allows you to create truly dynamic, per-CPU data structures without writing a single line of kernel C, and it’s significantly more performant than traditional approaches.
Imagine we’re building a high-performance network packet processing pipeline. We want to collect statistics per CPU core, specifically the number of packets processed and bytes transmitted. A naive approach might involve a global array of counters, protected by a spinlock. This quickly becomes a bottleneck as multiple CPUs contend for the lock.
eBPF offers a more elegant solution using map-in-map. The outer map is of type BPF_MAP_TYPE_PERCPU_ARRAY, and its values are themselves maps, specifically BPF_MAP_TYPE_HASH maps. This nested structure allows each CPU to have its own hash map, keyed by something like a flow ID or a port number, and the inner hash map stores the per-CPU counters (packets, bytes).
Let’s see this in action. Here’s a simplified eBPF program that increments counters based on incoming network packets.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
// Outer map: per-CPU array where each element is a hash map
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 1); // Only one inner map needed per CPU
__type(key, __u32); // Key for the outer map (always 0 for percpu-array)
__type(value, struct bpf_map *); // Value is a pointer to the inner map
} outer_map SEC(".maps");
// Inner map definition (defined dynamically)
// struct {
// __uint(type, BPF_MAP_TYPE_HASH);
// __uint(max_entries, 1024);
// __type(key, __u32); // e.g., flow ID or port
// __type(value, struct stats);
// } inner_map SEC(".maps");
struct stats {
__u64 packets;
__u64 bytes;
};
// Helper function to get the inner map for the current CPU
static inline struct bpf_map *get_inner_map() {
__u32 key = 0; // For BPF_MAP_TYPE_PERCPU_ARRAY, the key is always 0
return bpf_map_lookup_elem(&outer_map, &key);
}
SEC("xdp_pass")
int xdp_collect_stats(struct xdp_md *ctx) {
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end) {
return XDP_PASS;
}
// For simplicity, let's use the destination port as the key
// In a real scenario, you'd extract a flow identifier
struct iphdr *iph = data + sizeof(struct ethhdr);
if ((void *)(iph + 1) > data_end) {
return XDP_PASS;
}
__u32 key = iph->dport; // Using destination port as key
struct bpf_map *inner_map = get_inner_map();
if (!inner_map) {
// Inner map not created yet, create it
// This part is typically handled by userspace when loading the eBPF program
return XDP_PASS;
}
struct stats *stats_entry = bpf_map_lookup_elem(inner_map, &key);
if (stats_entry) {
stats_entry->packets++;
stats_entry->bytes += ctx->len;
} else {
struct stats new_stats = {
.packets = 1,
.bytes = ctx->len,
};
int ret = bpf_map_update_elem(inner_map, &key, &new_stats, BPF_ANY);
if (ret != 0) {
// Handle error, though unlikely if map has space
}
}
return XDP_PASS;
}
char _license[] SEC("license") = "GPL";
The key to map-in-map is how the inner map is created and associated with the outer map. In a typical userspace loader (e.g., using libbpf), you would first create the inner BPF_MAP_TYPE_HASH map. Then, you’d create the outer BPF_MAP_TYPE_PERCPU_ARRAY map. Finally, you’d iterate through each CPU, and for each CPU, you’d update the outer map at index 0 (for BPF_MAP_TYPE_PERCPU_ARRAY) with a pointer to the same inner hash map. This effectively links the single hash map to every CPU’s per-CPU array entry.
from bcc import BPF
import os
# Define the BPF program in C
bpf_code = r"""
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
struct stats {
__u64 packets;
__u64 bytes;
};
// Outer map: per-CPU array where each element is a hash map
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 1);
__type(key, __u32);
__type(value, struct bpf_map *);
} outer_map SEC(".maps");
// Inner map definition (created by userspace)
// struct {
// __uint(type, BPF_MAP_TYPE_HASH);
// __uint(max_entries, 1024);
// __type(key, __u32); // e.g., flow ID or port
// __type(value, struct stats);
// } inner_map SEC(".maps");
// Helper function to get the inner map for the current CPU
static inline struct bpf_map *get_inner_map() {
__u32 key = 0;
return bpf_map_lookup_elem(&outer_map, &key);
}
SEC("xdp_pass")
int xdp_collect_stats(struct xdp_md *ctx) {
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end) {
return XDP_PASS;
}
struct iphdr *iph = data + sizeof(struct ethhdr);
if ((void *)(iph + 1) > data_end) {
return XDP_PASS;
}
__u32 key = iph->dport; // Using destination port as key
struct bpf_map *inner_map = get_inner_map();
if (!inner_map) {
return XDP_PASS; // Inner map not ready
}
struct stats *stats_entry = bpf_map_lookup_elem(inner_map, &key);
if (stats_entry) {
stats_entry->packets++;
stats_entry->bytes += ctx->len;
} else {
struct stats new_stats = {
.packets = 1,
.bytes = ctx->len,
};
// BPF_ANY means create if not exists, update if exists
int ret = bpf_map_update_elem(inner_map, &key, &new_stats, BPF_ANY);
if (ret != 0) {
// Handle error (e.g., map is full)
}
}
return XDP_PASS;
}
char _license[] SEC("license") = "GPL";
"""
# Load BPF program
b = BPF(text=bpf_code)
# Create the inner hash map
inner_map_name = "inner_map"
inner_map_type = BPF_MAP_TYPE_HASH
inner_map_key_size = 4 # __u32
inner_map_value_size = 16 # struct stats (2 * __u64)
inner_map_max_entries = 1024
# Use bpf_map_create to create the map dynamically
# Note: This requires sufficient privileges and kernel support for dynamic map creation
try:
inner_map_fd = b.bpf_map_create(
name=inner_map_name,
map_type=inner_map_type,
key_size=inner_map_key_size,
value_size=inner_map_value_size,
max_entries=inner_map_max_entries,
map_flags=0
)
if inner_map_fd < 0:
raise OSError(f"Failed to create inner map '{inner_map_name}': {os.strerror(-inner_map_fd)}")
print(f"Created inner map '{inner_map_name}' with fd: {inner_map_fd}")
# Get the outer map
outer_map = b.get_map("outer_map")
# Link the inner map to the outer map for each CPU
# For BPF_MAP_TYPE_PERCPU_ARRAY, the outer map's value is a pointer to another map
# We need to update the outer map's entry for each CPU.
# The key for PERCPU_ARRAY is typically 0.
# The value for PERCPU_ARRAY is a pointer to the inner map.
# In libbpf, this linking is often done via bpf_map__set_inner_map or similar
# For bcc, we can simulate this by updating the outer map with the inner map's FD
# Note: The eBPF program expects a `struct bpf_map *`, which is an internal kernel structure.
# Directly passing an FD might not work as expected or might be an internal detail.
# A more robust approach for libbpf is to use `bpf_map_update_elem` with `BPF_ANY` and
# the inner map's file descriptor if the kernel supports it for map-in-map.
# However, the C code expects `struct bpf_map *`. BCC's `bpf_map_create` returns an FD.
# Let's try updating the outer map with the inner map's FD, assuming the kernel
# handles this conversion or the C code is adjusted.
# A more direct way using libbpf would be to pass the inner map object.
# For demonstration with bcc, we'll attempt to update the outer map.
# The correct way to link maps in libbpf is often implicit when loading
# or through specific helper functions. BCC's dynamic map creation
# and linking can be a bit more involved.
# The eBPF program expects a `struct bpf_map *`. Passing an FD might require
# a specific kernel mechanism or interpretation.
# A common pattern is to have userspace *create* the inner map and then
# *assign* it to the outer map's entry.
# Let's re-examine the C code: `bpf_map_lookup_elem(&outer_map, &key)` returns `struct bpf_map *`.
# This means the outer map's value *is* a pointer to the inner map.
# BCC's `bpf_map_create` returns an FD. The way to link them is to update the outer map
# with the inner map's FD. The kernel will then interpret this FD as a pointer to the map.
key_outer = 0 # Key for PERCPU_ARRAY is 0
# We need to pass the FD of the inner map to the outer map.
# The C code expects `struct bpf_map *`. BCC might handle this FD -> pointer conversion.
outer_map.update(key_outer, inner_map_fd)
print(f"Linked inner map FD {inner_map_fd} to outer map at key {key_outer}")
# Attach the XDP program to an interface (e.g., 'eth0')
# Replace 'eth0' with your actual network interface
interface = "eth0" # Or any interface you want to test with
print(f"Attaching XDP program to interface: {interface}")
b.attach_xdp(interface, b.get_func("xdp_collect_stats"))
print("XDP program attached. Sending traffic to interface...")
# Keep the script running to collect stats
# In a real application, you'd have a loop to periodically read stats
import time
try:
while True:
time.sleep(5)
print("\n--- Stats ---")
# Iterate over the inner map to retrieve stats per port
# We need to get the inner map object itself to iterate its contents
# Since the inner map is shared, we can access it via its FD
# bcc provides ways to access maps by FD, but it's often easier
# to get the map object from the BPF instance if it was defined in C.
# Since we created it dynamically, we need to access it via its FD.
# To iterate the inner map, we need to use the `bpf_map_get_next_key` and `bpf_map_lookup_elem`
# helpers from the kernel, or rely on libbpf/bcc abstractions.
# bcc's `BPF.get_table` can get a map object if defined in C.
# For dynamically created maps, we can use the FD directly with lower-level calls.
# Let's try to get the inner map object from the BPF instance if BCC registered it.
# If not, we'd need a more manual iteration using the FD.
# The C code uses `bpf_map_lookup_elem(inner_map, &key)`.
# We need to read from this `inner_map`.
# Let's assume `b.get_map("inner_map")` works if BCC registered it.
# If not, manual iteration using `bpf_map_get_next_key` is needed.
# For simplicity here, we'll try to read the outer map, which holds the inner map pointer.
# Then we'll use that pointer (FD) to read from the inner map.
# Reading from the outer map itself is not useful for stats, it holds map pointers.
# We need to read from the inner map.
# The inner map is represented by `inner_map_fd`.
# Iterating a hash map can be done by getting keys and then looking up values.
# BCC doesn't directly expose map iteration by FD in a high-level way like `table.items()`.
# We need to simulate `bpf_map_get_next_key` and `bpf_map_lookup_elem`.
# A simpler approach for demonstration: read the *total* stats from all CPUs
# for a specific key (e.g., port 80).
# To get per-CPU stats, we'd iterate `outer_map` entries (though it's just one here)
# and then iterate the inner map for each CPU.
# To get the actual stats, we need to iterate the inner map.
# Let's retrieve the inner map object from the BPF instance if it was named.
# If we created it via `bpf_map_create`, it might not be directly accessible
# via `b.get_map`.
# Let's simulate reading from the inner map using its FD.
# We'll iterate through keys.
key_ptr = b.get_table(inner_map_name).KeyPtr() # Get pointer to store keys
value_ptr = b.get_table(inner_map_name).LeafPtr() # Get pointer to store values
# Get the first key
next_key_ptr = b.call("bpf_map_get_next_key", inner_map_fd, None, key_ptr)
if next_key_ptr.value == -1: # No elements
print("No stats collected yet.")
else:
keys_found = 0
while True:
# Lookup the value for the current key
ret = b.call("bpf_map_lookup_elem", inner_map_fd, next_key_ptr, value_ptr)
if ret.value == 0: # Success
key_val = key_ptr.unpack(b'I')[0] # Unpack __u32 key
packets, bytes_val = value_ptr.unpack(b'QQ') # Unpack __u64, __u64
print(f" Port {key_val}: Packets={packets}, Bytes={bytes_val}")
keys_found += 1
else:
print(f" Error looking up key: {os.strerror(-ret.value)}")
# Get the next key
next_key_ptr = b.call("bpf_map_get_next_key", inner_map_fd, next_key_ptr, key_ptr)
if next_key_ptr.value == -1: # No more keys
break
if keys_found > 100: # Safety break for potentially large maps
print(" ... (truncated)")
break
if keys_found == 0:
print("No stats collected yet.")
except KeyboardInterrupt:
print("\nDetaching XDP program...")
b.detach_xdp(interface)
print("XDP program detached.")
# Clean up the dynamically created map
print(f"Closing inner map fd: {inner_map_fd}")
os.close(inner_map_fd)
except OSError as e:
print(f"Error: {e}")
print("Ensure you have sufficient privileges (run as root) and the kernel supports dynamic map creation.")
print("Also, ensure the specified interface exists and XDP is supported.")
The core idea is that the outer BPF_MAP_TYPE_PERCPU_ARRAY’s elements are pointers to other maps. When you update an entry in the outer map, you’re essentially telling that CPU’s "slot" in the array to point to a specific inner map. Since all CPUs’ slots in the outer map can be made to point to the same inner map, you achieve a shared, dynamic data structure that each CPU can access independently and efficiently.
The performance gain comes from avoiding locks. When a CPU needs to update a counter for a specific port, it looks up the port in its own conceptual view of the inner map. Since each CPU is operating on its own set of data within the shared hash map, there’s no contention. The BPF_MAP_TYPE_PERCPU_ARRAY itself also has per-CPU internal storage, meaning accessing its elements (the pointers to the inner maps) is lock-free for each CPU.
The true power of this technique lies in its flexibility. You can use map-in-map to implement complex per-CPU data structures like trees, linked lists, or even other nested maps, all managed dynamically from userspace without kernel recompilation. The outer map can be of various types, not just PERCPU_ARRAY; it could be a HASH map where values are other maps, enabling complex hierarchical data structures.
One thing most people don’t realize is how the "pointer" to the inner map is actually represented and passed. When you update the outer map (e.g., BPF_MAP_TYPE_PERCPU_ARRAY) with the file descriptor of the inner map, the kernel internally translates this file descriptor into a kernel pointer to the map’s data structure. The eBPF program then receives this kernel pointer and can use it to perform lookups and updates on the inner map. This translation is a crucial part of how map-in-map functions seamlessly.
This pattern is foundational for building sophisticated, high-performance eBPF applications that require fine-grained, per-CPU state management. The next step you’ll likely encounter is managing the lifecycle of these dynamically created inner maps, ensuring they are cleaned up properly when no longer needed to avoid resource leaks.