Elastic Fabric Adapter (EFA) is a network interface that dramatically improves the performance of High-Performance Computing (HPC) and machine learning applications running on Amazon EC2.

Here’s a real-world example of EFA in action. Imagine a cluster of c5n.metal instances, each equipped with EFA. We’re running a large-scale fluid dynamics simulation using MPI.

# On the head node: select the EFA libfabric provider
export FI_PROVIDER=efa

# Still on the head node: launch the job once; mpirun starts the ranks on every host in the hostfile
# (128 ranks at 2 per node assumes 64 hosts listed in ~/mpi_hosts)
mpirun \
    --hostfile ~/mpi_hosts \
    --np 128 \
    --bind-to core \
    --map-by ppr:2:node \
    -x FI_PROVIDER \
    --allow-run-as-root \
    /path/to/your/hpc_application

In this scenario, EFA allows inter-node communication to bypass the operating system kernel and send data directly from the application’s memory buffer to the network, and vice-versa. This significantly reduces latency and increases throughput, which are critical for tightly coupled HPC workloads like this simulation.
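
Before launching a job like this, it can be worth confirming that libfabric actually sees the EFA device on each node. The following is a minimal sketch, assuming the `fi_info` utility that ships with libfabric is installed and on the PATH; `has_efa_provider` is a hypothetical helper name introduced here for illustration.

```shell
#!/usr/bin/env bash
# has_efa_provider: hypothetical helper that reads fi_info output on stdin
# and succeeds if at least one entry is served by the "efa" provider.
has_efa_provider() {
    grep -q '^provider: efa'
}

# Only query libfabric when fi_info is actually available on this host.
if command -v fi_info >/dev/null 2>&1; then
    if fi_info -p efa -t FI_EP_RDM 2>/dev/null | has_efa_provider; then
        echo "EFA provider is available"
    else
        echo "EFA provider not found on this host" >&2
    fi
fi
```

Running this on every host in the hostfile (for example over SSH) should catch a node where the EFA interface or software stack is missing before the MPI job starts.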

The core problem EFA solves is the network bottleneck inherent in traditional TCP/IP-based communication for large-scale parallel processing. Standard network stacks add overhead through system calls, context switches, and data copying. EFA, on the other hand, provides a low-latency, high-bandwidth fabric by exposing an OS-bypass path to the network hardware through the libfabric API. MPI implementations such as Open MPI and Intel MPI run on top of libfabric, so an MPI application can use EFA without changes to its source code.

Internally, EFA leverages custom AWS silicon and specialized network hardware. When an MPI send operation is initiated, the libfabric EFA provider posts the request to the EFA device directly from user space, so the OS kernel is not involved in moving each packet. On the receive side, the device places incoming data directly into memory buffers that the application has registered in advance, avoiding the extra copies through kernel socket buffers that a TCP receive would need. The MPI library then learns that a transfer has finished by checking a completion queue in user space rather than making a system call per message.

The key levers you control with EFA are:

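  • Instance Type: EFA is supported on specific EC2 instance types optimized for compute-intensive workloads, including parts of the c5n, m5n, and r5n families and various HPC-optimized instances. The 'n' suffix marks network-optimized variants, but it does not by itself guarantee EFA support: within a family, EFA is typically limited to the larger sizes (for example, c5n.18xlarge and c5n.metal), so confirm support for the exact instance type before launching.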
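  • EFA Placement: When launching EC2 instances, you must request an EFA as the network interface. This is done by setting InterfaceType to efa in the NetworkInterfaces parameter of the RunInstances API call, or through the EC2 console when creating instances. The security group you specify in Groups must allow all inbound and outbound traffic to and from itself; rules limited to individual MPI ports are not sufficient for EFA traffic. For the lowest latency, launch the instances in the same Availability Zone, typically in a cluster placement group.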
  • EFA Software (for certain OS): Instances need the EFA kernel driver plus the user-space libfabric stack. On Amazon Linux 2, the driver is typically available out-of-the-box. On other distributions, or when you also need libfabric and a matching Open MPI build, you install the aws-efa-installer package.
  • MPI/Libfabric Configuration: Your HPC application needs to be compiled or configured to use an MPI implementation that supports EFA, and that implementation must be told to use the EFA libfabric provider. This is often done by setting the environment variable FI_PROVIDER=efa.
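
The first two levers can be sketched with the AWS CLI. This is an illustrative outline rather than a complete provisioning script: it assumes the AWS CLI is installed and configured, every resource ID (AMI, security group, subnet, placement group) is a placeholder supplied by the caller, and the function names are hypothetical helpers introduced here.

```shell
#!/usr/bin/env bash
# All resource IDs passed to these helpers are placeholders supplied by the caller.

# List the instance types that report EFA support in the current region.
list_efa_instance_types() {
    aws ec2 describe-instance-types \
        --filters Name=network-info.efa-supported,Values=true \
        --query 'InstanceTypes[*].InstanceType' \
        --output text
}

# Build the --network-interfaces value that requests an EFA as the primary interface.
efa_interface_spec() {
    local security_group="$1" subnet="$2"
    printf 'DeviceIndex=0,InterfaceType=efa,Groups=%s,SubnetId=%s' \
        "$security_group" "$subnet"
}

# Allow all traffic between members of the same security group, in both directions.
# Each call fails if an identical rule already exists.
allow_self_traffic() {
    local security_group="$1"
    local rule="IpProtocol=-1,UserIdGroupPairs=[{GroupId=${security_group}}]"
    aws ec2 authorize-security-group-ingress \
        --group-id "$security_group" --ip-permissions "$rule"
    aws ec2 authorize-security-group-egress \
        --group-id "$security_group" --ip-permissions "$rule"
}

# Launch COUNT EFA-enabled c5n.metal instances into an existing cluster placement group.
launch_efa_instances() {
    local ami="$1" security_group="$2" subnet="$3" placement_group="$4" count="$5"
    aws ec2 run-instances \
        --image-id "$ami" \
        --instance-type c5n.metal \
        --count "$count" \
        --placement "GroupName=${placement_group}" \
        --network-interfaces "$(efa_interface_spec "$security_group" "$subnet")"
}
```

Nothing runs until one of the functions is called, for example `allow_self_traffic sg-EXAMPLE` followed by `launch_efa_instances ami-EXAMPLE sg-EXAMPLE subnet-EXAMPLE my-cluster-pg 64`, with the example IDs replaced by your own.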

When you enable EFA, you’re not just getting faster networking; you’re fundamentally changing how your application communicates. Instead of relying on the OS to manage every byte of network traffic, EFA allows your application to speak directly to the network hardware. This bypasses significant overhead, especially in highly parallel applications where millions of small messages are exchanged. The kernel’s network stack, with its inherent latency and processing costs, is largely sidestepped.

Many users are unaware that EFA’s performance benefits are most pronounced when applications use small, frequent messages. While EFA excels at high throughput, its true magic lies in its ability to reduce the latency associated with these numerous, small data transfers that are typical in tightly synchronized HPC workloads. This is achieved through its direct memory access capabilities and the elimination of kernel-level packet processing for each message.
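
One way to observe this on your own cluster is a two-node ping-pong latency test. The sketch below assumes the OSU Micro-Benchmarks have been built against your MPI installation, that `osu_latency` lives at the hypothetical path held in `OSU_LATENCY`, and that `~/mpi_hosts` lists at least two hosts; `latency_for_size` and `run_efa_latency` are helper names introduced here, not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical location of the OSU latency benchmark; adjust to your build.
OSU_LATENCY="${OSU_LATENCY:-$HOME/osu-micro-benchmarks/osu_latency}"

# latency_for_size SIZE: read osu_latency output on stdin and print the
# latency column (microseconds) for the row whose message size is SIZE.
latency_for_size() {
    awk -v size="$1" '$1 !~ /^#/ && $1 == size { print $2 }'
}

# Run a 2-rank ping-pong, one rank per node, over the EFA provider.
run_efa_latency() {
    mpirun --hostfile ~/mpi_hosts --np 2 --map-by ppr:1:node \
        -x FI_PROVIDER=efa "$OSU_LATENCY"
}

# Report the 8-byte latency only if the benchmark binary is present.
if [ -x "$OSU_LATENCY" ]; then
    run_efa_latency | latency_for_size 8
fi
```

Repeating the same measurement over the regular TCP stack gives a baseline, and the gap between the two runs at small message sizes is the latency benefit described above.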

The next step after successfully connecting your EC2 HPC cluster with EFA is to optimize your MPI application’s communication patterns to fully leverage the reduced latency and increased bandwidth.
