Datadog Network Performance Monitoring (NPM) doesn’t just show you if there’s traffic; it reveals the hidden conversations between your services and how long those whispers take to travel.

Let’s see it in action. Imagine you’ve got a microservice architecture. You deploy a new version of your user-service, and your dashboards immediately show a latency spike for requests hitting the order-service.

Here’s what the aggregated view of that dependency might look like in Datadog NPM (illustrative numbers):

Service A (user-service) -> Service B (order-service)
  - Request Count: 15,000/min
  - Avg Latency: 850ms (normal is < 200ms)
  - Error Rate: 5% (up from 0.1%)

This immediately tells you something is wrong between user-service and order-service. You can then drill down using Datadog’s Network Map or by filtering traffic by specific hosts, ports, or tags. You might see that the increased latency is correlated with a specific subnet or even a particular container instance.

Datadog NPM gets this visibility by observing connections in the kernel, so it needs no changes to your application code. It does need the Datadog Agent on each host you want to see: the Agent’s system-probe component uses eBPF (extended Berkeley Packet Filter) on Linux, and a network driver on Windows, to track the connections each host opens or accepts. NPM does not rely on packet mirroring or SPAN ports. The flow data is sent to Datadog, where it’s aggregated and enriched with tags to provide metrics such as data transfer volume, TCP retransmits, round-trip time, and connection counts between any two network endpoints.

The core problem NPM solves is the "black box" problem of distributed systems. When a request fails or slows down, it’s often unclear whether the issue lies within the application code, the underlying infrastructure, or the network itself. NPM provides the visibility to pinpoint network-related bottlenecks. You can see, for instance, if requests are being dropped at the firewall, if there’s packet loss between availability zones, or if a particular service is saturating its network interface.

Here’s how you configure it:

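For Cloud Environments (AWS, Azure, GCP): Collection still happens in the Datadog Agent, so you install the Agent on the instances or nodes you want to observe and enable network monitoring there. Cloud-native capture features such as VPC Traffic Mirroring (AWS), Network Watcher packet capture (Azure), or VPC Flow Logs and Packet Mirroring (GCP) are separate tools and are not what feeds NPM. Enabling the matching Datadog cloud integration adds provider tags (region, availability zone, and so on) that you can use to slice traffic.

  • AWS: Run the Agent on your EC2 instances or EKS nodes.
  • Azure: Run the Agent on your VMs or AKS nodes.
  • GCP: Run the Agent on your Compute Engine instances or GKE nodes.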
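On an individual Linux host (a cloud VM or an on-premises machine) with a package-installed Agent, the switch usually lives in the system-probe configuration file rather than in the main Agent config. The fragment below is a minimal sketch, assuming the default Linux install path; check the file location and any extra options against the Agent version you run.

```yaml
# /etc/datadog-agent/system-probe.yaml
# (assumed default path for a Linux package install)
network_config:
  enabled: true   # turns on network flow collection in system-probe
```

After changing this file, restart the Agent so system-probe picks up the setting.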

For On-Premises/Kubernetes: You’ll deploy the Datadog Agent with Network Performance Monitoring enabled. For Kubernetes, this means running the Agent as a DaemonSet, so there is one Agent per node, with enough privileges to observe traffic at the node level.

  • Kubernetes DaemonSet Example (datadog-agent deployment, abridged):
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: datadog-agent
      # ... other metadata
    spec:
      # ... selector and template labels
      template:
        spec:
          containers:
          - name: agent
            image: datadog/agent:7.x.x
            securityContext:
              privileged: true # Broad grant that lets the agent load eBPF programs
            env:
              - name: DD_API_KEY
                value: "YOUR_DATADOG_API_KEY" # Prefer a Kubernetes Secret in practice
              - name: DD_SITE
                value: "datadoghq.com"
              - name: DD_SYSTEM_PROBE_NETWORK_ENABLED
                value: "true" # Enable NPM collection in system-probe
              # ... other configurations (host mounts, process agent, etc.)
    
    The privileged: true setting allows the agent to load eBPF programs into the kernel and observe connections on the node. It is a broad grant: Datadog’s Helm chart and reference manifests instead run system-probe as a separate container with a specific set of Linux capabilities, which is the safer choice where your cluster policy allows it.

The DD_SYSTEM_PROBE_NETWORK_ENABLED: "true" environment variable is the switch that tells the Agent’s system-probe to start collecting network flow data. It doesn’t require modifying your application’s code or its dependencies, making it a powerful tool for gaining immediate network visibility.
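If you install the Agent with Datadog’s Helm chart rather than a hand-written manifest, the same feature is typically a single values flag, and the chart takes care of the system-probe container and its permissions. The fragment below is a sketch under that assumption; the secret name datadog-secret is hypothetical and stands in for whatever Secret holds your API key.

```yaml
# values.yaml for the datadog/datadog Helm chart (abridged sketch)
datadog:
  apiKeyExistingSecret: datadog-secret   # hypothetical Secret name holding the API key
  site: datadoghq.com
  networkMonitoring:
    enabled: true                        # enables NPM via system-probe on every node
```

This keeps the API key out of the manifest and avoids granting privileged: true to the main agent container.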

One aspect often overlooked is how Datadog NPM handles encrypted traffic. NPM works from connection metadata rather than payloads, so TLS doesn’t blind it. It can’t inspect the contents of TLS-encrypted packets, but it can still report TCP round-trip time, retransmits, connection counts, and the volume of data transferred for each flow. That is often enough to tell whether a slowdown originates on the network path or in application-level processing, even when you can’t see inside the request.

Once you’re comfortable with basic network traffic analysis, the next logical step is to correlate this network performance data with application traces and logs to achieve full-stack observability.
