Running the Datadog Agent on Kubernetes as a DaemonSet gives you visibility into every node in your cluster, but getting it running can feel like herding cats.
Here’s how a Datadog Agent DaemonSet actually works, and how to get it to report metrics from all your nodes.
First, you need to understand that the Datadog Agent isn’t a single process running somewhere. It’s a collection of pods, one per node, managed by Kubernetes. Each pod is an instance of the Datadog Agent, and it’s responsible for collecting metrics and logs from the node it’s running on. These pods are scheduled by Kubernetes as a DaemonSet, meaning Kubernetes ensures that a copy of the Agent pod runs on each node (or a subset of nodes, if you configure node selectors).
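To make the scheduling behaviour concrete, here is a stripped-down DaemonSet manifest. It is a sketch, not Datadog’s official manifest: the resource name, namespace, labels, and the Secret called datadog-secret are assumptions made for illustration, and a real deployment also needs RBAC, volumes, and further settings.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog-agent        # hypothetical name
  namespace: datadog         # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: datadog-agent
  template:
    metadata:
      labels:
        app: datadog-agent   # must match spec.selector.matchLabels
    spec:
      # Optional: restrict the Agent to a subset of nodes.
      nodeSelector:
        kubernetes.io/os: linux
      # Tainted nodes (for example control-plane nodes) are skipped
      # unless the pod tolerates the taint.
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: agent
          image: gcr.io/datadoghq/agent:7.50.0
          env:
            - name: DD_API_KEY
              valueFrom:
                secretKeyRef:
                  name: datadog-secret   # hypothetical Secret
                  key: api-key
```

The nodeSelector and tolerations fields are what decide whether "one per node" really means every node: without the toleration, tainted nodes would have no Agent pod and would not report.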
The Agent collects data in a few ways:
- Node-level metrics: The Agent pod mounts host paths such as /proc and /sys/fs/cgroup (and, for some features, runs with elevated privileges), which lets it gather CPU, memory, disk, and network metrics for the entire node.
- Container metrics: It asks the kubelet on its own node which pods and containers are running, then reads per-container metrics from the kubelet and the container runtime (through the CRI socket, or the Docker API on Docker-based nodes).
- Logs: It can be configured to tail specific log files on the host or within containers.
- APM and Tracing: For application performance monitoring, the Agent acts as a trace agent, receiving traces from instrumented applications and forwarding them to Datadog.
To install the Datadog Agent as a DaemonSet, you’ll typically use a Helm chart or a YAML manifest provided by Datadog. The core of the configuration involves setting up the Agent’s datadog.apiKey and potentially datadog.appKey for full functionality.
Here’s a snippet from a typical values.yaml for the Datadog Helm chart:
datadog:
  apiKey: <YOUR_DATADOG_API_KEY>
  appKey: <YOUR_DATADOG_APP_KEY> # Optional, but recommended for Agent features
  site: datadoghq.com # Or your Datadog EU site, e.g., datadoghq.eu
  # Extra environment variables passed to the Agent containers
  env:
    - name: DD_CLUSTER_NAME
      value: "my-k8s-cluster" # Identify your cluster
    - name: DD_KUBERNETES_COLLECTOR_ENABLED
      value: "true" # Essential for Kubernetes metrics
    - name: DD_LOGS_ENABLED
      value: "true" # Enable log collection
    - name: DD_APM_ENABLED
      value: "true" # Enable APM tracing
# If you're using Cluster Agent for advanced features, configure it here
# clusterAgent:
#   enabled: true
agents:
  # This is the core DaemonSet configuration
  image:
    name: agent # Image name; the chart prepends its configured registry
    tag: 7.50.0 # Use the latest stable version
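When you apply this configuration, Helm creates a DaemonSet resource. Kubernetes then ensures that a pod matching the DaemonSet’s template is scheduled and runs on each eligible node; nodes with taints, such as control-plane nodes, are skipped unless the Agent pod tolerates them. Host-level visibility comes mainly from the hostPath volumes mounted into the Agent pod (for example /proc, /sys/fs/cgroup, and the container runtime socket). Some setups additionally set privileged: true in the security context, but that grants far more than most collection needs, so treat it as an exception rather than the default.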
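Host access is easier to reason about when you see the mounts themselves. The excerpt below sketches the relevant part of an Agent pod template. The volume names and mount paths are modeled on commonly published Agent manifests and should be treated as assumptions; compare them with the manifest your chart version actually renders.

```yaml
# Excerpt of a DaemonSet pod template (spec.template.spec), not a full manifest
containers:
  - name: agent
    image: gcr.io/datadoghq/agent:7.50.0
    volumeMounts:
      - name: procdir            # host process information
        mountPath: /host/proc
        readOnly: true
      - name: cgroups            # per-container resource accounting
        mountPath: /host/sys/fs/cgroup
        readOnly: true
      - name: runtimesocketdir   # directory holding the container runtime socket
        mountPath: /host/var/run
        readOnly: true
volumes:
  - name: procdir
    hostPath:
      path: /proc
  - name: cgroups
    hostPath:
      path: /sys/fs/cgroup
  - name: runtimesocketdir
    hostPath:
      path: /var/run
```

Mounting these paths read-only keeps the Agent’s footprint on the host narrow while still giving it the data it needs for node and container metrics.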
One detail that surprises many people is where the node Agent gets its view of pods and containers. Rather than watching the cluster-wide Kubernetes API server, each node Agent mainly queries the kubelet on its own node for the list of local pods, and reads resource usage from the kubelet and the container runtime (CRI socket or Docker socket). Cluster-level data that only the API server holds, such as events, is usually collected through the Cluster Agent, so that every node Agent does not have to query the API server directly.
Let’s look at how this discovery works in practice. With Kubernetes collection enabled (DD_KUBERNETES_COLLECTOR_ENABLED in the configuration above), the Agent keeps track of the pods scheduled on its node. When a new pod starts there, the Agent picks it up and inspects its metadata, including labels, annotations, and container configurations. For each container in that pod, the Agent decides which checks to run based on its own configuration and on the pod’s annotations (this is the Autodiscovery feature). It then collects container-specific metrics, such as CPU and memory usage per container, from the kubelet or from the container runtime. Because collection is driven by the live pod list, Datadog adapts to your cluster’s state without you having to configure collection manually for every new application.
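Annotations are the part of this flow you control from the application side. The pod below is a hedged example of an Autodiscovery annotation that asks the Agent to run its Redis check against the container; the pod name and image are hypothetical, and the annotation format shown is the one used by recent Agent 7 releases, so verify it against the version you run.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: redis               # hypothetical workload
  annotations:
    # Format: ad.datadoghq.com/<container name>.checks
    # "redis" here must match the container name in spec.containers.
    ad.datadoghq.com/redis.checks: |
      {
        "redisdb": {
          "instances": [
            {"host": "%%host%%", "port": "6379"}
          ]
        }
      }
spec:
  containers:
    - name: redis
      image: redis:7
      ports:
        - containerPort: 6379
```

The %%host%% template variable is resolved by the Agent to the container’s IP at discovery time, which is why no address has to be hard-coded per pod.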
The DD_LOGS_ENABLED setting, when true, tells the Agent to start its log collection process. On its own it does not collect logs from every container: for that, also set DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL to true, or opt containers in individually with annotations. On Kubernetes the Agent typically tails container log files on the host under /var/log/pods, with /var/log/containers/*.log holding symlinks to the same files, so these directories must be mounted into the Agent pod. The Agent uses the pod names and container IDs it has already discovered to correlate these logs with the correct application and pod in Datadog.
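If you install through the Helm chart, the same log settings can usually be expressed as chart values rather than raw environment variables. A short sketch, assuming a recent chart version; check the key names against the values.yaml shipped with your chart.

```yaml
datadog:
  logs:
    enabled: true              # corresponds to DD_LOGS_ENABLED
    containerCollectAll: true  # corresponds to DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
```

Using the chart keys has the side benefit that the chart can also add the log directory mounts the Agent needs, which plain environment variables do not do.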
For APM, when DD_APM_ENABLED is true, the Agent’s trace component listens on TCP port 8126 by default. Your instrumented applications (e.g., Java, Python, Go apps) are configured to send their trace data to this port. The Agent then forwards the traces to the Datadog backend. Acting as a local aggregation point, it takes buffering and batching work off your applications and handles delivery to Datadog.
The one thing most people don’t realize is how applications actually reach the Agent for APM tracing. Each pod has its own network namespace, so localhost inside an application pod refers to that pod, not to the Agent. To make the Agent reachable, the DaemonSet publishes the trace port on the node (typically with a hostPort on 8126, or by running the Agent with hostNetwork), and applications send traces to the IP address of the node they run on, which Kubernetes can inject through the Downward API (status.hostIP). This depends on the port mapping rather than on privileged mode. An alternative is a Unix domain socket shared through a hostPath volume. Either way, the Agent becomes a trace collector for any application on that node, regardless of which pod network namespace the application runs in.
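One common way to point an application at the Agent on its own node is to inject the node’s IP through the Downward API. The Deployment below is a minimal sketch, assuming the Agent’s trace port is published on the host as 8126; the application name, labels, and image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical application
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-registry/my-app:1.0.0   # hypothetical image
          env:
            # IP of the node this pod was scheduled on
            - name: DD_AGENT_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            # Port the tracing library sends traces to
            - name: DD_TRACE_AGENT_PORT
              value: "8126"
```

Because status.hostIP is resolved per pod at scheduling time, each replica reports to the Agent on whichever node it lands on, with no per-node configuration.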
After successfully installing and verifying that your Agent pods are running and reporting to Datadog, the next hurdle is often ensuring that your containerized applications are properly instrumented for APM to see distributed traces.