Datadog can monitor Kubernetes clusters on GKE, EKS, and AKS by collecting metrics, logs, and traces from your cluster components and workloads.

Let’s see it in action.

Imagine you have a simple Nginx deployment running on your GKE cluster. You want to see its request latency, error rates, and resource utilization, all within Datadog.

First, you’ll install the Datadog Agent as a DaemonSet on your Kubernetes cluster. This agent is responsible for collecting data.

Here’s a snippet of the Datadog Agent DaemonSet YAML, focusing on the core configuration:

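apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog-agent
  namespace: datadog
spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: datadog-agent
  template:
    metadata:
      labels:
        app: datadog-agent
    spec:
      # Service account, RBAC, and volume mounts are omitted from this snippet
      containers:
      - name: agent
        image: datadog/agent:7.39.0
        env:
        - name: DD_API_KEY
          valueFrom:
            secretKeyRef:
              name: datadog-secret
              key: api-key
        - name: DD_SITE
          value: "datadoghq.com" # or datadoghq.eu, us3.datadoghq.com, etc.
        # Crucial for tagging. Datadog's docs name this setting DD_CLUSTER_NAME;
        # check which name your Agent version reads.
        - name: DD_KUBERNETES_CLUSTER_NAME
          value: "my-gke-cluster"
        - name: DD_LOGS_ENABLED
          value: "true"
        # Collect logs from all containers; the Agent tails files under /var/log/pods
        - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
          value: "true"
        # Address of the kubelet on the node this agent pod runs on
        - name: DD_KUBERNETES_KUBELET_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        ports:
        - containerPort: 8125 # DogStatsD custom metrics
          name: dogstatsdport
          protocol: UDP
        - containerPort: 8126 # APM traces
          hostPort: 8126
          name: traceport
          protocol: TCP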
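
The DD_API_KEY entry above reads from a Secret named datadog-secret with the key api-key. A minimal sketch of that Secret, assuming the datadog namespace already exists; the value shown is a placeholder, not a real key:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: datadog-secret
  namespace: datadog
type: Opaque
stringData:
  # Placeholder: substitute your own Datadog API key
  api-key: "<YOUR_DATADOG_API_KEY>"
```

Using stringData lets you write the key in plain text in the manifest; Kubernetes stores it base64-encoded.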

When this DaemonSet is deployed, the datadog-agent pods will run on each Kubernetes node. Each agent pod will then:

  • Scrape metrics: It uses Autodiscovery to find pods and services whose annotations describe a check or a Prometheus scrape endpoint, then pulls their metrics. Node- and container-level metrics (the kubernetes.* family) come from the kubelet on each node. Object-state metrics such as kube_pod_info and kube_node_status_allocatable originate from kube-state-metrics, which Datadog reports under kubernetes_state.*.
  • Collect logs: With log collection enabled, it tails the container log files that Kubernetes writes under /var/log/pods on each node and forwards them to Datadog.
  • Gather traces: If your applications are instrumented with Datadog’s APM libraries, the agent also receives their traces and forwards them to Datadog.
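
For the trace path in the last bullet, the application has to know where its node-local agent is. A minimal sketch of the relevant part of an application Deployment, assuming the agent’s trace port (8126) is exposed on the node via hostPort and that the agent accepts non-local traffic (DD_APM_NON_LOCAL_TRAFFIC set to "true" on the agent). The name my-app, its image, and the env/service/version values are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # hypothetical application
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0.0   # hypothetical image
        env:
        # Datadog tracing libraries send traces to this host on port 8126
        - name: DD_AGENT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        # Unified service tagging (values are placeholders)
        - name: DD_ENV
          value: "staging"
        - name: DD_SERVICE
          value: "my-app"
        - name: DD_VERSION
          value: "1.0.0"
```

Pointing DD_AGENT_HOST at status.hostIP keeps trace traffic on the node where the application pod runs.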

The DD_KUBERNETES_CLUSTER_NAME environment variable is fundamental (Datadog’s documentation names this setting DD_CLUSTER_NAME, so check which name your Agent version reads). It tags all collected data with the name of your cluster, which lets you filter and aggregate metrics for that cluster in Datadog dashboards.

Once the agent is running and sending data, you’ll see your cluster appear in the Datadog Kubernetes integration. You can then create dashboards. For your Nginx deployment, you might have a dashboard showing:

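  • Request Rate: sum:nginx.net.request_per_s{kube_cluster_name:my-gke-cluster, kube_namespace:default, app:nginx} by {pod_name}
  • P95 Latency: p95:nginx.request_time{kube_cluster_name:my-gke-cluster, kube_namespace:default, app:nginx} by {pod_name} (this assumes request time is submitted as a distribution metric, for example derived from access logs; the stock NGINX check does not report latency)
  • CPU Usage: sum:kubernetes.cpu.usage.total{kube_cluster_name:my-gke-cluster, kube_namespace:default, app:nginx} by {pod_name}
  • Memory Usage: sum:kubernetes.memory.usage{kube_cluster_name:my-gke-cluster, kube_namespace:default, app:nginx} by {pod_name}

The app:nginx filter assumes pod labels are mapped to tags, for example with DD_KUBERNETES_POD_LABELS_AS_TAGS.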

The agent also automatically collects cluster-level metrics, such as node CPU/memory usage, pod counts, and API server request rates, all tagged by cluster, node, and namespace.

The Datadog Agent can also discover and scrape Prometheus endpoints from annotations such as prometheus.io/scrape: "true", which is a significant time-saver. This behavior is opt-in: it has to be enabled on the agent (DD_PROMETHEUS_SCRAPE_ENABLED set to "true"). Once it is on, you don’t need to configure each application’s metrics collection by hand; annotate your Kubernetes Pod (or Service, if service endpoint scraping is also enabled), and the agent picks it up.
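
As an illustration, the pod template of a workload that exposes Prometheus metrics might carry annotations like these. This is a sketch under assumptions: port 9113 and the /metrics path are hypothetical values for an exporter sidecar, and Prometheus scraping must already be enabled on the agent (DD_PROMETHEUS_SCRAPE_ENABLED set to "true").

```yaml
# Fragment of a Deployment's pod template (spec.template)
metadata:
  labels:
    app: nginx
  annotations:
    prometheus.io/scrape: "true"   # opt this pod in to scraping
    prometheus.io/port: "9113"     # hypothetical exporter port
    prometheus.io/path: "/metrics" # hypothetical metrics path
```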

Beyond metrics, Datadog’s Kubernetes integration provides network performance monitoring using eBPF, security monitoring for Kubernetes-specific threats, and the ability to correlate logs and traces directly with your infrastructure metrics.

The DD_SITE variable directs your data to the correct Datadog site. If your account is in the EU, set DD_SITE to "datadoghq.eu". API keys are valid only for the site that issued them, so an agent pointed at the wrong site has its data rejected and nothing appears in your account.

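The DD_KUBERNETES_KUBELET_HOST variable, populated from status.hostIP, tells the agent where to reach the kubelet on its own node. The kubelet is the agent’s source for the list of local pods and for container-level resource metrics, so a wrong value leaves those metrics and their tags missing.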

The real power comes from the metadata Datadog attaches. Metrics, logs, and traces are tagged with Kubernetes attributes such as kube_cluster_name, kube_namespace, kube_deployment, pod_name, and kube_container_name. This allows granular analysis and alerting: you can break performance data down by deployment, namespace, or individual pod.
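
Pod labels such as app are not among these tags unless you map them. One way to do that, sketched here as an extra entry in the agent container’s env list, is the DD_KUBERNETES_POD_LABELS_AS_TAGS variable. The mapping of the app label to an app tag is an assumption chosen to match the dashboard queries in this article:

```yaml
# Add to the agent container's env list
- name: DD_KUBERNETES_POD_LABELS_AS_TAGS
  # JSON map: Kubernetes pod label name -> Datadog tag name
  value: '{"app": "app"}'
```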

Restarts and scaling are handled by the DaemonSet controller rather than by the agent itself. When new nodes join the cluster, Kubernetes schedules an agent pod on each of them, and the agent tracks pods as they are rescheduled, so visibility stays continuous.
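
One caveat: a DaemonSet pod is only scheduled onto nodes whose taints it tolerates. If some of your node pools are tainted, a toleration along these lines, added to the DaemonSet’s pod spec, lets the agent run there as well. This is a sketch; tolerating every taint is a broad choice that you may want to narrow:

```yaml
# Add under spec.template.spec of the DaemonSet
tolerations:
- operator: Exists   # tolerate every taint so the agent lands on all nodes
```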

Datadog learns your cluster topology from two sources: the kubelet on each node, which lists the pods running locally, and the Kubernetes API, which describes services, deployments, nodes, and more. This information is then used to enrich the telemetry data the agent collects, providing context for your metrics, logs, and traces.
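
Reading this information requires RBAC permissions for the agent’s service account. A minimal sketch of a ClusterRole and its binding, assuming a service account named datadog-agent in the datadog namespace (hypothetical here). Datadog’s published manifests remain the authoritative list of rules:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent
rules:
# Cluster objects used to enrich telemetry with topology metadata
- apiGroups: [""]
  resources: ["pods", "services", "endpoints", "nodes", "events"]
  verbs: ["get", "list", "watch"]
# Kubelet endpoints reached through the node resource
- apiGroups: [""]
  resources: ["nodes/metrics", "nodes/spec", "nodes/proxy", "nodes/stats"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-agent
subjects:
- kind: ServiceAccount
  name: datadog-agent      # hypothetical service account
  namespace: datadog
```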

You can extend the Datadog Agent’s capabilities by enabling integrations. For example, if you’re using Prometheus for metrics but want to centralize everything in Datadog, the agent can scrape Prometheus endpoints. For cloud-provider-specific metrics (like GCE instance metrics for GKE nodes), Datadog can also collect these via cloud integrations and correlate them with your Kubernetes data.
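
For the Nginx deployment in this walkthrough, the NGINX integration can be switched on with Autodiscovery annotations on the pod template. A sketch, assuming Agent 7.36 or later (for this annotation format), a container named nginx, and a stub_status endpoint served at /nginx_status/ on port 81. The port and path are hypothetical and must match your Nginx configuration:

```yaml
# Fragment of the Nginx Deployment's pod template (spec.template.metadata)
annotations:
  ad.datadoghq.com/nginx.checks: |
    {
      "nginx": {
        "init_config": {},
        "instances": [
          {"nginx_status_url": "http://%%host%%:81/nginx_status/"}
        ]
      }
    }
```

The %%host%% template variable is resolved by the agent to the pod’s IP at runtime, so the same annotation works for every replica.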

The next step is often to set up custom alerts based on the metrics you’re now collecting.
