CoreDNS is a surprisingly flexible DNS server, but most people only ever use it to resolve names and miss its value as an observability tool.

Let’s see what that looks like. Imagine you’ve got a Kubernetes cluster and you’re running CoreDNS. You’ve deployed a simple web service, my-app, and you want to check its DNS resolution.

kubectl exec -it busybox -- nslookup my-app.default.svc.cluster.local

This simple nslookup is hitting CoreDNS. But what if my-app isn’t resolving? Or what if it’s taking too long? That’s where monitoring comes in. CoreDNS exposes metrics that Prometheus can scrape and Grafana can visualize.

Here’s a snippet of what those metrics might look like. With the prometheus plugin enabled, CoreDNS serves them over HTTP at /metrics, on port 9153 by default:

# HELP coredns_dns_request_duration_seconds The latency of DNS requests.
# TYPE coredns_dns_request_duration_seconds histogram
coredns_dns_request_duration_seconds_bucket{server="dns://:53",zone=".",le="0.00025"} 10
coredns_dns_request_duration_seconds_bucket{server="dns://:53",zone=".",le="0.0005"} 25
coredns_dns_request_duration_seconds_bucket{server="dns://:53",zone=".",le="+Inf"} 50
coredns_dns_request_duration_seconds_sum{server="dns://:53",zone="."} 0.05
coredns_dns_request_duration_seconds_count{server="dns://:53",zone="."} 50

This histogram tells you about the latency of DNS requests, broken down by the server that handled them and the zone being queried (the exact label set varies between CoreDNS versions). Each _bucket line counts the requests that finished within its le bound, _count is the total number of requests, and _sum is the total time spent, so 0.05 s over 50 requests is an average of 1 ms. Response codes such as NOERROR are reported by a separate counter, coredns_dns_responses_total, through its rcode label.
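
To make the histogram arithmetic concrete, here is a minimal Python sketch that estimates a latency percentile from cumulative buckets, in the same spirit as PromQL’s histogram_quantile. The bucket bounds and counts are hypothetical, and real bucket lines carry their upper bound in an le label. Treat it as an illustration of the interpolation idea, not as Prometheus’s exact implementation.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]              # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                return prev_bound       # cannot interpolate inside this bucket
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative buckets: (upper bound in seconds, request count).
buckets = [(0.00025, 10), (0.0005, 25), (0.001, 40), (math.inf, 50)]

mean_latency = 0.05 / 50                # _sum divided by _count
p50 = histogram_quantile(0.5, buckets)
p90 = histogram_quantile(0.9, buckets)
print(f"mean={mean_latency:.4f}s p50={p50:.4f}s p90={p90:.4f}s")
```

Because the 90th percentile falls in the +Inf bucket here, the estimate is capped at the highest finite bound, which is one reason bucket boundaries matter when you read percentile panels.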

To get this into Grafana, you’ll typically set up a Prometheus instance to scrape the /metrics endpoint of your CoreDNS pods. In your prometheus.yml configuration, you might have something like this within your scrape_configs:

- job_name: 'coredns'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Keep only pods annotated with prometheus.io/scrape: "true".
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    # Rewrite the scrape address to <pod IP>:<annotated port>.
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    # Label each target as <namespace>-<pod name>.
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
      separator: '-'
      target_label: instance
    # Honor a custom metrics path from prometheus.io/path, if one is set.
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      regex: (.+)
      target_label: __metrics_path__

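This configuration tells Prometheus to discover pods, keep the ones annotated with prometheus.io/scrape: "true", and scrape their metrics endpoint. Note that it matches any pod carrying that annotation, not only CoreDNS. To include CoreDNS, add the annotation, along with prometheus.io/port: "9153", to the pod template of your CoreDNS Deployment.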
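
If the address rewriting step feels opaque, the following Python sketch imitates what a replace rule does when it combines the discovered pod address with the annotated port. It assumes the widely used pattern ([^:]+)(?::\d+)?;(\d+) with replacement $1:$2; the function name and sample addresses are made up for illustration.

```python
import re

# Prometheus joins the source label values with ';' and anchors the regex,
# so re.fullmatch is the closest Python equivalent.
ADDRESS_PORT = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(address, annotated_port):
    """Return host:port the way a replace rule with $1:$2 would."""
    joined = f"{address};{annotated_port}"
    match = ADDRESS_PORT.fullmatch(joined)
    if match is None:
        return address              # no match: the label is left unchanged
    return f"{match.group(1)}:{match.group(2)}"

# A pod discovered on its DNS port, annotated with prometheus.io/port: "9153".
print(rewrite_address("10.244.0.5:53", "9153"))
```

The discovered port is dropped and replaced by the annotated one, which is what lets Prometheus reach the metrics listener instead of the DNS port.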

Once Prometheus is collecting the metrics, you can import a pre-built CoreDNS dashboard into Grafana. Many community dashboards exist that visualize metrics like:

  • DNS Request Rate: How many queries are being handled per second.
  • DNS Response Codes: Distribution of NOERROR, NXDOMAIN, SERVFAIL, etc., crucial for spotting resolution issues.
  • Request Latency: Average and percentile latency for DNS lookups, vital for application performance.
  • Cache Hit Rate: How often CoreDNS is serving answers from its cache, indicating efficiency.
  • Upstream Server Performance: If CoreDNS forwards queries to external resolvers, you can monitor their latency and success rates.
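
Two of these panels boil down to simple arithmetic on counters. The Python sketch below assumes you have raw (timestamp, value) samples of a request counter such as coredns_dns_requests_total, plus totals from cache counters (assumed here to be coredns_cache_hits_total and coredns_cache_misses_total; names can differ between CoreDNS versions). It is deliberately simpler than PromQL’s rate(), which also extrapolates to the edges of the time window.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset, for example a pod restart.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0

def cache_hit_ratio(hits, misses):
    """Fraction of cache lookups answered from the cache."""
    total = hits + misses
    return hits / total if total else 0.0

# Hypothetical samples; the counter resets between t=30 and t=60.
requests = [(0, 100), (30, 160), (60, 40), (90, 120)]
print(per_second_rate(requests))    # queries per second
print(cache_hit_ratio(900, 100))    # share of lookups served from cache
```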

The real power comes when you correlate these metrics. For instance, if you see a spike in SERVFAIL responses together with rising latency for queries forwarded to a specific upstream server, you have strong evidence that a problem outside your cluster is affecting internal resolution.

A common misconception is that CoreDNS metrics are only for "DNS experts." In reality, they provide direct insight into the health of your cluster’s internal and external name resolution, which is fundamental for almost every Kubernetes workload. If your pods can’t talk to each other by name, nothing works.

The coredns_forward_request_duration_seconds metric, for example, measures the latency incurred when CoreDNS has to ask another DNS server for an answer. This is distinct from the total query time and is critical for spotting bottlenecks that lie in CoreDNS’s dependencies rather than in CoreDNS itself.

If you’ve got CoreDNS configured to use multiple upstream servers, the forward metrics are reported per upstream (through the to label), so you can tell when one is slow or failing while the others are healthy.
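
As a rough illustration of that per-upstream comparison, the Python sketch below takes the histogram _sum and _count for each upstream and flags the ones whose mean latency exceeds a threshold. The addresses, numbers, and the 50 ms threshold are all hypothetical.

```python
def slow_upstreams(stats, threshold_seconds):
    """Return {upstream: mean latency} for upstreams slower than the threshold.

    stats maps an upstream address to (duration_sum_seconds, request_count).
    """
    slow = {}
    for upstream, (total, count) in stats.items():
        if count == 0:
            continue                # no traffic, nothing to judge
        mean = total / count
        if mean > threshold_seconds:
            slow[upstream] = mean
    return slow

# Hypothetical per-upstream totals taken from the forward latency histogram.
stats = {
    "10.0.0.2:53": (1.2, 400),      # about 3 ms per query
    "10.0.0.3:53": (45.0, 300),     # about 150 ms per query
    "10.0.0.4:53": (0.0, 0),        # idle upstream
}
print(slow_upstreams(stats, threshold_seconds=0.05))
```

A mean hides tail latency, so in practice you would likely compare percentiles per upstream as well.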

The next step in mastering CoreDNS observability is to start creating alert rules in Prometheus based on these metrics, such as alerts for high SERVFAIL rates or sustained increases in request latency.
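
Before writing a real rule, it can help to see the logic an alert encodes. This Python sketch fires only when the SERVFAIL ratio stays above a threshold for several consecutive evaluations, which loosely mirrors the for clause of a Prometheus alerting rule. The ratios, the 5% threshold, and the function name are assumptions for illustration.

```python
def should_fire(ratios, threshold, for_samples):
    """True if the last `for_samples` ratios all exceed `threshold`."""
    if for_samples <= 0 or len(ratios) < for_samples:
        return False
    return all(r > threshold for r in ratios[-for_samples:])

# Hypothetical SERVFAIL ratios, one per evaluation interval, oldest first.
servfail_ratios = [0.00, 0.01, 0.08, 0.09, 0.12]

print(should_fire(servfail_ratios, threshold=0.05, for_samples=3))
print(should_fire(servfail_ratios, threshold=0.05, for_samples=4))
```

Requiring a sustained breach keeps a single bad scrape from paging anyone, at the cost of a slower reaction to real outages.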
