The Kubernetes DNS resolver is failing to get answers from CoreDNS quickly enough, causing pods to time out when trying to reach other services.

It’s almost always because the DNS client within your pods is trying to resolve names that don’t exist, or external names with only a few dots, and it’s doing so by making a ton of DNS queries that walk the entire configured search domain list before the real answer (or a definitive failure) comes back.

Here are the common culprits and how to fix them:

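1. Pods are trying to resolve short, unqualified names that don’t exist in their local namespace. This is the most frequent offender. The search-list expansion happens in the pod’s own stub resolver (glibc or musl, configured by /etc/resolv.conf), not in CoreDNS; CoreDNS only answers the individual queries it receives. When a pod in my-namespace looks up my-service, the resolver first queries my-service.my-namespace.svc.cluster.local. If that returns NXDOMAIN, it tries my-service.svc.cluster.local, then my-service.cluster.local, then any search domains inherited from the node, and finally my-service as given. Each failed attempt is another round trip to CoreDNS, often doubled because many applications ask for both A and AAAA records.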

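  • Diagnosis: Look at your application logs for repeated lookup failures or timeouts on short names. You can also exec into a pod and query a short name with dig, provided the image ships it. By default dig ignores the search list; +search makes it apply the search list from /etc/resolv.conf, and +showsearch prints the result of each intermediate attempt:

    kubectl exec -it <pod-name> -- sh
    # inside the pod
    dig +search +showsearch my-nonexistent-service

    A run of NXDOMAIN answers for my-nonexistent-service.my-namespace.svc.cluster.local, my-nonexistent-service.svc.cluster.local, and so on means the resolver is walking the whole search list. If dig isn’t available, adding the log plugin to the CoreDNS Corefile shows the same sequence of queries from the server side.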

  • Fix: Lower the ndots setting in the dnsConfig for your pods. ndots is a threshold: a name containing at least ndots dots is first tried as an absolute name, and only then are the search domains appended; a name with fewer dots walks the search list first. Kubernetes sets ndots:5 for pods with the default ClusterFirst policy. With the search path my-namespace.svc.cluster.local svc.cluster.local cluster.local, a lookup of my-service (0 dots, fewer than 5) is tried as:

    1. my-service.my-namespace.svc.cluster.local
    2. my-service.svc.cluster.local
    3. my-service.cluster.local
    4. my-service (the name as given, tried last)

    The same happens to an external name such as api.example.com: it has only 2 dots, so with ndots:5 it also goes through all three search domains before the real name is queried. Setting ndots: 2 (or ndots: 1) tells the resolver to try any name with at least that many dots as written first. In your pod spec:

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-app
    spec:
      containers:
      - name: my-container
        image: my-image
      dnsPolicy: "ClusterFirst" # the default; dnsConfig options are merged into the generated resolv.conf
      dnsConfig:
        options:
          - name: ndots
            value: "2"

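    With ndots: 2, an existing name like api.example.com is resolved on the first query instead of the fourth. Short names such as my-service still have 0 dots, so they still walk the search list; to skip it entirely, use the fully qualified name with a trailing dot (my-service.my-namespace.svc.cluster.local.) in your application’s configuration. The trade-off is that a partially qualified name with two or more dots, such as my-service.my-namespace.svc, now costs one failed absolute query before the search list completes it.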

  • Why it works: Much of the wasted traffic comes from names that are already usable as written, such as external hostnames, being pushed through the search list first because of ndots:5. Lowering ndots lets those names be tried as written, so they resolve on the first query, and writing cluster names as trailing-dot FQDNs removes the search-list walk for them altogether. Fewer queries per lookup also means fewer chances to hit a lost packet and the resolver’s retry timeout.
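The ordering rule is easy to get wrong in your head, so here is a minimal Python sketch of it. It assumes glibc-style behaviour as documented in resolv.conf(5); musl and other stub resolvers differ in details. The function names candidate_queries and worst_case_queries are hypothetical helpers for illustration, and the query count assumes the application asks for both A and AAAA records.

```python
from typing import List


def candidate_queries(name: str, search: List[str], ndots: int) -> List[str]:
    """Names the resolver queries, in order (glibc-style rule, an assumption)."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: the search list is skipped.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as written first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first; the name as written comes last.
    return expanded + [absolute]


def worst_case_queries(name: str, search: List[str], ndots: int,
                       record_types: int = 2) -> int:
    """Queries sent if every candidate fails; record_types=2 assumes A + AAAA."""
    return len(candidate_queries(name, search, ndots)) * record_types


if __name__ == "__main__":
    search = ["my-namespace.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for ndots in (5, 2):
        print(f"ndots={ndots}:", candidate_queries("api.example.com", search, ndots))
    print("my-service, ndots=2:", candidate_queries("my-service", search, 2))
```

With ndots: 5, api.example.com is tried under all three search domains before the real name; with ndots: 2 it is tried as written first. my-service has no dots, so it walks the search list under either setting, and only the trailing-dot form skips it.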

2. Overly long or complex search domain lists. With the default ClusterFirst policy, kubelet writes a search list of <namespace>.svc.cluster.local, svc.cluster.local and cluster.local, then appends the search domains from the node’s own resolv.conf (cloud providers often add one or more internal domains). Custom search domains make the list longer still. Every entry is tried in order whenever a lookup walks the list, so each extra entry adds a round trip to every failed lookup.

  • Diagnosis: Use kubectl exec -it <pod-name> -- cat /etc/resolv.conf. Examine the search line. Count the number of entries.

  • Fix: Manually define the dnsConfig.searches in your pod/deployment spec to be as minimal as possible, only including the necessary domains. For most workloads, this means my-namespace.svc.cluster.local, svc.cluster.local, and cluster.local are sufficient.

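    dnsPolicy: "None" # resolv.conf is built only from dnsConfig below
    dnsConfig:
      nameservers:
        - <your-coredns-service-ip> # e.g., 10.96.0.10 (kubectl get svc -n kube-system kube-dns)
      searches:
        - my-namespace.svc.cluster.local
        - svc.cluster.local
        - cluster.local
      options:
        - name: ndots
          value: "2"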

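    With the default dnsPolicy, "ClusterFirst", kubelet generates /etc/resolv.conf for you, and any dnsConfig.searches entries are appended to the generated list rather than replacing it. To replace the list, set dnsPolicy: "None" and supply the complete dnsConfig yourself, including the nameserver. (Despite its name, dnsPolicy: "Default" is not the default; it makes the pod inherit the node’s resolver configuration.)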

  • Why it works: Reduces the number of DNS queries the resolver has to attempt before getting a definitive answer (or failure). Shorter search lists mean fewer round trips to CoreDNS.
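To check many pods quickly, a small parser helps. This Python sketch (parse_resolv_conf and lookups_per_missing_short_name are hypothetical names) reads the text printed by cat /etc/resolv.conf and reports the nameservers, search list, and ndots value. It assumes glibc conventions: with no ndots option it falls back to glibc’s default of 1, and a later search or domain line replaces an earlier one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResolvConf:
    nameservers: List[str] = field(default_factory=list)
    search: List[str] = field(default_factory=list)
    ndots: int = 1  # glibc default when no "options ndots:N" is present


def parse_resolv_conf(text: str) -> ResolvConf:
    """Parse resolv.conf text into the fields that affect lookup cost."""
    conf = ResolvConf()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            conf.nameservers.append(values[0])
        elif keyword in ("search", "domain"):
            conf.search = list(values)  # the last search/domain line wins
        elif keyword == "options":
            for opt in values:
                if opt.startswith("ndots:"):
                    # glibc silently caps ndots at 15
                    conf.ndots = min(int(opt.split(":", 1)[1]), 15)
    return conf


def lookups_per_missing_short_name(conf: ResolvConf) -> int:
    """One query per search entry plus the final as-written attempt."""
    return len(conf.search) + 1
```

The count for a missing short name is one query per search entry plus the final as-written attempt; double it if the application requests both A and AAAA records.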

3. CoreDNS itself is overloaded or misconfigured. This is less often the root cause of slow lookups, and when it is, it tends to show up as timeouts and SERVFAIL responses rather than uniformly slow answers, but a struggling CoreDNS makes every other problem on this list worse. Typical causes are too many concurrent requests for the number of replicas, slow upstream resolvers, or an expensive plugin configuration.

  • Diagnosis:

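    • Check the CoreDNS logs for errors with kubectl logs -n kube-system -l k8s-app=kube-dns -c coredns. By default only the errors plugin writes to the log; adding the log plugin to the Corefile logs every query with its duration, which is verbose but shows slow queries directly.
    • Monitor CoreDNS pod CPU and memory usage.
    • If the prometheus plugin is enabled, the coredns_dns_request_duration_seconds histogram shows how response times are distributed.
    • If CoreDNS forwards to external resolvers, check the latency of those external resolvers.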
  • Fix:

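    • Scale CoreDNS: Increase the replica count of your CoreDNS deployment. If a DNS autoscaler manages the deployment, adjust the autoscaler’s parameters instead, or it will undo a manual change.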
      kubectl scale deployment coredns --replicas=3 -n kube-system
      
    • Optimize CoreDNS Configuration: Review your Corefile (kubectl get configmap coredns -n kube-system -o yaml). For example, if forward points at slow external DNS servers, point it at faster ones or add caching. A typical Corefile, similar to the kubeadm default, looks like this:
      .:53 {
          errors
          health {
             lameduck 5s
          }
          ready
          kubernetes cluster.local in-addr.arpa ip6.arpa {
             pods insecure
             fallthrough in-addr.arpa ip6.arpa
             ttl 30
          }
          prometheus :9153
          # Non-cluster names go to the resolvers in CoreDNS's own /etc/resolv.conf
          # (usually inherited from the node). If those are slow, list
          # responsive upstreams explicitly here instead.
          forward . /etc/resolv.conf {
             max_concurrent 1000
          }
          cache 30
          loop
          reload
          loadbalance
      }
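      The cache 30 directive caps how long CoreDNS keeps any cached response, including negative answers such as NXDOMAIN, at 30 seconds (records with a shorter TTL expire sooner). Repeat queries within that window are answered from memory instead of going through the kubernetes plugin or the upstream resolvers.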
  • Why it works: More CoreDNS replicas can handle more concurrent requests. Caching reduces the need to hit upstream resolvers for every query.
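The effect of cache 30 can be illustrated with a capped TTL cache. This Python sketch is not CoreDNS’s implementation; CappedTTLCache and its injectable clock are hypothetical, chosen so the expiry logic can be exercised without sleeping. Each entry lives for the record’s own TTL or max_ttl seconds, whichever is shorter, and only misses reach the backend.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Backend = Callable[[str], Tuple[str, float]]  # name -> (answer, record TTL in seconds)


class CappedTTLCache:
    """Cache answers for min(record TTL, max_ttl) seconds (illustration only)."""

    def __init__(self, max_ttl: float, clock: Callable[[], float] = time.monotonic):
        self.max_ttl = max_ttl
        self.clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (answer, expiry)
        self.backend_calls = 0

    def lookup(self, name: str, backend: Backend) -> str:
        now = self.clock()
        hit: Optional[Tuple[str, float]] = self._entries.get(name)
        if hit is not None and hit[1] > now:
            return hit[0]  # fresh entry: answered from memory
        # Miss or expired entry: ask the backend (plugin chain or upstream).
        self.backend_calls += 1
        answer, record_ttl = backend(name)
        self._entries[name] = (answer, now + min(record_ttl, self.max_ttl))
        return answer
```

A record with a 300-second TTL is still refreshed every 30 seconds, which bounds how stale an answer can get, while a burst of identical lookups inside that window costs one backend call.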

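4. Network policy blocking DNS traffic or specific ports. Less common for internal cluster DNS, but possible if you have strict network policies in place: a default-deny egress policy without a DNS exception makes every lookup time out, and allowing UDP but not TCP port 53 breaks only truncated responses, which the resolver retries over TCP.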

  • Diagnosis: Check the NetworkPolicy resources in your namespace (kubectl get networkpolicy -n <your-app-namespace>). If any policy selects your pods for egress, make sure one of them allows traffic to the CoreDNS pods on UDP and TCP port 53.

  • Fix: Add or modify NetworkPolicy to permit the necessary traffic.

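    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-dns
      namespace: <your-app-namespace>
    spec:
      podSelector: {} # Apply to all pods in the namespace
      policyTypes:
      - Egress
      egress:
      - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns # label on CoreDNS pods in kubeadm-based clusters
        ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

    The selectors assume CoreDNS runs in kube-system with the label k8s-app: kube-dns, as in kubeadm-based clusters; confirm with kubectl get pods -n kube-system --show-labels. Selecting the CoreDNS pods is more reliable than an ipBlock rule: the Kubernetes documentation intends ipBlock for cluster-external addresses, and whether a Service’s ClusterIP is matched before or after address translation depends on the network plugin.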

  • Why it works: Explicitly allows the DNS packets to reach the CoreDNS server.
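When a policy might be dropping DNS packets, a timeout is the telltale sign. This Python sketch sends one hand-built RFC 1035 query over UDP and reports whether any reply with the same ID comes back. It assumes a Python interpreter is available in the pod (many images don’t ship one) and an IPv4 nameserver; build_query and udp_dns_reachable are hypothetical helpers, and only UDP is probed, so TCP port 53 needs a separate check.

```python
import random
import socket
import struct


def build_query(name: str, qtype: int = 1) -> bytes:
    """Encode a minimal RFC 1035 query: one question, class IN, recursion desired."""
    # Header: random ID, flags 0x0100 (RD bit set), QDCOUNT=1, no other records.
    header = struct.pack(">HHHHHH", random.randint(0, 0xFFFF), 0x0100, 1, 0, 0, 0)
    qname = b""
    for label in name.rstrip(".").split("."):
        encoded = label.encode("ascii")
        if not 0 < len(encoded) <= 63:
            raise ValueError(f"invalid DNS label in {name!r}")
        qname += bytes([len(encoded)]) + encoded
    qname += b"\x00"  # root label terminates the name
    # QTYPE (1 = A) and QCLASS (1 = IN)
    return header + qname + struct.pack(">HH", qtype, 1)


def udp_dns_reachable(server: str, name: str, timeout: float = 2.0) -> bool:
    """True if a UDP reply carrying our query ID arrives within `timeout` (IPv4 only)."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # includes timeouts: no reply is what a dropped packet looks like
            return False
    return reply[:2] == query[:2]
```

Call it as udp_dns_reachable("<nameserver-ip>", "kubernetes.default.svc.cluster.local") with the nameserver from the pod’s /etc/resolv.conf. False from a pod the policy selects, and True from a pod it doesn’t, points at the policy rather than at CoreDNS.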

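5. Node-level DNS issues or incorrect /etc/resolv.conf on nodes. Pods with dnsPolicy: "Default" inherit the resolver configuration of the node they run on (the file kubelet is given with --resolv-conf, /etc/resolv.conf unless configured otherwise), so problems on the node show up directly in those pods. CoreDNS itself usually runs with dnsPolicy: "Default" too, and the common forward . /etc/resolv.conf setting sends external lookups to the node’s upstream servers, so a bad node configuration also slows external lookups made through the cluster DNS.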

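  • Diagnosis: Open a shell on the node where the pod is running, over SSH or with kubectl debug node/<node-name> -it --image=busybox (the node’s root filesystem is mounted at /host). Check the node’s /etc/resolv.conf, or whichever file kubelet’s --resolv-conf points at. Ensure it lists reachable DNS servers and that the ndots setting is appropriate for the node’s network environment.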

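  • Fix: Correct the node’s resolver configuration, or point kubelet’s --resolv-conf at a file that lists real upstream servers. A common case is a node running systemd-resolved, whose /etc/resolv.conf contains only the local stub 127.0.0.53; that address is unreachable from pods, so kubelet should be given /run/systemd/resolve/resolv.conf instead. This is less common in managed Kubernetes environments, where the provider configures node networking.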

  • Why it works: Ensures the underlying mechanism pods rely on for DNS resolution is sound.
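For a quick audit across nodes, this Python sketch flags nameserver entries that pods and CoreDNS can’t use: malformed addresses, and loopback addresses such as a systemd-resolved stub, which inside a pod’s network namespace point back at the pod itself. The function name problem_nameservers is hypothetical, and the check assumes ordinary (non-hostNetwork) pods, which don’t share the node’s loopback interface.

```python
import ipaddress
from typing import List


def problem_nameservers(resolv_conf_text: str) -> List[str]:
    """Return a description of each nameserver entry a pod could not use."""
    problems = []
    for raw in resolv_conf_text.splitlines():
        parts = raw.split()
        if len(parts) < 2 or parts[0] != "nameserver":
            continue  # not a nameserver line
        server = parts[1]
        try:
            addr = ipaddress.ip_address(server)
        except ValueError:
            problems.append(f"{server}: not a valid IP address")
            continue
        if addr.is_loopback:
            # In a pod's own network namespace this address is the pod itself.
            problems.append(f"{server}: loopback, unreachable from pod network namespaces")
    return problems
```

Run it against the file kubelet’s --resolv-conf points at on each node; an empty list means every nameserver entry is at least a routable address.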

After lowering ndots and trimming your search domains, a lookup for a name that truly doesn’t exist fails with a definitive NXDOMAIN after fewer round trips, and a name that does exist resolves quickly instead of timing out.
