The Kubernetes DNS resolver is failing to get answers from CoreDNS quickly enough, causing pods to time out when trying to reach other services.
It’s almost always because the DNS client inside your pods expands each name through the entire configured search domain list, sending a series of queries for names that don’t exist before it reaches one that does.
Here are the common culprits and how to fix them:
1. Pods are trying to resolve short, unqualified names that don’t exist in their local namespace.
This is the most frequent offender. Search-list expansion happens in the pod’s own stub resolver (the libc inside the container), not in CoreDNS. When a pod in `my-namespace` looks up `my-service` and no such service exists in that namespace, the resolver first asks CoreDNS for `my-service.my-namespace.svc.cluster.local`. If that fails, it tries `my-service.svc.cluster.local`, then `my-service.cluster.local`, and finally the bare name `my-service`. Each attempt is a separate round trip to CoreDNS, and each adds latency.
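- Diagnosis: Look at your application logs. You’ll see repeated "lookup" errors for short names. You can also exec into a pod and use `dig` or `nslookup` with a short name, then observe the query paths and timings. Note that `dig` ignores the search list unless you pass `+search`; adding `+showsearch` prints the intermediate attempts. For example:

      kubectl exec -it <pod-name> -- sh
      # inside the pod
      dig +search +showsearch my-nonexistent-service

  Watch the `SERVER:` and `QUERY:` lines in the output. If you see a run of queries for `.`-terminated (fully qualified) names that fail before the final answer, you’ve found the problem.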
- Fix: Lower the `ndots` setting in the `dnsConfig` for your pods. `ndots` is the number of dots a name must contain before the resolver tries it as a fully qualified name first; names with fewer dots are run through the search list first. The default in Kubernetes is 5. If your pod’s search path is `my-namespace.svc.cluster.local svc.cluster.local cluster.local` and you query `my-service`, the resolver tries, in order:

      my-service.my-namespace.svc.cluster.local
      my-service.svc.cluster.local
      my-service.cluster.local
      my-service   (as-is, only after every search domain has failed)

  With `ndots: 5`, the same expansion is applied to external names such as `api.example.com` (2 dots), so three cluster-suffixed lookups fail before the real name is tried. By setting `ndots: 2` or `ndots: 1` (if your cluster has a simple search path), you tell the resolver to try any name with at least that many dots as-is before it starts appending the search domains. In your pod spec:

      apiVersion: v1
      kind: Pod
      metadata:
        name: my-app
      spec:
        containers:
        - name: my-container
          image: my-image
        dnsPolicy: "None" # with None, resolv.conf is built only from dnsConfig
        dnsConfig:
          nameservers:
          - <your-coredns-service-ip> # e.g., 10.96.0.10
          searches:
          - my-namespace.svc.cluster.local
          - svc.cluster.local
          - cluster.local
          options:
          - name: ndots
            value: "2"

  If you only need to change `ndots`, you can keep the default `dnsPolicy: "ClusterFirst"` and set just `dnsConfig.options`; Kubernetes merges it into the generated `/etc/resolv.conf`. With this configuration, a short name such as `my-service` is still tried as `my-service.my-namespace.svc.cluster.local` first and then against the remaining search domains, while a name with two or more dots is queried as-is first and falls back to the search domains only if that lookup fails.
- Why it works: Names that already look fully qualified are queried as-is first instead of being pushed through every search domain. If the name exists, you get an answer in a single round trip, and the cascade of failed lookups is avoided. Short in-cluster names such as `my-service` still resolve through the search list, so `my-service.my-namespace.svc.cluster.local` is found on the first attempt when the service lives in the pod’s namespace.
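The lookup order discussed in this section can be sketched in a few lines of Python. This is a simplified model of glibc-style behaviour under stated assumptions, not the resolver’s actual code: `candidate_names` is a hypothetical helper, `api.example.com` is an illustrative external name, and other libc implementations may handle some cases differently.

```python
def candidate_names(name, search, ndots):
    """Return the names a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search expansion.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, try the name as-is last.
    return expanded + [absolute]


SEARCH = ["my-namespace.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for ndots in (5, 2):
    for name in ("my-service", "api.example.com"):
        first = candidate_names(name, SEARCH, ndots)[0]
        print(f"ndots={ndots} {name}: first query is {first}")
```

The resolver stops at the first name that resolves. With `ndots=5` the external name is tried against all three search domains before the as-is query; with `ndots=2` the as-is query comes first, while `my-service` is expanded through the search list in both cases.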
2. Overly long or complex search domain lists.
Kubernetes automatically injects a search domain list into pods based on their namespace. If your cluster is configured with many levels of hierarchy, or if custom search domains have been added, this list can become long. Each entry in the search list is tried sequentially if the preceding ones fail.
- Diagnosis: Run `kubectl exec -it <pod-name> -- cat /etc/resolv.conf`. Examine the `search` line and count the number of entries.
- Fix: Manually define `dnsConfig.searches` in your pod or deployment spec to be as minimal as possible, including only the necessary domains. For most workloads, `my-namespace.svc.cluster.local`, `svc.cluster.local`, and `cluster.local` are sufficient.

      dnsConfig:
        nameservers:
        - <your-coredns-service-ip>
        searches:
        - my-namespace.svc.cluster.local
        - svc.cluster.local
        - cluster.local
        options:
        - name: ndots
          value: "2"

  With the default `dnsPolicy: "ClusterFirst"`, Kubernetes generates `/etc/resolv.conf` for you and merges any `dnsConfig` entries into it. To replace the search list outright, you must set `dnsPolicy: "None"` and provide your own `dnsConfig`, including `nameservers`.
- Why it works: Reduces the number of DNS queries the resolver has to attempt before getting a definitive answer (or failure). Shorter search lists mean fewer round trips to CoreDNS.
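To check resolver settings without counting entries by eye, the contents of `/etc/resolv.conf` can be parsed with a short script. This is a minimal sketch, assuming the file uses the usual `search` and `options ndots:N` lines; `parse_resolv_conf` is a hypothetical helper and the sample contents are illustrative, not taken from a real cluster.

```python
def parse_resolv_conf(text):
    """Extract the search list and ndots value from resolv.conf contents."""
    search, ndots = [], 1  # resolv.conf(5) documents 1 as the ndots default
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "search":
            search = fields[1:]  # a later search line replaces an earlier one
        elif fields[0] == "options":
            for opt in fields[1:]:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return search, ndots


SAMPLE = """\
search my-namespace.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
"""

search, ndots = parse_resolv_conf(SAMPLE)
print(f"{len(search)} search domains, ndots={ndots}")
```

Inside a pod, the same function could be fed the real file with `parse_resolv_conf(open("/etc/resolv.conf").read())`.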
3. CoreDNS itself is overloaded or misconfigured.
This is a less common cause of latency specifically (it more often shows up as outright failures or timeouts), but a struggling CoreDNS can contribute. Possible causes include too many concurrent requests, inefficient upstream resolvers, or slow plugins.
- Diagnosis:
  - Check the CoreDNS pod logs with `kubectl logs <coredns-pod-name> -n kube-system -c coredns` and look for errors, long processing times, or specific plugin performance issues.
  - Monitor CoreDNS pod CPU and memory usage.
  - If CoreDNS forwards to external resolvers, check the latency of those external resolvers.
- Fix:
  - Scale CoreDNS: Increase the replica count of your CoreDNS deployment.

        kubectl scale deployment coredns --replicas=3 -n kube-system

  - Optimize CoreDNS configuration: Review your `Corefile`. For example, if you have `forward` directives pointing to slow external DNS servers, consider changing them or adding caching. A typical `Corefile` might look like this:

        .:53 {
            errors
            health {
                lameduck 5s
            }
            ready
            kubernetes cluster.local in-addr.arpa ip6.arpa {
                pods insecure
                fallthrough in-addr.arpa ip6.arpa
            }
            prometheus :9153
            cache 30
            loop
            reload
            # If forwarding to external DNS, ensure they are responsive.
            # Consider a local caching resolver if external ones are slow.
            # forward . 8.8.8.8 8.8.4.4
        }

    The `cache 30` directive caches DNS responses for up to 30 seconds, reducing the load on upstream resolvers and CoreDNS itself for repeat queries.
- Why it works: More CoreDNS replicas can handle more concurrent requests. Caching reduces the need to hit upstream resolvers for every query.
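One rough way to judge whether lookups are slow from a pod’s point of view is to time a handful of resolutions through the system resolver. The sketch below assumes Python is available in the container; `time_lookup` is a hypothetical helper, and the `localhost` default is only a placeholder to be replaced with a service name or external hostname you care about.

```python
import socket
import time


def time_lookup(name, resolve=socket.getaddrinfo, attempts=5):
    """Return the duration of each lookup attempt in milliseconds."""
    durations = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolve(name, None)
        except socket.gaierror:
            pass  # a failed lookup still costs time, so it is recorded too
        durations.append((time.perf_counter() - start) * 1000)
    return durations


NAMES = ["localhost"]  # placeholder: substitute the names your app resolves

for name in NAMES:
    samples = time_lookup(name)
    print(f"{name}: max {max(samples):.1f} ms over {len(samples)} attempts")
```

Comparing a fully qualified name with a trailing dot against the same name without one gives a quick sense of how much time search-list expansion adds, separately from how fast CoreDNS answers a single query.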
4. Network policy blocking DNS traffic or specific ports.
This is less common for internal cluster DNS, but possible if you have strict network policies in place.
- Diagnosis: Check the `NetworkPolicy` resources in your namespace. Ensure they allow egress traffic from your pods to the CoreDNS service IP on UDP and TCP port 53.
- Fix: Add or modify a `NetworkPolicy` to permit the necessary traffic.

      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-dns
        namespace: <your-app-namespace>
      spec:
        podSelector: {} # Apply to all pods in the namespace
        policyTypes:
        - Egress
        egress:
        - to:
          - ipBlock:
              cidr: <your-cluster-cidr> # e.g., 10.0.0.0/8 or specific CoreDNS service IP CIDR
          ports:
          - protocol: UDP
            port: 53
          - protocol: TCP
            port: 53

  You’ll need to know your cluster’s Pod CIDR and the IP of the CoreDNS service (e.g., `kubectl get svc -n kube-system kube-dns`; in some clusters the service is named `coredns`).
- Why it works: Explicitly allows the DNS packets to reach the CoreDNS server.
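Before applying a policy like this, it can help to confirm that the DNS address actually falls inside the CIDR you plan to allow. A minimal sketch using Python’s standard `ipaddress` module; `dns_ip_allowed` is a hypothetical helper and the addresses are the illustrative values used above, not values from a real cluster. Whether a policy is matched against the Service IP or the CoreDNS pod IPs can depend on the CNI plugin, so it may be worth checking both.

```python
import ipaddress


def dns_ip_allowed(dns_ip, allowed_cidrs):
    """Return True if dns_ip falls inside at least one of the allowed CIDRs."""
    addr = ipaddress.ip_address(dns_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Illustrative values: a CoreDNS service IP and two candidate ipBlock CIDRs.
print(dns_ip_allowed("10.96.0.10", ["10.0.0.0/8"]))
print(dns_ip_allowed("10.96.0.10", ["192.168.0.0/16"]))
```

The first call reports that the address is covered and the second that it is not, which would mean DNS traffic is still blocked under that policy.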
5. Node-level DNS issues or incorrect /etc/resolv.conf on nodes.
If pods are configured to use the node’s DNS resolver (`dnsPolicy: "Default"`, with the kubelet’s `resolvConf` setting pointing to the node’s resolver file), DNS problems on the node will surface inside the pods.
- Diagnosis: Open a shell on the node where the pod is running and check `/etc/resolv.conf` on the node. Ensure it points to valid DNS servers and that the `ndots` setting is appropriate for the node’s network environment.
- Fix: Correct the node’s `/etc/resolv.conf` or ensure the node’s DNS client is functioning correctly. This is less common in managed Kubernetes environments, where node networking is handled for you.
- Why it works: Ensures the underlying mechanism pods rely on for DNS resolution is sound.
After lowering `ndots` and, where needed, simplifying your search domains, lookups should either succeed quickly or fail quickly with a definitive "service not found" error if the service truly doesn’t exist.