CoreDNS can drop DNS requests under heavy load if its worker goroutines become saturated, leading to increased latency and failed lookups.
This often manifests as intermittent client-side timeouts or SERVFAIL responses. The root cause isn’t usually a single component failing, but rather a concurrency bottleneck within CoreDNS itself when processing a surge of queries.
Cause 1: Insufficient Worker Goroutines
CoreDNS handles each incoming DNS request in a goroutine. If queries arrive faster than the process can work through them, in-flight goroutines pile up, requests wait longer, and clients eventually time out or see dropped lookups.
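- Diagnosis: Monitor the `go_goroutines` gauge that CoreDNS exports through the `prometheus` plugin. A sustained high value that climbs with query rate indicates saturation. Also check the `coredns_dns_request_duration_seconds` histogram, the SERVFAIL share of `coredns_dns_responses_total`, and, if you use the `forward` plugin, `coredns_forward_max_concurrent_rejects_total`.
- Fix: Give CoreDNS more room for concurrent work. Raise the CPU limit or add replicas. If the `forward` plugin's `max_concurrent` limit is rejecting queries, raise it inside the `forward` block of your `.:53` or zone-specific server block, for example `max_concurrent 1000`. This lets `forward` keep up to 1000 queries in flight at once.
- Why it works: More capacity means more concurrent request processing, reducing the chance of queuing and dropping.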
Cause 2: Inefficient Plugin Chain
Some plugins are more computationally intensive than others. A long chain, or one that includes expensive plugins you do not need, slows the processing of each request and indirectly leads to goroutine saturation.
- Diagnosis: Use CoreDNS’s built-in profiling. With the `pprof` plugin enabled in your Corefile, run `curl -o cpu.pprof 'http://localhost:6053/debug/pprof/profile?seconds=30'` (6053 is the plugin’s default port; adjust it if you configured another address) and analyze the output with `go tool pprof` for the functions that consume the most CPU time. Also observe the `coredns_dns_request_duration_seconds` histogram.
- Fix: Disable unnecessary plugins and make sure `cache` is enabled. Note that the order of plugins in the Corefile does not change execution order; that is fixed at build time in `plugin.cfg`, where `cache` already runs ahead of `forward`. If the `kubernetes` plugin is only needed for cluster names, scope it to those zones and let `forward` handle everything else. A common setup is `cache 10` together with `forward . /etc/resolv.conf`.
- Why it works: Caching answers more requests directly without hitting downstream resolvers, and a shorter chain reduces the processing time per query.
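To turn the duration histogram into a single latency figure, you can estimate a quantile from its cumulative buckets. The sketch below is a minimal Python version of the linear interpolation that Prometheus-style histograms are usually read with; the function name and the example bucket boundaries are hypothetical, and the input is assumed to be the `le` upper bounds and cumulative counts taken from `coredns_dns_request_duration_seconds_bucket`.

```python
import math


def estimate_quantile(buckets, q):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, with the final bucket using float("inf").
    Returns None when the histogram is empty.
    """
    if not buckets:
        return None
    total = buckets[-1][1]
    if total <= 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: report the
                # highest finite boundary as a lower bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket <= 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

Comparing the estimated p99 before and after removing a plugin gives a rough sense of how much that plugin costs per query.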
Cause 3: Aggressive Health Checks
If health check probes (e.g., from a load balancer or Kubernetes readiness/liveness probes) are hitting CoreDNS too frequently and with complex queries, they can consume valuable worker resources.
- Diagnosis: Correlate spikes in request latency or SERVFAIL responses (`coredns_dns_request_duration_seconds`, `coredns_dns_responses_total`) with the timing of health check probes. Examine your load balancer or Kubernetes probe configurations.
- Fix: Increase the interval between health checks and/or simplify the queries used. For Kubernetes, point probes at the HTTP endpoints CoreDNS provides for this purpose (`/health` from the `health` plugin and `/ready` from the `ready` plugin) rather than at the DNS port, or at least use a very simple DNS query. For example, a probe could query a known internal service name instead of a broad `.` lookup.
- Why it works: Less frequent or simpler health checks reduce the load on CoreDNS, freeing up goroutines for actual client traffic.
Cause 4: UDP Buffer Exhaustion
CoreDNS relies on UDP for most DNS traffic. If the operating system’s UDP receive buffer becomes full, incoming packets will be dropped before CoreDNS can even process them.
- Diagnosis: On the CoreDNS host, check `netstat -su`. In the Udp section, look for rising `packet receive errors` and `receive buffer errors` counts. Also check the current limits with `sysctl net.core.rmem_max net.core.rmem_default`.
- Fix: Increase UDP buffer sizes. Add or modify these lines in `/etc/sysctl.conf` (or a file in `/etc/sysctl.d/`):

      net.core.rmem_max = 13107200
      net.core.rmem_default = 13107200
      net.ipv4.udp_mem = 40000000 80000000 120000000

  Then apply with `sysctl -p` (or `sysctl --system` if you used a file in `/etc/sysctl.d/`). Note that `net.ipv4.udp_mem` is measured in pages, not bytes, so scale its three values to the memory available on the host.
- Why it works: Larger buffers allow the OS to hold more incoming UDP packets, reducing the likelihood of drops due to temporary bursts.
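The counters that `netstat -su` prints come from `/proc/net/snmp` on Linux, so the same check can be scripted. The following Python sketch parses the `Udp:` header and value lines from that file's text; the function name is hypothetical, and the column set varies somewhat between kernel versions, which is why the code pairs names with values instead of relying on fixed positions.

```python
def udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp text as a dict.

    The file holds two lines per protocol: a header line with the column
    names and a line with the values, both prefixed with "Udp:".
    """
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        return {}
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return udp_counters(handle.read())
```

Calling `read_udp_counters()` twice, a few seconds apart, and comparing `RcvbufErrors` shows whether the receive buffer is overflowing right now, as opposed to at some point since boot.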
Cause 5: Overlapping Network Namespaces/IPs
In containerized environments, two listeners can collide on the same address and port. Separate network namespaces each have their own port space, so the conflict arises when instances share a namespace: for example, when host networking is used and multiple instances, or CoreDNS and another local resolver, try to bind to 0.0.0.0:53. An instance that cannot bind will fail to start or operate correctly, leading to a perceived drop in availability.
- Diagnosis: Use `ss -ulpn | grep ':53 '` on the host. You should see only one process (your CoreDNS instance) listening on UDP port 53. If you see multiple, or none, investigate. Check your deployment configuration for duplicate IP/port bindings.
- Fix: Ensure each CoreDNS instance has a unique IP address to bind to if not using host networking, or that if host networking is used, only one instance is configured to listen on `0.0.0.0:53`. Review your Kubernetes `Service` and `Pod` definitions.
- Why it works: Prevents port conflicts, ensuring that all intended CoreDNS instances are actually running and listening for traffic.
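If you want to automate the listener check, the output of `ss -ulpn` can be parsed instead of eyeballed. The Python sketch below is one way to do that; the function name is hypothetical, and it assumes the usual `ss` layout in which the local address ends in `:53` and process details appear as `users:(("name",pid=N,fd=M))` (those details are only shown when `ss` runs with sufficient privileges).

```python
import re

# Matches each ("name",pid=N pair inside the users:(...) column.
PROC_RE = re.compile(r'"([^"]+)",pid=(\d+)')


def port53_listeners(ss_output):
    """Return [(local_address, [(process_name, pid), ...]), ...] for port 53.

    Matching on a field that ends in ":53" avoids false hits such as
    port 5353 or a PID that happens to contain "53".
    """
    listeners = []
    for line in ss_output.splitlines():
        fields = line.split()
        local = next((f for f in fields if f.endswith(":53")), None)
        if local is None:
            continue
        procs = [(name, int(pid)) for name, pid in PROC_RE.findall(line)]
        listeners.append((local, procs))
    return listeners
```

More than one distinct process name in the result, or an empty result, is the signal to look at the deployment configuration.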
Cause 6: Aggressive Client-Side Retries
While not a CoreDNS internal cause, many clients are configured to retry DNS queries upon timeout. If CoreDNS is dropping requests, these retries can exacerbate the load, creating a feedback loop where the system appears even more overloaded than it is.
- Diagnosis: Observe the overall query rate and the drop rate. If the drop rate is high and sustained, and client logs show repeated attempts for the same records, this is likely happening.
- Fix: Address the underlying CoreDNS issues (Causes 1-5). Additionally, consider slightly increasing the client-side DNS timeout (e.g., from 5s to 10s) to give CoreDNS more breathing room, but treat this as a workaround, not a solution.
- Why it works: A longer timeout allows a request to live longer in the CoreDNS queue, increasing its chances of being processed before the client gives up.
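The feedback loop can be made concrete with a small model. The Python sketch below assumes, as a simplification, that every attempt is dropped independently with the same probability and that a client stops after a fixed number of attempts; both function names are hypothetical. Under those assumptions the expected number of attempts per lookup is the geometric sum 1 + p + p^2 + ... up to the attempt limit.

```python
def expected_attempts(drop_probability, max_attempts):
    """Expected queries sent per lookup when each attempt is dropped
    independently with probability `drop_probability` and the client
    gives up after `max_attempts` tries."""
    return sum(drop_probability ** k for k in range(max_attempts))


def offered_qps(base_qps, drop_probability, max_attempts):
    """Query rate CoreDNS actually sees once retries are included."""
    return base_qps * expected_attempts(drop_probability, max_attempts)
```

For example, with half of all attempts dropped and clients that try three times, `offered_qps(1000, 0.5, 3)` gives 1750 queries per second from a base load of 1000. In a real system the drop probability itself rises with load, so the amplification is typically worse than this fixed-probability model suggests.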
The next error you’ll likely encounter after fixing these issues is related to upstream resolver timeouts if your forward plugin is misconfigured or the upstream resolvers are slow.