CoreDNS Production Best Practices: Scaling and Reliability (2026)

CoreDNS can fail to answer DNS requests under heavy load when more queries are in flight than it can process in time, leading to increased latency and failed lookups.

This often manifests as intermittent client-side timeouts or SERVFAIL responses. The root cause isn’t usually a single component failing, but rather a concurrency bottleneck within CoreDNS itself when processing a surge of queries.

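Cause 1: Concurrency Saturation

CoreDNS handles each incoming DNS request in its own goroutine rather than in a fixed-size worker pool. During a surge the number of in-flight goroutines grows, they compete for CPU or block waiting on upstream resolvers, and clients time out before an answer arrives.

Diagnosis: Monitor the go_goroutines metric exposed by the prometheus plugin, together with coredns_dns_requests_total and the coredns_dns_request_duration_seconds histogram. A goroutine count and latency that keep climbing while the request rate stays flat indicate that requests are piling up. If the forward plugin has max_concurrent set, also check the coredns_forward_max_concurrent_rejects_total counter.
Fix: Give CoreDNS more capacity by raising the container CPU limit or adding replicas. If forward is rejecting queries, raise its limit: in your Corefile, within the .:53 or your specific zone block, set max_concurrent 1000 inside the forward block. This lets forward keep up to 1000 upstream queries in flight.
Why it works: More CPU and a higher concurrency ceiling allow more requests to be processed in parallel, reducing the chance that they time out or are rejected.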
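The diagnosis above can be sketched as a comparison of two metric scrapes. This is a minimal illustration, not CoreDNS tooling: it assumes the prometheus plugin exposes go_goroutines, coredns_dns_requests_total, and coredns_forward_max_concurrent_rejects_total, and the scrape excerpts and the helper names parse_metrics and summarize are made up for the example.

```python
def parse_metrics(text):
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A sample is "name{labels} value" or "name value" (no timestamp assumed).
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def summarize(before_text, after_text, interval_seconds):
    """Compare two scrapes taken interval_seconds apart."""
    before, after = parse_metrics(before_text), parse_metrics(after_text)
    requests = "coredns_dns_requests_total"
    rejects = "coredns_forward_max_concurrent_rejects_total"
    return {
        "goroutines": after["go_goroutines"],
        "qps": (after[requests] - before[requests]) / interval_seconds,
        "forward_rejects": after.get(rejects, 0.0) - before.get(rejects, 0.0),
    }


# Hypothetical scrape excerpts taken 30 seconds apart.
SCRAPE_T0 = """
# TYPE go_goroutines gauge
go_goroutines 120
coredns_dns_requests_total{server="dns://:53",type="A"} 1000
coredns_dns_requests_total{server="dns://:53",type="AAAA"} 500
coredns_forward_max_concurrent_rejects_total 0
"""

SCRAPE_T1 = """
# TYPE go_goroutines gauge
go_goroutines 4800
coredns_dns_requests_total{server="dns://:53",type="A"} 4000
coredns_dns_requests_total{server="dns://:53",type="AAAA"} 2000
coredns_forward_max_concurrent_rejects_total 30
"""

if __name__ == "__main__":
    print(summarize(SCRAPE_T0, SCRAPE_T1, 30))
```

A goroutine count in the thousands at a modest query rate, as in the second sample, would suggest requests are waiting rather than completing.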

Cause 2: Inefficient Plugin Chain

Some plugins are more computationally intensive than others. A long chain, or one that runs expensive plugins on every query, slows down the processing of each request and indirectly leads to the saturation described in Cause 1.

Diagnosis: Use Go's built-in profiling through the pprof plugin. With pprof enabled in your Corefile (it listens on localhost:6053 by default), run curl -o cpu.pprof 'http://localhost:6053/debug/pprof/profile?seconds=30' and inspect the result with go tool pprof cpu.pprof to find the functions that consume the most CPU time. Also, observe the coredns_dns_request_duration_seconds histogram.
Fix: Disable unnecessary plugins. Note that the order of plugins in the Corefile does not change the order in which they run; that order is fixed at build time by plugin.cfg, and in the stock build cache already runs ahead of kubernetes and forward. So make sure cache is enabled rather than trying to move it. If kubernetes is only needed for cluster names, scope it to those zones (for example kubernetes cluster.local in-addr.arpa ip6.arpa) and let everything else go to forward. A common minimal setup is cache 10 (cache answers for at most 10 seconds) together with forward . /etc/resolv.conf.
Why it works: Cache hits are answered directly without reaching the kubernetes backend or upstream resolvers, and fewer enabled plugins means less work per query.
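To judge whether caching is actually absorbing load, two numbers help: the cache hit ratio and the mean request latency. The sketch below computes both from raw counter values. It assumes you read them from counters such as coredns_cache_hits_total, coredns_cache_requests_total, and the _sum and _count series of coredns_dns_request_duration_seconds; exact metric names can differ between CoreDNS versions, and the function names are illustrative.

```python
def cache_hit_ratio(cache_hits, cache_requests):
    """Fraction of cache lookups answered from cache (0.0 when idle)."""
    return cache_hits / cache_requests if cache_requests else 0.0


def mean_latency_ms(duration_sum_seconds, request_count):
    """Mean request latency in milliseconds from a histogram's _sum and _count."""
    if not request_count:
        return 0.0
    return 1000.0 * duration_sum_seconds / request_count


if __name__ == "__main__":
    # Hypothetical counter deltas over one scrape interval.
    print(cache_hit_ratio(900, 1000))
    print(mean_latency_ms(12.5, 5000))
```

A low hit ratio combined with high mean latency points at the backend or upstream path rather than at the cache itself.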

Cause 3: Aggressive Health Checks

If health check probes (e.g., from a load balancer or Kubernetes readiness/liveness probes) are hitting CoreDNS too frequently and with complex queries, they can consume valuable worker resources.

Diagnosis: Correlate spikes in latency (coredns_dns_request_duration_seconds) or client-side timeouts with the timing of health check probes. Examine your load balancer or Kubernetes probe configurations.
Fix: Increase the interval between health checks and/or simplify the queries used. For Kubernetes, point liveness and readiness probes at the HTTP endpoints of the health and ready plugins (by default :8080/health and :8181/ready) rather than at DNS itself. If an external load balancer can only probe with DNS, have it query a single known internal service name, which is normally served from cache, instead of a broad . lookup.
Why it works: Less frequent or simpler health checks reduce the load on CoreDNS, freeing up goroutines for actual client traffic.
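A quick back-of-the-envelope calculation shows how much of the query load comes from probes. The sketch below is plain arithmetic; the checker counts, intervals, and function names are hypothetical.

```python
def probe_qps(checkers, interval_seconds, queries_per_check=1):
    """DNS queries per second generated by health checkers."""
    return checkers * queries_per_check / interval_seconds


def probe_share(probe_rate, client_rate):
    """Fraction of total query load that comes from probes."""
    total = probe_rate + client_rate
    return probe_rate / total if total else 0.0


if __name__ == "__main__":
    aggressive = probe_qps(checkers=200, interval_seconds=1, queries_per_check=2)
    relaxed = probe_qps(checkers=200, interval_seconds=10, queries_per_check=2)
    print(aggressive, relaxed)
    print(probe_share(aggressive, client_rate=1600))
```

In this made-up case, moving from a 1-second to a 10-second interval cuts probe traffic by a factor of ten.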

Cause 4: UDP Buffer Exhaustion

CoreDNS relies on UDP for most DNS traffic. If the operating system’s UDP receive buffer becomes full, incoming packets will be dropped before CoreDNS can even process them.

Diagnosis: On the CoreDNS host, run netstat -su and look in the Udp section for a growing "receive buffer errors" count (the same counter appears as RcvbufErrors in /proc/net/snmp). Also check the current limits with sysctl net.core.rmem_max net.core.rmem_default.

Fix: Increase UDP buffer sizes. Add or modify these lines in /etc/sysctl.conf (or a file in /etc/sysctl.d/):

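net.core.rmem_max = 13107200
net.core.rmem_default = 13107200
# net.ipv4.udp_mem is counted in memory pages, not bytes; the kernel default is usually sufficient.

Then apply with sysctl -p (for /etc/sysctl.conf) or sysctl --system (which also loads /etc/sysctl.d/), and restart CoreDNS so that its sockets are created with the new default.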

Why it works: Larger buffers allow the OS to hold more incoming UDP packets, reducing the likelihood of drops due to temporary bursts.
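The drop counter can also be read programmatically. The sketch below parses the Udp rows of /proc/net/snmp (Linux only) and reports how many datagrams were dropped for lack of receive buffer space between two samples. The sample text and the function names are illustrative.

```python
def udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp as a dict of ints."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("expected a Udp header row and a Udp value row")
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, (int(v) for v in values)))


def new_buffer_drops(before_text, after_text):
    """Datagrams dropped due to a full receive buffer between two samples."""
    before = udp_counters(before_text)["RcvbufErrors"]
    after = udp_counters(after_text)["RcvbufErrors"]
    return after - before


# Hypothetical excerpts of /proc/net/snmp taken some time apart.
SAMPLE_T0 = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 100000 12 5 99000 5 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
UdpLite: 0 0 0 0 0 0
"""

SAMPLE_T1 = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 250000 12 1805 240000 1805 0
"""

if __name__ == "__main__":
    print(new_buffer_drops(SAMPLE_T0, SAMPLE_T1))
    # On a Linux host, read the live file instead of the samples:
    # print(udp_counters(open("/proc/net/snmp").read())["RcvbufErrors"])
```

A counter that keeps rising between samples means packets are being discarded before CoreDNS ever sees them.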

Cause 5: Overlapping Network Namespaces/IPs

In containerized environments each pod normally has its own network namespace, so CoreDNS instances in separate pods cannot collide on port 53. Conflicts arise when processes share a namespace, most commonly with host networking: if CoreDNS and another DNS server on the node (for example systemd-resolved, dnsmasq, or a node-local cache) both try to bind the same address on port 53, only the first succeeds, and the other fails to start or operate correctly, leading to a perceived drop in availability. A second CoreDNS instance may bind successfully where SO_REUSEPORT is in effect, in which case the two instances silently share traffic.

Diagnosis: Use ss -ulpn 'sport = :53' on the host. You should see only the process you expect (your CoreDNS instance) listening on UDP port 53. If you see several different processes, or none, investigate. Check your deployment configuration for duplicate IP/port bindings and the CoreDNS logs for "address already in use" errors.
Fix: Ensure that only one process in each network namespace binds a given address and port. With host networking, either run a single DNS server on 0.0.0.0:53 or restrict CoreDNS to a specific address with the bind plugin. Review your Kubernetes Service and Pod definitions.
Why it works: Prevents port conflicts, ensuring that all intended CoreDNS instances are actually running and listening for traffic.
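The underlying failure is easy to reproduce outside CoreDNS. The sketch below binds two plain UDP sockets (without SO_REUSEADDR or SO_REUSEPORT) to the same address and port and reports whether the second bind is refused. It uses an ephemeral port so it needs no privileges; the function name is illustrative, and a real DNS server would be competing for port 53.

```python
import errno
import socket


def second_bind_fails(host="127.0.0.1"):
    """Return True if a second UDP bind to the same address/port is refused."""
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        first.bind((host, 0))  # port 0: the kernel picks a free port
        port = first.getsockname()[1]
        try:
            second.bind((host, port))
        except OSError as exc:
            # The same error a second DNS server hits on an occupied port 53.
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print(second_bind_fails())
```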

Cause 6: Aggressive Client-Side Retries

While not a CoreDNS internal cause, many clients are configured to retry DNS queries upon timeout. If CoreDNS is dropping requests, these retries can exacerbate the load, creating a feedback loop where the system appears even more overloaded than it is.

Diagnosis: Observe the overall query rate and the drop rate. If the drop rate is high and sustained, and client logs show repeated attempts for the same records, this is likely happening.
Fix: Address the underlying CoreDNS issues (Causes 1-5). Additionally, consider slightly increasing the client-side DNS timeout (e.g., from 5s to 10s; for glibc-based clients this is options timeout:10 in /etc/resolv.conf) to give CoreDNS more breathing room, but this is a workaround, not a solution.
Why it works: A longer timeout gives a slow answer more time to arrive before the client retransmits, so each lookup generates fewer duplicate queries.
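The feedback loop can be estimated with a simple model. The sketch below assumes every attempt is dropped independently with the same probability and that clients retry a fixed number of times. That is a deliberate simplification (in practice the drop probability rises with load), and the function names are illustrative.

```python
def expected_attempts(drop_probability, max_retries):
    """Expected queries sent per lookup: 1 + p + p^2 + ... + p^max_retries."""
    return sum(drop_probability ** k for k in range(max_retries + 1))


def offered_qps(lookups_per_second, drop_probability, max_retries):
    """Query rate CoreDNS actually sees once retries are included."""
    return lookups_per_second * expected_attempts(drop_probability, max_retries)


if __name__ == "__main__":
    # 1000 lookups/s, half of all attempts dropped, up to 2 retries each.
    print(offered_qps(1000, 0.5, 2))
    # With no drops, retries add nothing.
    print(offered_qps(1000, 0.0, 2))
```

Even in this simplified model, a 50% drop rate with two retries nearly doubles the load that CoreDNS has to absorb.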

The next error you’ll likely encounter after fixing these issues is related to upstream resolver timeouts if your forward plugin is misconfigured or the upstream resolvers are slow.