CoreDNS can handle massive query volumes, but its default configuration is often tuned for general use, not peak performance.

Let’s watch CoreDNS handle a flood of DNS requests. Imagine a surge of traffic hitting your cluster, and every pod needs to resolve a domain name simultaneously.

# Send 1000 DNS queries for a single domain, with up to 100 in flight at once
echo "example.com A" > queries.txt
dnsperf -s 127.0.0.1 -p 8053 -d queries.txt -n 1000 -q 100

dnsperf is a dedicated DNS load generator. Here it replays the single query in queries.txt 1000 times against a CoreDNS instance listening on 127.0.0.1:8053 (assuming you’ve configured it to listen there for testing), keeping at most 100 queries outstanding. The output reports latency and lost queries under load. If latency spikes or a noticeable share of queries goes unanswered, it’s time to tune.
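If a DNS benchmarking tool isn’t installed, a rough equivalent can be sketched with Python’s standard library. The sketch below hand-encodes a minimal DNS query and sends it over UDP from a thread pool, recording per-query latency. The function names (build_query, timed_lookup, run_load) are made up for this example, and the server address simply reuses the 127.0.0.1:8053 test listener assumed above. It doesn’t parse responses, so treat the numbers as a coarse signal rather than a benchmark.

```python
import socket
import struct
import time
from concurrent.futures import ThreadPoolExecutor


def build_query(name, query_id=0x1234, qtype=1):
    """Encode a minimal DNS query in wire format (qtype 1 = A record)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is length-prefixed; a zero byte terminates the name.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN


def timed_lookup(server, packet, timeout=2.0):
    """Send one UDP query; return latency in seconds, or None on failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(packet, server)
            sock.recvfrom(4096)  # any reply counts; the answer is not parsed
        except OSError:  # covers timeouts and ICMP "port unreachable"
            return None
        return time.perf_counter() - start


def run_load(server=("127.0.0.1", 8053), total=1000, concurrency=100):
    """Issue `total` lookups with up to `concurrency` in flight at once."""
    packet = build_query("example.com")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(
            pool.map(lambda _: timed_lookup(server, packet), range(total))
        )
    latencies = sorted(r for r in results if r is not None)
    return {
        "sent": total,
        "failed": total - len(latencies),
        "median_s": latencies[len(latencies) // 2] if latencies else None,
        "max_s": latencies[-1] if latencies else None,
    }
```

Calling run_load() against a running test instance returns a small summary dictionary. Because every query asks for the same name, most answers will come from the cache once it is warm, so this mainly exercises the request path rather than upstream resolution.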

The core problem CoreDNS solves is acting as a fast, flexible, and extensible DNS server within your infrastructure. It replaces kube-dns in Kubernetes, offering better performance and a plugin-based architecture. Each plugin handles a specific DNS query type or function (e.g., kubernetes for cluster-internal lookups, forward for external resolution, cache for speeding up repeated queries).

Internally, CoreDNS uses goroutines for concurrency. Each incoming request is handled by a separate goroutine. The efficiency of these goroutines and the plugins they invoke determines how many queries per second (QPS) CoreDNS can sustain. The Corefile, CoreDNS’s configuration, dictates which plugins are enabled for each server block and how they are configured. The order in which plugins run is fixed when the binary is built (in plugin.cfg), not by their order in the Corefile.

Here’s a typical Corefile snippet for a Kubernetes cluster:

.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf {
       max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}

The kubernetes plugin handles internal service discovery. prometheus exposes metrics. forward sends requests to upstream resolvers (like your cloud provider’s DNS). cache stores results.

The key to high-volume performance lies in optimizing these plugins and the underlying server configuration.

Tuning cache: The cache plugin is your first line of defense against high load. A larger capacity and a higher TTL (Time To Live) cap mean more responses are served directly from memory, without consulting upstream resolvers or the plugins that run after cache.

  • Diagnosis: Monitor cache hit rates using Prometheus metrics. Compare coredns_cache_hits_total with coredns_cache_misses_total. A low ratio of hits to total lookups indicates the cache isn’t effective.
  • Fix: Raise the TTL cap in your Corefile. If your working set is larger than the default capacity (9984 entries each for positive and negative answers), raise that too with the success and denial options.
    cache 300 # Cache entries for up to 5 minutes (300 seconds)
    
  • Why it works: The number after cache is a cap, not an override. Records with a longer TTL are cached for at most that long, and records with a shorter TTL still expire on their own schedule. Raising the cap from 30 to 300 seconds lets long-lived records be answered from memory for ten times longer, which reduces the number of lookups CoreDNS has to perform. The trade-off is that changes to those records take longer to become visible.
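The hit rate described in the diagnosis step can be computed straight from the metrics endpoint. The following is a minimal sketch that sums the two counters across all label sets in Prometheus text output and returns hits divided by total lookups. The helper names and the SAMPLE text (including its label values and numbers) are invented for illustration, and the parser assumes sample lines carry no trailing timestamp.

```python
def sum_metric(metrics_text, name):
    """Sum every sample of `name` across all label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Sample lines look like `name{labels} value` or `name value`.
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            total += float(value)
    return total


def cache_hit_ratio(metrics_text):
    """Return hits / (hits + misses), or None if there were no lookups."""
    hits = sum_metric(metrics_text, "coredns_cache_hits_total")
    misses = sum_metric(metrics_text, "coredns_cache_misses_total")
    lookups = hits + misses
    return hits / lookups if lookups else None


# Made-up sample in the Prometheus text format, for illustration only.
SAMPLE = """\
# TYPE coredns_cache_hits_total counter
coredns_cache_hits_total{server="dns://:53",type="success"} 900
coredns_cache_hits_total{server="dns://:53",type="denial"} 50
# TYPE coredns_cache_misses_total counter
coredns_cache_misses_total{server="dns://:53"} 50
"""

print(cache_hit_ratio(SAMPLE))
```

These counters are cumulative since the process started, so in practice it is more informative to compare two scrapes taken a few minutes apart (or use a rate in Prometheus) than to read a single snapshot.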

Tuning forward: The forward plugin is crucial for external lookups. If your upstream resolvers are slow or overloaded, CoreDNS will also appear slow. Raising the limit on concurrent forwarded queries can help when that limit, rather than the upstream itself, is the constraint.

  • Diagnosis: Observe coredns_forward_requests_total and coredns_forward_responses_total in Prometheus, along with the coredns_forward_request_duration_seconds histogram for upstream latency. A rising coredns_forward_max_concurrent_rejects_total means queries are being refused because the max_concurrent limit was reached.
  • Fix: Increase max_concurrent in the forward plugin.
    forward . 8.8.8.8 8.8.4.4 {
       max_concurrent 10000 # Allow up to 10000 concurrent forwarded queries
    }
    
  • Why it works: max_concurrent caps the number of queries this forward block has in flight at once, across all of its upstreams. Queries beyond the cap are answered with REFUSED rather than queued, so a cap that is too low shows up as failures even when the upstream resolvers have capacity to spare. Each in-flight query costs roughly 2 KB of memory, so size the cap from your expected query rate and upstream latency rather than picking the largest number available.
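A value for max_concurrent can be estimated rather than guessed. The CoreDNS forward documentation suggests choosing a number at least as large as the expected upstream query rate multiplied by upstream latency, and notes that each concurrent query uses roughly 2 KB of memory. The sketch below turns that guidance into arithmetic. The function names, the headroom factor, and the example traffic figures are assumptions made for illustration.

```python
import math


def min_max_concurrent(upstream_qps, upstream_latency_ms, headroom=2.0):
    """Estimate max_concurrent as in-flight queries (rate x latency) x headroom."""
    in_flight = upstream_qps * upstream_latency_ms / 1000
    return math.ceil(in_flight * headroom)


def approx_memory_bytes(max_concurrent, bytes_per_query=2048):
    """Rough upper bound on memory held by in-flight forwarded queries."""
    return max_concurrent * bytes_per_query


# Hypothetical workload: 5000 cache misses per second, 50 ms upstream latency.
needed = min_max_concurrent(5000, 50)
print(needed)                      # 250 in flight on average, doubled for bursts
print(approx_memory_bytes(10000))  # memory ceiling for max_concurrent 10000
```

Under these assumed numbers the estimate is 500, well below 10000, and even the higher setting costs only about 20 MB. If the estimate is already far below your current limit, the bottleneck is more likely upstream latency than max_concurrent.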

Server Workers and Goroutines: CoreDNS itself uses goroutines for handling requests, and the Go runtime schedules them onto operating system threads. You don’t set the number of goroutines, but you can control how many threads run Go code at the same time.

  • Diagnosis: If you see high CPU usage on the CoreDNS pods together with latency spikes or CPU throttling, the runtime may be running more threads in parallel than the pod’s CPU limit allows. Depending on the Go version CoreDNS was built with, GOMAXPROCS can default to the node’s core count rather than the pod’s limit.
  • Fix: Set the GOMAXPROCS environment variable for your CoreDNS deployment. A common starting point is to set it to the number of CPU cores available to the pod.
    env:
      - name: GOMAXPROCS
        value: "4" # Set to the number of CPU cores allocated to the pod
    
  • Why it works: GOMAXPROCS tells the Go runtime the maximum number of operating system threads that can execute Go code simultaneously. Matching it to the CPU actually available to the pod lets the scheduler use all of that CPU without oversubscribing it and being throttled.
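Kubernetes CPU limits are often fractional (for example 2500m), while GOMAXPROCS must be a whole number. One way to derive the value is sketched below: convert the limit to cores, round down, and never go below 1. Rounding down is a choice made for this example, not a rule from CoreDNS or Go, and the function names are hypothetical.

```python
from decimal import Decimal


def cpu_limit_to_cores(limit):
    """Convert a Kubernetes CPU quantity ("4", "2500m", "0.5") to cores."""
    text = str(limit).strip()
    if text.endswith("m"):  # millicores: 1000m == 1 core
        return Decimal(text[:-1]) / 1000
    return Decimal(text)


def suggested_gomaxprocs(limit):
    """Round the CPU limit down to whole cores, with a floor of 1."""
    return max(1, int(cpu_limit_to_cores(limit)))


for limit in ("4", "2500m", "500m"):
    print(limit, "->", suggested_gomaxprocs(limit))
```

The result goes into the env entry as a quoted string, since Kubernetes environment variable values must be strings.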

Connection Limits and Buffers: Network buffers and connection handling can become a bottleneck under extreme load.

  • Diagnosis: Network-related errors, such as dropped connections or timeouts during high QPS, can point to insufficient buffer sizes or limits on concurrent connections.
  • Fix: You can raise the kernel’s connection backlog limits and the file descriptor limit for the CoreDNS pods. This is usually done through Kubernetes node configuration or pod-level sysctl settings. For example, increase net.core.somaxconn and net.ipv4.tcp_max_syn_backlog on the nodes hosting CoreDNS.
    # On the node:
    sysctl -w net.core.somaxconn=4096
    sysctl -w net.ipv4.tcp_max_syn_backlog=2048
    
    And ensure your CoreDNS deployment has a high ulimit -n (open file descriptors).
  • Why it works: Larger backlogs let the kernel hold more pending TCP connections during bursts instead of rejecting them, and a higher descriptor limit keeps CoreDNS from running out of sockets. These two sysctls affect DNS over TCP only. Most DNS traffic is UDP, where the relevant kernel limits are the socket receive buffer sizes (net.core.rmem_default and net.core.rmem_max).
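Before changing node settings, it helps to see what is currently in effect. The sketch below reads sysctl values from /proc/sys (where net.core.somaxconn maps to /proc/sys/net/core/somaxconn) and reports any that are below a target. The TARGETS values are just the ones from the example above, and the helper names are hypothetical. It needs to run on the node, or in a pod that shares the relevant namespace.

```python
from pathlib import Path

# Targets taken from the sysctl example above; adjust to your own policy.
TARGETS = {
    "net.core.somaxconn": 4096,
    "net.ipv4.tcp_max_syn_backlog": 2048,
}


def read_sysctl(name, root="/proc/sys"):
    """Return the integer value of a sysctl, or None if it can't be read."""
    path = Path(root, *name.split("."))
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def below_target(targets, root="/proc/sys"):
    """Return {name: (current, wanted)} for settings under their target."""
    report = {}
    for name, wanted in targets.items():
        current = read_sysctl(name, root)
        if current is None or current < wanted:
            report[name] = (current, wanted)
    return report


for name, (current, wanted) in below_target(TARGETS).items():
    print(f"{name}: current={current}, wanted>={wanted}")
```

An empty report means both settings already meet the targets. A current value of None means the file could not be read, which is expected on non-Linux systems.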

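Plugin Order and Specificity: The order of plugins in your Corefile does not change the order in which they run. CoreDNS executes plugins in the order fixed at build time by plugin.cfg, and in the standard build cache runs before kubernetes, which runs before forward. What you can control is which zones each plugin claims and what falls through to the next one.

  • Diagnosis: If you have many forward stanzas or complex rewrite rules, and internal lookups are still slow, check the zones each plugin is scoped to and its fallthrough settings rather than the line order.
  • Fix: Scope the kubernetes plugin to your cluster zones and limit fallthrough to the zones that need it. If you use rewrite rules, keep them few and specific. Listing plugins in execution order is still worthwhile, because the file then reads the way it runs.
    .:53 {
        # ... other plugins
        kubernetes cluster.local ... # Answers cluster-internal names
        forward . /etc/resolv.conf { ... } # Handles everything else
        # ...
    }
    
  • Why it works: Each query walks the plugin chain until a plugin answers it. Names in cluster.local are answered by cache or kubernetes and never reach forward, so the cost of an internal lookup depends on how precisely those plugins are scoped, not on where they appear in the file.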

Health Checks and Lameduck: While not directly performance tuning, configuring health checks correctly prevents CoreDNS from dropping queries during restarts and from being removed from service during load spikes, which can exacerbate issues.

  • Diagnosis: If queries fail while CoreDNS pods are being restarted or scaled down, the lameduck period may be too short. If instances are flapping in and out of service during high load, the liveness and readiness probes in the Deployment may be too aggressive.
  • Fix: Adjust the health plugin’s lameduck period, and relax probe timeouts or failure thresholds if probes fail under load.
    health {
       lameduck 15s # Report unhealthy for 15 seconds before shutting down
    }
    
  • Why it works: During the lameduck period CoreDNS reports itself unhealthy but keeps answering queries, and only then shuts down. This gives Kubernetes time to stop routing traffic to the pod before it exits, which prevents dropped requests during rollouts. Lameduck does not change how probes behave under load; that is governed by the probe settings.

Once these optimizations are in place, you should see lower latency and higher sustainable QPS; rerun the load test to confirm the effect of each change. The next challenge will often be managing the DNS resolution of external services that themselves have high latency or are rate-limiting.
