CoreDNS can simulate DNS failures using its chaos plugin, which allows you to inject errors into DNS responses to test how your applications handle them.
Here’s how CoreDNS’s chaos plugin works in practice. Imagine a simple DNS lookup for www.example.com.
    dig @127.0.0.1 www.example.com
If CoreDNS is configured to use the chaos plugin, you can alter the response. Let’s say you want to simulate a SERVFAIL (Server Failure) error. Your CoreDNS Corefile might look like this:
    .:53 {
        chaos
        forward . 8.8.8.8
    }
Now, when dig is run against this CoreDNS instance, instead of a valid IP address, it will receive a SERVFAIL.
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 12345
    ;; QUESTION SECTION:
    ;www.example.com. IN A
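The `status: SERVFAIL` field that dig prints comes from the RCODE bits of the DNS message header. As a rough illustration, the sketch below reads that field out of a raw header using the layout defined in RFC 1035. The function name `rcode_of` and the hand-built sample header are assumptions made for this example; they are not part of CoreDNS or dig.

```python
import struct

# RCODE values from RFC 1035, section 4.1.1 (only the common ones are listed).
RCODE_NAMES = {
    0: "NOERROR",
    1: "FORMERR",
    2: "SERVFAIL",
    3: "NXDOMAIN",
    4: "NOTIMP",
    5: "REFUSED",
}


def rcode_of(message: bytes) -> str:
    """Return the response code name carried in a raw DNS message header."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    _msg_id, flags = struct.unpack("!HH", message[:4])
    code = flags & 0x000F  # RCODE is the low 4 bits of the flags word
    return RCODE_NAMES.get(code, f"RCODE{code}")


# Hypothetical header built by hand: id=12345, QR=1, RD=1, RA=1, RCODE=2.
sample_header = struct.pack("!HHHHHH", 12345, 0x8182, 1, 0, 0, 0)
print(rcode_of(sample_header))  # prints SERVFAIL
```

A test harness could apply the same check to the bytes returned by a UDP query to confirm that the injected failure actually reached the client.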
The chaos plugin intercepts DNS queries. Based on predefined rules or specific query patterns, it can return various error codes or malformed responses. This isn’t about actually breaking DNS; it’s about pretending to break it for testing purposes.
The primary problem this solves is validating the resilience of your applications and infrastructure. How does your Kubernetes cluster handle a DNS server that suddenly can’t resolve names? Does your ingress controller retry? Does your application gracefully degrade or error out? The chaos plugin lets you answer these questions before a real-world outage occurs.
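To make the "does it retry, does it degrade gracefully" questions concrete, here is a minimal client-side sketch of the behavior such a test would exercise. It assumes the application resolves names through a replaceable function, so a test can swap in a failing resolver. `resolve_with_retry`, its parameters, and the fake resolver are hypothetical names for illustration, not an API of any particular application.

```python
import socket
import time


def resolve_with_retry(host, resolver=socket.gethostbyname, attempts=3,
                       backoff=0.2, fallback=None, sleep=time.sleep):
    """Resolve host, retrying on failure and then falling back if possible."""
    for attempt in range(attempts):
        try:
            return resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            if attempt < attempts - 1:
                sleep(backoff * (2 ** attempt))  # exponential backoff
    if fallback is not None:
        return fallback  # degrade gracefully, e.g. to a last known address
    raise LookupError(f"could not resolve {host} after {attempts} attempts")


# Simulated outage: the first two lookups fail, the third succeeds.
calls = []

def flaky_resolver(host):
    calls.append(host)
    if len(calls) < 3:
        raise socket.gaierror("simulated DNS failure")
    return "192.0.2.10"

print(resolve_with_retry("www.example.com", resolver=flaky_resolver,
                         sleep=lambda seconds: None))  # prints 192.0.2.10
```

Pointing the real resolver at a DNS server that injects failures exercises the same code path end to end, without the fake.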
Internally, the chaos plugin is a middleware in the CoreDNS request pipeline. When a query arrives, chaos gets a chance to inspect it. If the query matches a configured chaos rule (e.g., a specific domain name or query type), it can short-circuit the normal resolution process. Instead of passing the query to the next plugin (like forward or kubernetes), it crafts a specific DNS response.
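The short-circuit behavior described above can be modeled in a few lines. The sketch below is a conceptual model only: CoreDNS itself is written in Go, and the `Query`, `Response`, `fault_injector`, `forwarder`, and `serve` names are invented for this illustration rather than taken from its code base.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Query:
    name: str
    qtype: str = "A"


@dataclass
class Response:
    rcode: str
    answered_by: str


# A handler returns a Response to short-circuit, or None to pass the query on.
Handler = Callable[[Query], Optional[Response]]


def fault_injector(query: Query) -> Optional[Response]:
    """Answer matching queries directly with an error; ignore everything else."""
    if query.name == "nx.example.com.":
        return Response("NXDOMAIN", "fault_injector")
    return None


def forwarder(query: Query) -> Optional[Response]:
    """Stand-in for an upstream lookup; always succeeds in this model."""
    return Response("NOERROR", "forwarder")


def serve(chain: List[Handler], query: Query) -> Response:
    """Walk the chain in order and return the first response produced."""
    for handler in chain:
        response = handler(query)
        if response is not None:
            return response
    return Response("SERVFAIL", "end_of_chain")


chain = [fault_injector, forwarder]
print(serve(chain, Query("nx.example.com.")).answered_by)   # fault_injector
print(serve(chain, Query("www.example.com.")).answered_by)  # forwarder
```

The point of the model is that the forwarder never sees a query the earlier handler has already answered.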
Here are the levers you control:
- chaos directive: This is the basic plugin activation in your Corefile.
- chaos { ... } block: This allows for more granular configuration.
- rcode <RCODE>: Forces a specific DNS response code (e.g., NXDOMAIN, SERVFAIL, REFUSED).
- error <RCODE>: An alias for rcode.
- clientip <IP>: Changes the source IP address of the response to simulate a specific client. This is less about failure and more about response origin.
- delay <duration>: Introduces a latency to the response. This simulates network slowness or an overloaded server. For example, delay 500ms.
- name <pattern>: Applies chaos rules only to queries matching a specific domain name pattern. You can use wildcards. For example, name *.example.com.
- type <QTYPE>: Applies chaos rules only to queries of a specific DNS record type. For example, type A.
- log: Enables logging of chaos-induced responses.
Consider a scenario where you want to test how your application handles repeated NXDOMAIN responses for a specific subdomain. Your Corefile would look like this:
    .:53 {
        chaos {
            name nx.example.com
            rcode NXDOMAIN
            log
        }
        forward . 8.8.8.8
    }
Now, any query for nx.example.com will immediately return NXDOMAIN without hitting forward.
The most surprising mechanical detail is how chaos interacts with other plugins. CoreDNS runs plugins in an order fixed at build time (in its plugin.cfg file), not in the order the directives appear in your Corefile. If chaos runs ahead of the kubernetes plugin in that chain, it can prevent CoreDNS from even attempting to resolve internal Kubernetes service names when they match a chaos rule. This means you can simulate failures for internal DNS lookups as well, not just external ones.
    .:53 {
        chaos {
            name *.svc.cluster.local
            rcode SERVFAIL
            log
        }
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . 8.8.8.8
    }
With this configuration, any query for a service within your Kubernetes cluster (e.g., my-service.my-namespace.svc.cluster.local) will receive a SERVFAIL response.
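To check which names a wildcard such as `*.svc.cluster.local` would cover, the sketch below uses shell-style matching from Python's standard library. This is an assumption about wildcard semantics made for illustration; the text does not specify how patterns are evaluated, and the helper name `matches` is hypothetical.

```python
from fnmatch import fnmatchcase


def _normalize(name: str) -> str:
    """Lower-case a DNS name and drop the trailing root dot, if present."""
    return name.rstrip(".").lower()


def matches(pattern: str, qname: str) -> bool:
    """Shell-style match of a query name against a pattern ('*' spans labels)."""
    return fnmatchcase(_normalize(qname), _normalize(pattern))


pattern = "*.svc.cluster.local"
print(matches(pattern, "my-service.my-namespace.svc.cluster.local."))  # True
print(matches(pattern, "www.example.com."))                            # False
print(matches(pattern, "svc.cluster.local."))                          # False
```

Under this assumption the bare zone name `svc.cluster.local` does not match, because the pattern requires at least a dot before `svc`.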
The next step in DNS chaos engineering is to explore more advanced failure modes, such as simulated DNS spoofing or cache poisoning, which often require custom plugins or external tooling.