DNS resolution times are a surprisingly poor indicator of actual user-facing latency.
Let’s see this in action. Imagine a user trying to reach www.example.com. Their browser first needs to figure out the IP address for that hostname. This involves a DNS lookup.
dig www.example.com
Here’s what a typical output looks like:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12345
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;www.example.com. IN A
;; ANSWER SECTION:
www.example.com. 300 IN A 93.184.216.34
;; Query time: 25 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Mon Jan 01 12:00:00 UTC 2024
;; MSG SIZE rcvd: 59
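The Query time: 25 msec line is what we’re interested in. This is how long dig waited for a response from the resolver it queried (in this case, 192.168.1.1). If the resolver already had the record cached, that is little more than the network round trip to the resolver; on a cache miss, it also includes the time the resolver spent querying upstream DNS servers.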
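As a minimal sketch of pulling that number out of dig's output: the value is the fourth whitespace-separated field on the statistics line (";;", "Query", "time:", then the number). The function name extract_query_time is an illustrative choice, not part of dig.

```shell
#!/bin/bash
# Print the millisecond value from dig's ";; Query time: N msec" line.
# Reads dig output on stdin; prints nothing if the line is absent.
extract_query_time() {
  awk '/Query time:/ { print $4; exit }'
}
```

Piping the sample output above through extract_query_time would print the bare number, which is convenient for comparisons in a script.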
Now, why is this misleading for user experience? Because the next step, connecting to 93.184.216.34, can take much longer, and that’s what the user actually experiences. The DNS lookup might be lightning fast, but if the IP address returned points to a server on the other side of the world with a slow network path, the user will perceive high latency.
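One way to see the gap between lookup time and what the user waits for is curl's write-out timers, which report cumulative seconds for name lookup, TCP connect, and the whole transfer. The sketch below assumes curl is installed; the function names phase_breakdown and measure_url are hypothetical helpers for illustration.

```shell
#!/bin/bash
# Split curl's cumulative timers into per-phase durations in milliseconds.
# Arguments: time_namelookup time_connect time_total (seconds, as curl prints them).
phase_breakdown() {
  awk -v dns="$1" -v conn="$2" -v total="$3" 'BEGIN {
    printf "dns=%.0fms connect=%.0fms rest=%.0fms\n",
      dns * 1000, (conn - dns) * 1000, (total - conn) * 1000
  }'
}

# Fetch a URL, discard the body, and report where the time went.
# Requires network access.
measure_url() {
  local timings dns conn total
  timings=$(curl -s -o /dev/null \
    -w '%{time_namelookup} %{time_connect} %{time_total}' "$1") || return 1
  read -r dns conn total <<< "$timings"
  phase_breakdown "$dns" "$conn" "$total"
}
```

Running measure_url https://www.example.com prints one line with the three phases. Here "rest" covers everything after the TCP connection is established, including any TLS handshake and the response itself, so a small dns value next to a large connect or rest value is exactly the situation described above.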
To truly monitor DNS resolution times and set up alerts, we need a system that can periodically query DNS records and measure the total time from the client’s perspective. We’ll use dig for querying and a simple shell script to automate this and check against thresholds.
First, let’s set up a script that performs the lookup and extracts the query time. We’ll target a reliable, well-known domain like google.com to avoid issues with the domain itself being slow to respond.
#!/bin/bash
DOMAIN="google.com"
RESOLVER="8.8.8.8" # Google's Public DNS
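# Perform the DNS lookup and extract the query time.
# +noall +stats prints only the statistics block; +short would suppress the "Query time" line entirely.
# The statistics line reads ";; Query time: 25 msec", so the number is the fourth field.
QUERY_TIME=$(dig @"$RESOLVER" "$DOMAIN" +noall +stats +time=1 | awk '/Query time:/ {print $4}')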
# Check if QUERY_TIME was successfully extracted
if [[ -z "$QUERY_TIME" ]]; then
  echo "Error: Could not retrieve DNS query time for $DOMAIN via $RESOLVER."
  exit 1
fi
echo "DNS query time for $DOMAIN via $RESOLVER: $QUERY_TIME msec"
# Set your alert threshold (in milliseconds)
ALERT_THRESHOLD=100
if (( QUERY_TIME > ALERT_THRESHOLD )); then
  echo "ALERT: DNS resolution time ($QUERY_TIME msec) exceeds threshold ($ALERT_THRESHOLD msec)!"
  # In a real-world scenario, you'd send an email, Slack message, PagerDuty alert, etc.
  # mail -s "DNS Alert: High Resolution Time" admin@example.com <<< "DNS resolution time for $DOMAIN via $RESOLVER is $QUERY_TIME msec, exceeding threshold of $ALERT_THRESHOLD msec."
else
  echo "DNS resolution time is within acceptable limits."
fi
Save this script as check_dns.sh, make it executable (chmod +x check_dns.sh), and run it. This script queries Google’s public DNS server (8.8.8.8) for google.com and checks if the reported query time exceeds 100 milliseconds.
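The commented-out mail line in the script is the natural place to hook in a real notification. As one hedged sketch, a Slack incoming webhook accepts a JSON body with a "text" field. The function names and the SLACK_WEBHOOK_URL environment variable below are assumptions for illustration, and the payload builder assumes its arguments contain no quotes or backslashes.

```shell
#!/bin/bash
# Build a minimal JSON payload for a Slack incoming webhook.
# Arguments: domain resolver query_time_ms threshold_ms
# Assumes the arguments contain no characters that need JSON escaping.
build_alert_payload() {
  printf '{"text":"DNS alert: %s via %s took %s ms (threshold %s ms)"}\n' \
    "$1" "$2" "$3" "$4"
}

# POST the payload to the webhook. SLACK_WEBHOOK_URL is a hypothetical
# variable you would export yourself; the call fails fast if it is unset.
send_slack_alert() {
  : "${SLACK_WEBHOOK_URL:?set SLACK_WEBHOOK_URL to your webhook address}"
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "$(build_alert_payload "$@")" "$SLACK_WEBHOOK_URL"
}
```

Inside the alert branch, a call such as send_slack_alert "$DOMAIN" "$RESOLVER" "$QUERY_TIME" "$ALERT_THRESHOLD" would take the place of the mail command. Keeping the payload builder separate from the network call makes the message format easy to check without sending anything.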
To make this a robust monitoring solution, you’ll want to:
- Run it periodically: Use cron to schedule check_dns.sh to run every 5 minutes. Open your crontab with crontab -e and add this entry:
  */5 * * * * /path/to/your/check_dns.sh >> /var/log/dns_check.log 2>&1
  This ensures continuous monitoring.
- Monitor from multiple locations: DNS resolution times can vary significantly based on your network’s proximity and peering to DNS servers. Deploy this script (or a similar check) on servers in different geographic regions or data centers.
- Alerting mechanism: The script includes a commented-out mail command. Integrate this with your actual alerting system (e.g., PagerDuty, Opsgenie, Slack webhooks) to notify the right people when an alert is triggered.
- Track historical data: Instead of just alerting, send the QUERY_TIME to a time-series database like Prometheus or InfluxDB. This allows you to visualize trends and set more sophisticated alerts based on moving averages or percentile changes. For Prometheus, you could use node_exporter’s textfile collector or a custom exporter.
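For the historical-data item, here is a minimal sketch of feeding node_exporter's textfile collector, which reads files ending in .prom from the directory given by its --collector.textfile.directory flag. The function name, metric name, and label names are illustrative assumptions, not an established convention.

```shell
#!/bin/bash
# Write one DNS timing sample in Prometheus text exposition format.
# Arguments: output_dir domain resolver query_time_ms
# The metric and label names are hypothetical choices for this sketch.
write_dns_metric() {
  local dir="$1" domain="$2" resolver="$3" ms="$4" tmp
  # Write to a temp file first; its name does not end in .prom,
  # so the collector ignores it until the rename below.
  tmp=$(mktemp "$dir/dns_check.prom.XXXXXX") || return 1
  {
    echo "# HELP dns_query_time_milliseconds Query time reported by dig."
    echo "# TYPE dns_query_time_milliseconds gauge"
    echo "dns_query_time_milliseconds{domain=\"$domain\",resolver=\"$resolver\"} $ms"
  } > "$tmp"
  # mktemp creates the file readable only by its owner; open it up
  # in case node_exporter runs as a different user.
  chmod 644 "$tmp"
  # Rename within the same directory so readers never see a partial file.
  mv "$tmp" "$dir/dns_check.prom"
}
```

Calling write_dns_metric with the collector directory, "$DOMAIN", "$RESOLVER", and "$QUERY_TIME" at the end of check_dns.sh would expose the latest sample on each scrape, leaving thresholds and trend alerts to Prometheus rules rather than to the shell script.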
The common mistake is to only check DNS resolution from one internal server. This doesn’t tell you if users outside your network are experiencing slow lookups. The dig command’s Query time metric is only the time taken by the resolver to respond to your query, not the entire round trip from the end-user’s machine to the authoritative DNS server and back, nor does it account for the time it takes to establish a connection to the IP address returned.
The next problem you’ll encounter is distinguishing between slow DNS resolution and slow application response times.