BIND is choking because it is trying to open more file descriptors than the operating system allows, usually because of a sharp increase in DNS queries.
Common Causes and Fixes
- Insufficient OS File Descriptor Limit (`ulimit -n`)
  - Diagnosis: Run `ulimit -n` as the BIND user (often `named`) to see the current limit. Then check the system-wide limits in `/etc/security/limits.conf` or `/etc/security/limits.d/*`. Look for lines like `* soft nofile XXXX` and `* hard nofile XXXX`.
  - Fix: Edit `/etc/security/limits.conf` and add or modify lines for the `named` user (or `*` to affect all users, though a specific user is better):

        named soft nofile 65536
        named hard nofile 131072

    Then make sure the PAM `limits` module is enabled by checking `/etc/pam.d/system-auth` (or similar) for a line like `session required pam_limits.so`.
  - Why it works: This directly raises the maximum number of file handles a single process can have open, which BIND needs for its sockets, zone files, and other internal operations. You need to reboot or restart the `named` service for the change to take effect. Because PAM applies these limits at session start, a `named` started by systemd may not pick them up (see Systemd Service File Limits).
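`ulimit -n` only reports the limit of the shell it runs in, so it can help to read the limits of the running process directly. The following is a minimal sketch that assumes a Linux system with `/proc`; the helper names `nofile_limits` and `fd_count` are illustrative and not part of BIND or any standard tool.

```shell
# Print the soft and hard "Max open files" limits of a running process.
nofile_limits() {
    awk '/^Max open files/ { print $4, $5 }' "/proc/$1/limits"
}

# Count the file descriptors a process currently has open.
# Reading another user's /proc/PID/fd usually requires root.
fd_count() {
    ls "/proc/$1/fd" | wc -l
}

# Example: inspect the oldest named process, if one is running.
if pid=$(pgrep -o named 2>/dev/null); then
    echo "limits (soft hard): $(nofile_limits "$pid")"
    echo "open descriptors:   $(fd_count "$pid")"
fi
```

If the open-descriptor count sits close to the soft limit, the limit is the likely bottleneck.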
- BIND Configuration (`files`)
  - Diagnosis: Check your `named.conf` file for the `files` option, which is the BIND 9 option that caps the number of open files (BIND 9 has no `max-files` option). If it is set too low, it limits BIND's file descriptor usage even when the OS limit is high.
  - Fix: In your `named.conf` (or included files), set `files` to a value no higher than the OS hard limit (`ulimit -Hn`). A larger value gives no extra headroom, because the OS limit still applies. The default, `unlimited`, leaves only the OS limit in force. A common setting is:

        options {
            directory "/var/named";
            // ... other options
            files 65536; // Or match your ulimit -n
        };

    Then reload BIND with `rndc reload`. If the new limit is not picked up, restart `named`.
  - Why it works: This is BIND's internal mechanism to limit its own file descriptor usage, acting as a secondary safeguard. Setting it too low can cause BIND to stop accepting new connections or opening files before the OS limit is hit.
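To compare BIND's own cap with the OS limit, the configured value can be pulled out of `named.conf`. This is a rough sketch: `conf_option_value` is a hypothetical helper, the `/etc/named.conf` path is an assumption, and the parsing is deliberately simplistic (it ignores `include` files and block comments). The BIND 9 reference manual documents the open-file limit under the option name `files`, so that is the name passed here.

```shell
# Print the first value assigned to OPTION in a named.conf-style FILE.
# Matches "option value;" at the start of a line; comments after the ';' are ignored.
conf_option_value() {
    sed -nE "s/^[[:space:]]*$2[[:space:]]+([^;[:space:]]+)[[:space:]]*;.*/\1/p" "$1" | head -n 1
}

conf=/etc/named.conf   # assumed path; adjust for your distribution
if [ -r "$conf" ]; then
    echo "files option: $(conf_option_value "$conf" files)"
    echo "hard limit:   $(ulimit -Hn)"
fi
```

An empty `files option` line means the option is not set in that file, in which case BIND's default applies.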
- Ephemeral Port Exhaustion (Less direct, but related)
  - Diagnosis: While not strictly "open files," a high volume of outgoing connections (e.g., for zone transfers or recursive lookups) can exhaust the available ephemeral ports. Check `netstat -s | grep "out-of-sockets"` or `netstat -an | grep TIME_WAIT | wc -l`.
  - Fix: Increase the range of ephemeral ports and shorten `tcp_fin_timeout`. Edit `/etc/sysctl.conf`:

        net.ipv4.ip_local_port_range = 1024 65535
        net.ipv4.tcp_fin_timeout = 30

    Apply with `sysctl -p`.
  - Why it works: A larger port range provides more ports for outgoing connections. A shorter `tcp_fin_timeout` releases orphaned connections waiting in the `FIN-WAIT-2` state sooner. It does not shorten `TIME_WAIT`, which is fixed at 60 seconds on Linux, so the wider port range is what relieves `TIME_WAIT` pressure.
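To put numbers on the port budget, the size of the ephemeral range can be compared with the count of sockets in `TIME_WAIT`. This is a small sketch for Linux; `port_range_size` is an illustrative helper name, and `ss` (from iproute2) is used as an assumed alternative to the `netstat` commands above.

```shell
# Number of ports in an inclusive LOW..HIGH range.
port_range_size() {
    echo $(( $2 - $1 + 1 ))
}

range_file=/proc/sys/net/ipv4/ip_local_port_range
if [ -r "$range_file" ]; then
    # The file holds "LOW HIGH"; unquoted expansion splits it into two arguments.
    echo "ephemeral ports available: $(port_range_size $(cat "$range_file"))"
fi

if command -v ss >/dev/null 2>&1; then
    # tail drops the header line that ss prints before the socket list.
    echo "TCP sockets in TIME_WAIT:  $(ss -tan state time-wait | tail -n +2 | wc -l)"
fi
```

For the range in the fix above, `port_range_size 1024 65535` yields 64512 usable ports.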
- Excessive Zone Files or Cache Entries
  - Diagnosis: If you run a master for many large zones, BIND naturally opens more files (zone files) as it loads them. A recursive server with a huge cache mainly adds memory pressure rather than open files. Check the number of `.zone` files in your BIND directory and estimate the cache size.
  - Fix: For master zones, consider splitting large zones or optimizing zone file loading. For recursive servers, tune cache parameters in `named.conf` (for example `max-cache-size`) to manage memory usage and reduce the number of active cache entries if memory is a constraint, though this is less about file descriptors directly.
  - Why it works: Each zone file needs to be opened and read. A very large number of active cache entries also consumes resources, which can indirectly add to the pressure on a server that is already close to its limits.
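Counting the zone files takes a single `find` call. The sketch below assumes zone files carry a `.zone` suffix and live under `/var/named` (the directory used in the `options` example). Both are conventions rather than requirements, and `count_zone_files` is an illustrative helper name.

```shell
# Count regular files ending in .zone anywhere under a directory.
count_zone_files() {
    find "$1" -type f -name '*.zone' | wc -l
}

zone_dir=/var/named   # assumed location; adjust to your "directory" option
if [ -d "$zone_dir" ]; then
    echo "zone files: $(count_zone_files "$zone_dir")"
fi
```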
- Leaky File Descriptor Usage (Bug or Misconfiguration)
  - Diagnosis: Use `lsof -p "$(pgrep -d, named)"` to see what files `named` has open. If you see an ever-increasing number of similar file types (especially sockets or pipes) that never close, it might indicate a bug or a configuration issue that keeps resources from being released.
  - Fix: This is the hardest to fix without deep analysis. It might involve upgrading BIND to the latest stable version, or carefully reviewing BIND's logging configuration (the `logging` statement in `named.conf`) to ensure it is not overwhelming itself. Sometimes, specific `acl` or `allow-query` configurations can lead to unexpected connection storms.
  - Why it works: Identifying the specific file descriptor leak allows for targeted remediation, whether that is a software bug fix, a configuration tweak, or a workaround.
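Where `lsof` is unavailable, or when you want a summary that is easy to compare over time, the descriptors can be grouped by type straight from `/proc`. This is a Linux-only sketch, and `fd_summary` is a hypothetical helper name. Run it a few minutes apart: a category that only ever grows is a candidate leak.

```shell
# Print a count of open descriptors per type for the given PID.
# Needs root (or the same user) to read another process's fd directory.
fd_summary() {
    for fd in /proc/"$1"/fd/*; do
        # A descriptor can close between the glob and readlink; skip it then.
        target=$(readlink "$fd") || continue
        case $target in
            socket:*)     echo socket ;;
            pipe:*)       echo pipe ;;
            anon_inode:*) echo anon_inode ;;
            *)            echo file ;;
        esac
    done | sort | uniq -c
}

# Example: summarize the oldest named process, if one is running.
if pid=$(pgrep -o named 2>/dev/null); then
    fd_summary "$pid"
fi
```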
- Systemd Service File Limits
  - Diagnosis: If your OS uses systemd, the `named` service gets its file descriptor limit from its unit file (or from systemd's defaults), not from `/etc/security/limits.conf`, because services started by systemd do not go through a PAM login session. Check the unit file, typically located at `/usr/lib/systemd/system/named.service`, and any override such as `/etc/systemd/system/named.service.d/override.conf`. Look for `LimitNOFILE=`.
  - Fix: Edit the systemd service file, or preferably create an override file in `/etc/systemd/system/named.service.d/` so that package upgrades do not overwrite the change, to set a higher limit:

        [Service]
        LimitNOFILE=65536

    After editing, reload the systemd daemon with `systemctl daemon-reload`, then restart BIND with `systemctl restart named`.
  - Why it works: systemd controls resource limits per service, and its `LimitNOFILE` directive, rather than the `ulimit` settings of login shells, determines the limit `named` starts with.
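The override can also be created from a script, for example when provisioning servers. This is a minimal sketch: `write_nofile_dropin` is an illustrative helper name, and it takes the drop-in directory as a parameter so it can be tried against a scratch directory first. The real commands need root and are left commented out.

```shell
# Write a systemd drop-in that sets LimitNOFILE for a service.
#   $1 = drop-in directory, $2 = limit
write_nofile_dropin() {
    mkdir -p "$1" &&
    printf '[Service]\nLimitNOFILE=%s\n' "$2" > "$1/override.conf"
}

# Usage on a real system (as root):
#   write_nofile_dropin /etc/systemd/system/named.service.d 65536
#   systemctl daemon-reload
#   systemctl restart named
#   systemctl show named -p LimitNOFILE   # confirm the value systemd applied
```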
After fixing these, you’ll likely hit a network unreachable error if your DNSSEC validation is misconfigured or your upstream resolvers are down.