BIND is choking because it is trying to open more file descriptors than the operating system allows, usually because a surge in DNS queries forces it to hold many more sockets open at once.

Common Causes and Fixes

  1. Insufficient OS File Descriptor Limit (ulimit -n)

    • Diagnosis: Run ulimit -n as the BIND user (often named) to see the limit that user's sessions receive. This is not necessarily the limit of the running daemon; for that, read the "Max open files" row in /proc/<pid>/limits, where <pid> is the process ID of named. Then, check the configured limits in /etc/security/limits.conf and /etc/security/limits.d/*. Look for lines like * soft nofile XXXX and * hard nofile XXXX.
    • Fix: Edit /etc/security/limits.conf and add or modify lines for the named user (or * to affect all users, though a specific user is better):
      named soft nofile 65536
      named hard nofile 131072
      
      Then, ensure the PAM limits module is enabled by checking /etc/pam.d/system-auth (or similar) for a line like session required pam_limits.so.
    • Why it works: This raises the maximum number of file descriptors a process running as that user can hold open, which BIND needs for its sockets, zone files, and log files. pam_limits applies these values when a session starts, so a running named keeps its old limit until you restart the service (or reboot). If named is started by systemd, these settings do not reach it; see cause 6.
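
The check against /proc can be scripted. The following is a minimal sketch, assuming a Linux system; the function name nofile_limits is hypothetical and only reads a limits file, it changes nothing.

```shell
# Print the soft and hard "Max open files" limits from a limits file.
# $1: path to the file, e.g. /proc/1234/limits (hypothetical PID)
nofile_limits() {
    # The row looks like: "Max open files  1024  524288  files",
    # so the soft limit is field 4 and the hard limit is field 5.
    awk '/^Max open files/ { print $4, $5 }' "$1"
}

# Usage (assumes named is running; pgrep -o picks the oldest match):
#   nofile_limits "/proc/$(pgrep -o -x named)/limits"
```

If the soft limit printed here is lower than the value you configured, the daemon was started before the change or by a mechanism that does not read limits.conf.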
  2. BIND Configuration (files)

    • Diagnosis: Check the options block of your named.conf for a files statement. (The option is spelled files; BIND has no max-files option, and named-checkconf rejects unknown options.) If it’s set too low, it limits BIND’s file descriptor usage even if the OS limit is high.
    • Fix: In your named.conf (or included files), set files high enough for your load, typically the same value as the OS limit. The default is unlimited, so removing a low value is also an option. A common setting is:
      options {
          directory "/var/named";
          // ... other options
          files 65536; // Or match your ulimit -n
      };
      
      Then, validate the configuration with named-checkconf and reload BIND: rndc reload.
    • Why it works: This is BIND’s own cap on the number of files it may have open concurrently, acting as a secondary safeguard. Setting it too low makes named stop accepting new connections or opening files before the OS limit is hit.
  3. Ephemeral Port Exhaustion (Less direct, but related)

    • Diagnosis: While not strictly "open files," a high volume of outgoing TCP connections (e.g., for zone transfers or recursive lookups over TCP) can exhaust the available ephemeral ports. Check the socket summary with ss -s, and count lingering connections with netstat -an | grep TIME_WAIT | wc -l.
    • Fix: Widen the range of ephemeral ports and shorten how long orphaned connections linger. Edit /etc/sysctl.conf:
      net.ipv4.ip_local_port_range = 1024 65535
      net.ipv4.tcp_fin_timeout = 30
      
      Apply with sysctl -p.
    • Why it works: A larger port range provides more local ports for outgoing connections. A shorter tcp_fin_timeout releases orphaned connections stuck in FIN_WAIT_2 sooner. It does not shorten TIME_WAIT, which Linux holds for a fixed 60 seconds, so a large TIME_WAIT count is relieved mainly by the wider port range.
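
To see which TCP states dominate rather than only counting TIME_WAIT, the connection list can be tallied per state. This is a rough sketch, assuming the default column layout of ss -tan (state in the first column, one header line); count_tcp_states is a hypothetical helper name.

```shell
# Read `ss -tan` output on stdin and print one "STATE count" line per state.
count_tcp_states() {
    # NR > 1 skips the header row; $1 is the state column.
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage:
#   ss -tan | count_tcp_states
```

A count of TIME-WAIT sockets approaching the size of the port range points at port exhaustion; a large FIN-WAIT-2 count is the case tcp_fin_timeout addresses.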
  4. Excessive Zone Files or Cache Entries

    • Diagnosis: If you’re running a primary (master) for many zones, BIND opens each zone file, and any journal file, while loading or writing it, so descriptor usage peaks at startup and reload. On a recursive server, cache entries live in memory and hold no descriptors; what grows with load is the number of sockets open for outstanding queries and TCP clients. Check the number of .zone files in your BIND directory, and run rndc status to see the zone count and the current recursive and TCP client counts.
    • Fix: For primary zones, reduce how many zone files are loaded at once, for example by reloading individual zones with rndc reload <zone> instead of reloading everything. For recursive servers, cap concurrent work with recursive-clients and tcp-clients in named.conf. Cache settings such as max-cache-size manage memory usage, not file descriptors.
    • Why it works: Each zone file needs a descriptor while it is read, and each outstanding recursive query or TCP client holds a socket. Limiting how many are active at the same time bounds the descriptor count.
  5. Leaky File Descriptor Usage (Bug or Misconfiguration)

    • Diagnosis: Use lsof -p "$(pgrep -d, -x named)" to see what files named has open (pgrep -d, joins multiple PIDs with commas, the form lsof -p expects). Run it several times a few minutes apart. If the number of entries of one type (especially sockets or pipes) keeps growing and never drops under steady load, it might indicate a bug or a configuration issue that prevents resources from being released.
    • Fix: This is the hardest to fix without deep analysis. It might involve upgrading BIND to the latest stable version, or carefully reviewing the logging statement in named.conf, since every file channel keeps a descriptor open. Sometimes, an overly permissive acl or allow-query configuration exposes the server to connection storms that look like a leak.
    • Why it works: Identifying the specific file descriptor leak allows for targeted remediation, whether it’s a software bug fix, a configuration tweak, or a workaround.
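
Comparing snapshots is easier when the lsof output is reduced to a count per descriptor type. The following is a sketch only, assuming lsof's default nine-column layout (TYPE in the fifth column); fd_types is a hypothetical helper name.

```shell
# Read `lsof -p <pids>` output on stdin and print "count TYPE",
# most frequent type first.
fd_types() {
    # NR > 1 skips the header row; $5 is the TYPE column (REG, IPv4, FIFO, ...).
    awk 'NR > 1 { n[$5]++ } END { for (t in n) print n[t], t }' | sort -rn
}

# Usage: take a snapshot now and another a few minutes later, then compare.
#   lsof -p "$(pgrep -d, -x named)" | fd_types
```

A type whose count rises between snapshots while query load stays flat is the place to start looking.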
  6. Systemd Service File Limits

    • Diagnosis: If your OS uses systemd, the named service gets its file descriptor limit from systemd, not from /etc/security/limits.conf, because pam_limits only applies to PAM sessions. Check the effective value with systemctl show named -p LimitNOFILE, and look for LimitNOFILE= in the unit file, typically located at /usr/lib/systemd/system/named.service, or in a drop-in such as /etc/systemd/system/named.service.d/override.conf.
    • Fix: Create an override file in /etc/systemd/system/named.service.d/ (systemctl edit named does this for you) to set a higher limit. Prefer the override to editing the packaged unit file, which package updates can overwrite:
      [Service]
      LimitNOFILE=65536
      
      After editing, reload the systemd daemon with systemctl daemon-reload, and then restart BIND with systemctl restart named.
    • Why it works: Systemd sets resource limits per service, and its LimitNOFILE directive (or systemd's default when the directive is absent) is the limit named actually runs under, regardless of what limits.conf says.
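
The override can also be written from a script, for example in provisioning. This is a minimal sketch: write_nofile_override is a hypothetical helper, and the unit name named is an assumption (some distributions ship the service under a different name).

```shell
# Write a systemd drop-in that sets LimitNOFILE.
# $1: drop-in directory, $2: descriptor limit
write_nofile_override() {
    mkdir -p "$1" &&
    printf '[Service]\nLimitNOFILE=%s\n' "$2" > "$1/override.conf"
}

# Usage (as root):
#   write_nofile_override /etc/systemd/system/named.service.d 65536
#   systemctl daemon-reload && systemctl restart named
#   systemctl show named -p LimitNOFILE   # confirm the new value
```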

With the descriptor limit fixed, the next error you may hit is network unreachable, which appears when DNSSEC validation is misconfigured or your upstream resolvers are down.
