The Datadog Forwarder failed to send logs to the Datadog API because an upstream network component rejected its connection attempts too many times.
This usually means the Forwarder, the component of the Datadog Agent on your hosts that ships data to Datadog, can’t reach the Datadog API endpoints. Datadog itself is rarely down; more often the network path between your hosts and Datadog is blocked or misconfigured.
Cause 1: Network ACLs or Security Groups Blocking Outbound Traffic
Your cloud provider’s network access control lists (ACLs) or security groups are likely preventing the Forwarder from making outbound HTTPS (TCP port 443) connections to Datadog’s IP ranges.
- Diagnosis: On an AWS EC2 instance, check the outbound rules of the Security Groups attached to the instance, and the network ACL of its subnet. On GCP, check your egress Firewall rules. Look for a rule that allows 0.0.0.0/0 (or specific Datadog IPs) on TCP port 443.
- Fix: Add or modify your security group/firewall rules to explicitly allow outbound TCP traffic on port 443 to Datadog’s IP ranges. For example, in AWS, you’d edit the Security Group attached to your EC2 instances and add an outbound rule: Type: Custom TCP, Port Range: 443, Destination: 0.0.0.0/0. If you know Datadog’s IPs, you can restrict it further, but 0.0.0.0/0 is common for general outbound access.
- Why it works: This explicitly permits the Forwarder to initiate connections to Datadog’s API servers over the internet.
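To confirm from the host that outbound port 443 is actually open, you can attempt a plain TCP connection. The following is a minimal sketch, assuming bash and the coreutils timeout command are installed; check_tcp is a hypothetical helper name, and api.datadoghq.com is used only because it is the endpoint named in this article.

```shell
#!/usr/bin/env bash
# check_tcp HOST PORT: return 0 if a TCP connection opens within 5 seconds.
# Hypothetical helper; relies on bash's /dev/tcp pseudo-device and coreutils `timeout`.
check_tcp() {
  local host="$1" port="$2"
  timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example call (needs network access, so it is left commented out):
# check_tcp api.datadoghq.com 443 && echo "port 443 reachable" || echo "port 443 blocked"
```

A failure here only says that the TCP handshake did not complete; it does not tell you which rule or device along the path dropped the connection.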
Cause 2: Proxy Configuration Issues
If your network requires outbound traffic to go through a proxy server, the Datadog Forwarder might not be configured to use it, or the proxy itself might be misconfigured or overloaded.
- Diagnosis: Check the Datadog Agent configuration file (usually /etc/datadog-agent/datadog.yaml on Linux or %ProgramData%\Datadog\datadog.yaml on Windows). Look for the proxy section and its http and https settings. Also, check your proxy server’s logs for connection refusals or errors related to Datadog’s API endpoints (e.g., api.datadoghq.com).
- Fix: Ensure the https and http settings under proxy in datadog.yaml point to your valid proxy server and port. Example:

      # In datadog.yaml
      proxy:
        https: http://your-proxy.example.com:8080
        http: http://your-proxy.example.com:8080

  If the proxy is the issue, you’ll need to troubleshoot your proxy server itself.
- Why it works: This directs the Forwarder’s outbound requests through the designated proxy, which then forwards them to Datadog’s API.
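Before changing the agent configuration, it can help to check whether the proxy itself will carry a request to Datadog. The sketch below assumes curl is installed; check_via_proxy is a hypothetical helper, and the proxy URL in the example call is the placeholder from this article, not a real server.

```shell
# check_via_proxy PROXY_URL [TARGET_URL]: return 0 if curl receives any HTTP
# response from the target through the proxy within 10 seconds.
# Hypothetical helper; the default target is the endpoint named in this article.
check_via_proxy() {
  local proxy="$1" target="${2:-https://api.datadoghq.com}"
  curl --silent --show-error --output /dev/null --max-time 10 \
       --proxy "$proxy" "$target"
}

# Example call (placeholder proxy; needs network access, so it is commented out):
# check_via_proxy http://your-proxy.example.com:8080 && echo "proxy path OK"
```

If this succeeds from the host but the Forwarder still fails, the proxy is probably fine and the agent’s own proxy settings are the more likely culprit.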
Cause 3: DNS Resolution Problems
The Forwarder might be unable to resolve Datadog’s API endpoints (e.g., api.datadoghq.com) to their correct IP addresses.
- Diagnosis: On the host running the Forwarder, try to ping or curl the Datadog API endpoint:

      ping api.datadoghq.com
      # or
      curl -v https://api.datadoghq.com

  If these commands fail with "unknown host" or similar, DNS is the problem. Check your host’s /etc/resolv.conf (Linux) or network adapter settings (Windows) to ensure it’s pointing to valid DNS servers.
- Fix: Correct your host’s DNS configuration to use working DNS servers. For example, update /etc/resolv.conf to point to Google’s DNS (8.8.8.8) or your internal DNS server:

      nameserver 8.8.8.8
      nameserver 8.8.4.4

  Then restart the Datadog agent.
- Why it works: Correct DNS resolution provides the Forwarder with the actual IP addresses it needs to connect to Datadog’s servers.
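The DNS checks above can be wrapped into two small helpers. This is a sketch under a few assumptions: the host is Linux, awk is available, and getent (shipped with glibc) is present. Both function names are hypothetical.

```shell
# list_nameservers [FILE]: print the nameserver addresses from a
# resolv.conf-style file (defaults to /etc/resolv.conf).
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# resolves HOST: return 0 if the system's name service returns an address.
# Uses getent, which also consults /etc/hosts according to nsswitch.conf.
resolves() {
  getent hosts "$1" >/dev/null
}

# Example calls (the second needs working DNS, so both are left commented out):
# list_nameservers
# resolves api.datadoghq.com || echo "DNS lookup failed"
```

On hosts where /etc/resolv.conf is generated by another service, manual edits may be overwritten, so the listed nameservers are worth re-checking after a reboot or network restart.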
Cause 4: Datadog Agent Not Running or Malfunctioning
While less common for "too many retries" (more likely for "connection refused" or "host unreachable"), a severely degraded agent state could manifest this way.
- Diagnosis: Check the agent’s status:

      sudo datadog-agent status

  Look for any errors reported in the output or in the agent’s logs (/var/log/datadog/agent.log or similar).
- Fix: Restart the Datadog agent:

      sudo systemctl restart datadog-agent
      # or, on hosts without systemd:
      sudo service datadog-agent restart

  If issues persist, consider reinstalling the agent.
- Why it works: A fresh restart can clear internal state corruption or resource exhaustion within the agent process.
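To skim the agent log for problems without reading it end to end, a small filter is often enough. The sketch below is an assumption-laden convenience: recent_errors is a hypothetical helper, the default path is the one mentioned above, and matching on the word "error" is a guess at what is worth reading rather than a documented log format.

```shell
# recent_errors [FILE] [COUNT]: print the last COUNT lines that mention "error"
# (case-insensitive). Defaults: /var/log/datadog/agent.log and 20 lines.
recent_errors() {
  local file="${1:-/var/log/datadog/agent.log}" count="${2:-20}"
  grep -i 'error' "$file" | tail -n "$count"
}

# Example call (reading the real log may require sudo):
# recent_errors /var/log/datadog/agent.log 10
```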
Cause 5: Network Latency or Packet Loss
High network latency or consistent packet loss between your host and Datadog’s API endpoints can cause TCP connection attempts to time out, leading to retries and eventual failure.
- Diagnosis: Use mtr (My Traceroute) or ping with a larger packet size for an extended period to check for packet loss and latency to api.datadoghq.com.

      mtr --report api.datadoghq.com

  Look for any hops showing significant packet loss (above 1-2%) or consistently high latency.
- Fix: This often requires network infrastructure troubleshooting. You might need to involve your network team to identify and resolve routing issues, congestion, or firewall performance problems along the path to Datadog. There’s no direct agent-level fix.
- Why it works: Reducing latency and eliminating packet loss allows TCP connections to establish reliably and quickly.
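If mtr is not installed, the summary line that ping prints gives a rough packet-loss figure. The sketch below extracts that number so it can be compared against the 1-2% guideline above; loss_percent is a hypothetical helper, and it assumes ping prints a summary line containing "N% packet loss", as the common Linux and macOS versions do.

```shell
# loss_percent: read ping output on stdin and print the packet-loss percentage
# from its summary line (for example "0" or "12.5").
loss_percent() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example call (needs network access, so it is left commented out):
# ping -c 50 api.datadoghq.com | loss_percent
```

Some networks drop ICMP entirely, so 100% loss from ping does not by itself prove that HTTPS traffic on port 443 is affected.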
Cause 6: Datadog API Rate Limiting (Less Common for this Specific Error)
While "Too Many Retries" usually points to network issues before traffic reaches Datadog, in rare cases many agents submitting data at the same time can exceed your account’s rate limits, and Datadog may temporarily reject requests. This is more likely to appear as 429 Too Many Requests errors in the agent logs, but the forwarding layer might surface it only as retries.
- Diagnosis: Check your Datadog account’s usage and limits. Look for any notifications or alerts within Datadog about exceeding intake limits. This is very unlikely to be the primary cause of the "Too Many Retries" error seen at the network layer.
- Fix: If this is the case, you’ll need to scale up your Datadog plan or investigate why your agents are sending an excessive volume of data. This might involve filtering logs at the source or adjusting collection configurations.
- Why it works: Ensuring your data volume stays within your Datadog plan limits prevents the API from throttling your requests.
After resolving these, you might still encounter ERR: No such host or ERR: dial tcp: lookup api.datadoghq.com: no such host. Both indicate that DNS resolution is still failing, so revisit Cause 3.