Network Debugging: Beyond Ping

The real magic of tcpdump and Wireshark in distributed systems isn’t just seeing packets; it’s about spotting the absence of packets, or packets that arrive way too late, because that’s where the communication breakdowns happen.

Let’s say you have a microservice, service-a, that needs to fetch data from service-b. If service-a is hanging, you might think it’s a code bug in service-a. But often, the problem is that service-a sent its request to service-b, and service-b never got it, or service-b sent a response that never reached service-a.

Here’s a typical interaction: service-a (on 10.0.1.5) wants to talk to service-b (on 10.0.2.10).

# On service-a's host
sudo tcpdump -i eth0 host 10.0.2.10 -w /tmp/service-a-to-b.pcap

# On service-b's host
sudo tcpdump -i eth0 host 10.0.1.5 -w /tmp/service-b-from-a.pcap

Now, trigger the request from service-a. After a while, stop tcpdump (Ctrl+C) and grab the .pcap files. Open service-a-to-b.pcap in Wireshark on your local machine.

What you’re looking for first:

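SYN, SYN-ACK, ACK: The TCP handshake. If service-a sends a SYN (a request to connect) and never sees a SYN-ACK (the acknowledgment that service-b is ready to connect) back from 10.0.2.10, then either the SYN never reached service-b or the SYN-ACK was lost on the way back. The capture on service-b’s host tells you which: if the SYN shows up there, the forward path works and the problem is on service-b’s host or on the return path; if it does not, the SYN is being dropped by a firewall or router along the way.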
- Diagnosis: Filter Wireshark for tcp.flags.syn == 1 && tcp.flags.ack == 0 to see the SYNs that were sent, and for tcp.flags.syn == 1 && tcp.flags.ack == 1 to see the SYN-ACKs that came back. Check whether each SYN from 10.0.1.5 has a corresponding SYN-ACK from 10.0.2.10.
- Common Cause 1: Firewall Blockage. A firewall (network appliance, security group in cloud, iptables on the host) is silently dropping the SYN packets.
  - Fix: Check firewall rules on both service-a’s host and any intermediate network devices. For iptables on service-a’s host, list the OUTPUT chain (sudo iptables -L OUTPUT -nv) and confirm that traffic to 10.0.2.10 on the target port is accepted, either by an explicit ACCEPT rule or by the chain’s default policy. On service-b’s host, check the INPUT chain the same way (sudo iptables -L INPUT -nv) for traffic from 10.0.1.5. If using cloud security groups, verify ingress rules on service-b and egress rules on service-a allow traffic on the specific port (e.g., TCP 8080).
  - Why it works: Firewalls act as gatekeepers; if the rule doesn’t allow passage, packets are discarded without notification. Explicitly allowing the traffic permits the handshake.
- Common Cause 2: Network Routing Issue. The packet is sent, but there’s no route from service-a’s network to service-b’s network, or a router is misconfigured.
  - Fix: On service-a’s host, run traceroute 10.0.2.10 or mtr 10.0.2.10. If it times out before reaching service-b’s network, investigate the last known good hop’s routing configuration. This might involve checking the default gateway on service-a (ip route show) and peering configurations between routers.
  - Why it works: traceroute/mtr maps the path packets take. If the path breaks, it points to the router responsible for forwarding traffic to the next segment. Correcting its routing table or adjacency allows packets to flow.
- Common Cause 3: Host service-b is Down or Unreachable. service-b’s host is offline, its network interface is down, or it’s simply not responding.
  - Fix: On service-b’s host, check if the network interface is up (ip a) and if the service-b process is running and listening on the correct port (sudo ss -tulnp | grep 8080). If the host is unreachable, check its power status and network connectivity.
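  - Why it works: The SYN packet is a request to establish a connection. If the destination host is down, nothing answers and the connection attempt eventually times out. If the host is up but no process is listening on the port, its kernel normally replies with an RST instead of a SYN-ACK, which the client reports as "connection refused".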
- Common Cause 4: Network Address Translation (NAT) Mismatch. If NAT is involved (e.g., in cloud environments or corporate networks), the source IP address might be rewritten incorrectly, or the NAT device might not have a correct mapping for the return traffic.
  - Fix: Examine NAT configurations on your edge routers or cloud NAT gateways. Ensure that the outbound traffic from service-a is being translated with the correct public/internal IP, and that return traffic has a valid NAT mapping to reach service-a. Use tcpdump on the NAT device itself to see how it’s rewriting addresses.
  - Why it works: NAT modifies IP addresses. If this modification is wrong or incomplete, return packets won’t find their way back to the original source.
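
The handshake outcomes above can also be told apart from the client side without a packet capture. The following is a minimal sketch, not a replacement for tcpdump: the function name probe_tcp and its return labels are made up for illustration, and the mapping from outcome to cause is a rule of thumb (a firewall configured to reject rather than drop will also produce "refused").

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the result.

    "open"        -> handshake completed (SYN, SYN-ACK, ACK).
    "refused"     -> an RST came back: host reachable, nothing listening
                     (or a firewall that rejects instead of dropping).
    "timeout"     -> no answer to the SYN: silent drop, routing problem,
                     or host down.
    "unreachable" -> the local stack or a router reported no route.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "unreachable"
        raise  # anything else (e.g. DNS failure) is left to the caller
```

Run from service-a’s host, probe_tcp("10.0.2.10", 8080) returning "timeout" points toward Causes 1 to 4, while "refused" points toward a missing listener on service-b.
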
Data Transfer (PSH, ACK): The handshake completes (SYN, SYN-ACK, ACK all present), but data doesn’t seem to arrive at service-b (check service-b-from-a.pcap for data packets, tcp.flags.push == 1), or responses don’t arrive at service-a (check service-a-to-b.pcap for response data).
- Diagnosis: Filter Wireshark for tcp.flags.push == 1, or for tcp.len > 0 to catch every segment that carries payload (not all data segments set PSH). Look for the actual request payload from service-a and the response payload from service-b. You can also filter on tcp.analysis.retransmission to see if packets are being resent, which indicates loss.
- Common Cause 5: Network Congestion/Packet Loss. High traffic volumes on the network path can lead to dropped packets, especially if buffers in routers or switches are full.
  - Fix: Analyze tcpdump output for excessive retransmissions (tcp.analysis.retransmission in Wireshark). If seen, this indicates packet loss. The fix involves identifying and mitigating the congestion source. This might mean increasing bandwidth, optimizing traffic flow, or implementing Quality of Service (QoS) to prioritize critical traffic. For a quick test, try sending traffic during off-peak hours.
  - Why it works: Congestion causes intermediate devices to drop packets. Retransmissions are the TCP layer’s way of recovering from this loss, but if loss is persistent, it cripples performance. Reducing load or prioritizing traffic ensures packets get through.
- Common Cause 6: Application-Level Hang/Slowdown on service-b. service-b receives the request but is too busy, deadlocked, or waiting on another resource to process it and send a response.
  - Diagnosis: On service-b’s host, capture each direction separately. Run sudo tcpdump -i eth0 -w /tmp/service-b-receive.pcap 'src host 10.0.1.5 and dst port 8080' for incoming requests, and sudo tcpdump -i eth0 -w /tmp/service-b-send.pcap 'dst host 10.0.1.5 and src port 8080' for outgoing traffic. If service-b-receive.pcap shows the incoming request packet but service-b-send.pcap shows only bare ACKs and no response data after a significant delay, the bottleneck is within service-b.
  - Fix: Investigate service-b’s performance. Use application profiling tools, check its logs for errors or long-running operations, and monitor its CPU/memory usage (top, htop). The fix is application-specific: optimizing database queries, fixing deadlocks, scaling up service-b instances, or improving its internal algorithms.
  - Why it works: Network tools can only show packets arriving and departing. If packets arrive but no response leaves, the problem is no longer network-level; it’s within the application’s processing.
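
The split between "the network is slow" and "service-b is slow" can be approximated from the client by timing the handshake and the wait for the first response byte separately. This is a rough sketch under assumptions: time_request is a hypothetical helper, the request bytes are whatever your protocol expects, and the timings include client-side overhead, so treat them as indicative only.

```python
import socket
import time


def time_request(host: str, port: int, request: bytes, timeout: float = 10.0):
    """Return (connect_seconds, first_byte_seconds) for a single request.

    connect_seconds covers the TCP handshake; first_byte_seconds covers
    sending the request and waiting for the first byte of the response.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.monotonic()
        sock.sendall(request)
        first = sock.recv(1)  # blocks until data arrives; b"" if peer closed
        got_first = time.monotonic()
    if not first:
        raise ConnectionError("peer closed the connection without responding")
    return connected - start, got_first - connected
```

A fast connect followed by a long wait for the first byte matches Common Cause 6: the packets arrived, and the delay is inside service-b. A slow or failing connect points back at the network path.
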
FIN, RST: Clean connection termination (FIN) or abrupt closure (RST). If you see RST packets unexpectedly, it means a connection was forcibly closed.
- Diagnosis: Filter for tcp.flags.rst == 1.
- Common Cause 7: Application Crash or Process Termination. If service-b crashes or its process is killed (e.g., by the OOM killer), the kernel closes its sockets. Depending on the state of each socket, the peer sees either a FIN or an RST, and anything service-a sends on those connections afterwards is answered with an RST.
  - Fix: Check system logs on service-b’s host for crash messages (e.g., sudo journalctl -k | grep -i -E 'oom-killer|out of memory', or check /var/log/syslog, /var/log/messages). The fix involves addressing the root cause of the crash (memory leak, bug, resource exhaustion).
  - Why it works: An RST packet tells the peer to abandon the connection immediately. When the application dies, the kernel tears down its sockets and answers any further traffic for them with RSTs.
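
For a quick overview before opening Wireshark, the flag checks above can be approximated with a small script. This is a sketch only, assuming the capture is a classic pcap file (what tcpdump -w writes by default), taken on an Ethernet interface and carrying IPv4; pcapng files, VLAN tags, and IPv6 are not handled. The function name tally_tcp_flags is made up for illustration.

```python
import struct
from collections import Counter


def tally_tcp_flags(pcap_bytes: bytes) -> Counter:
    """Count SYN, SYN-ACK, FIN and RST segments in a classic pcap capture."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    (linktype,) = struct.unpack_from(endian + "I", pcap_bytes, 20)
    if linktype != 1:
        raise ValueError("only Ethernet captures (link type 1) are supported")

    counts = Counter()
    offset = 24  # the global header is 24 bytes
    while offset + 16 <= len(pcap_bytes):
        # Per-packet header: ts_sec, ts_usec, captured length, original length.
        _, _, incl_len, _ = struct.unpack_from(endian + "IIII", pcap_bytes, offset)
        frame = pcap_bytes[offset + 16 : offset + 16 + incl_len]
        offset += 16 + incl_len
        # 14-byte Ethernet header; EtherType 0x0800 is IPv4.
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue
        ip = frame[14:]
        ihl = (ip[0] & 0x0F) * 4  # IPv4 header length in bytes
        if ip[9] != 6 or len(ip) < ihl + 14:  # protocol 6 is TCP
            continue
        flags = ip[ihl + 13]  # TCP flags byte
        if flags & 0x02:  # SYN
            counts["SYN-ACK" if flags & 0x10 else "SYN"] += 1
        if flags & 0x01:
            counts["FIN"] += 1
        if flags & 0x04:
            counts["RST"] += 1
    return counts
```

For example, tally_tcp_flags(open("/tmp/service-a-to-b.pcap", "rb").read()) showing several SYNs and no SYN-ACKs suggests the handshake is failing, while a nonzero RST count is a cue to apply the tcp.flags.rst == 1 filter in Wireshark.
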

Once these are fixed, the next error you are likely to hit is a "connection timed out" if the fault lies with the other service in the exchange, or a "5xx Server Error" if service-b is now reachable but failing to process requests correctly.