The real magic of tcpdump and Wireshark in distributed systems isn’t just seeing packets; it’s about spotting the absence of packets, or packets that arrive way too late, because that’s where the communication breakdowns happen.
Let’s say you have a microservice, service-a, that needs to fetch data from service-b. If service-a is hanging, you might think it’s a code bug in service-a. But often, the problem is that service-a sent its request to service-b, and service-b never got it, or service-b sent a response that never reached service-a.
Here’s a typical interaction: service-a (on 10.0.1.5) wants to talk to service-b (on 10.0.2.10).
# On service-a's host
sudo tcpdump -i eth0 host 10.0.2.10 -w /tmp/service-a-to-b.pcap
# On service-b's host
sudo tcpdump -i eth0 host 10.0.1.5 -w /tmp/service-b-from-a.pcap
Now, trigger the request from service-a. After a while, stop tcpdump (Ctrl+C) and grab the .pcap files. Open service-a-to-b.pcap in Wireshark on your local machine.
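For a first impression before opening Wireshark, the same captures can be summarised on the command line. The sketch below is one way to do it, not a standard tool: `count_by_sender` is a hypothetical helper name, and it assumes IPv4 TCP lines in the text format that `tcpdump -nn -r` prints. If one host's capture shows packets from a sender that the other host's capture lacks, those packets were lost in between.

```shell
#!/bin/sh
# Sketch: count TCP packets per sending IP in tcpdump text output.
# Assumes lines shaped like:
#   12:00:00.000100 IP 10.0.1.5.40000 > 10.0.2.10.8080: Flags [S], seq 1, length 0
count_by_sender() {
    awk '$2 == "IP" && $4 == ">" && $6 == "Flags" {
        src = $3
        sub(/\.[0-9]+$/, "", src)   # strip the trailing .port
        n[src]++
    }
    END { for (s in n) print s, n[s] }' | sort
}

# Hypothetical usage against the two captures taken above:
#   tcpdump -nn -r /tmp/service-a-to-b.pcap   | count_by_sender
#   tcpdump -nn -r /tmp/service-b-from-a.pcap | count_by_sender
```

Non-TCP lines (ARP, ICMP) are skipped on purpose, because they carry no `Flags` field and no port suffix to strip.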
What you’re looking for first:
- SYN, SYN-ACK, ACK: The TCP handshake. If service-a sends a SYN (request to connect) and never sees a SYN-ACK (acknowledgment and ready to connect) back from 10.0.2.10, then the SYN was dropped on the way by a firewall or router, service-b’s host never answered, or the SYN-ACK was lost on the return path. Check service-b-from-a.pcap to tell these apart: if the SYN never appears there, it was lost in transit.
  - Diagnosis: Filter Wireshark for `tcp.flags.syn == 1 and tcp.flags.ack == 0` (sent SYN) and `tcp.flags.syn == 1 and tcp.flags.ack == 1` (received SYN-ACK). See if the SYN from 10.0.1.5 has a corresponding SYN-ACK from 10.0.2.10.
  - Common Cause 1: Firewall Blockage. A firewall (network appliance, security group in cloud, iptables on the host) is silently dropping the SYN packets.
    - Fix: Check firewall rules on service-a’s host, service-b’s host, and any intermediate network devices. For iptables on service-a’s host, ensure there’s an ACCEPT rule for outgoing traffic on the target port (e.g., `sudo iptables -L OUTPUT -nv | grep 10.0.2.10`). On service-b’s host, check `sudo iptables -L INPUT -nv | grep 10.0.1.5`. If using cloud security groups, verify ingress rules on service-b and egress rules on service-a allow traffic on the specific port (e.g., TCP 8080).
    - Why it works: Firewalls act as gatekeepers; if no rule allows passage, packets are typically discarded without notification. Explicitly allowing the traffic permits the handshake.
  - Common Cause 2: Network Routing Issue. The packet is sent, but there’s no route from service-a’s network to service-b’s network, or a router is misconfigured.
    - Fix: On service-a’s host, run `traceroute 10.0.2.10` or `mtr 10.0.2.10`. If it times out before reaching service-b’s network, investigate the last known good hop’s routing configuration. This might involve checking the default gateway on service-a (`ip route show`) and peering configurations between routers.
    - Why it works: traceroute/mtr maps the path packets take. If the path breaks, it points to the router responsible for forwarding traffic to the next segment. Correcting its routing table or adjacency allows packets to flow.
  - Common Cause 3: Host service-b is Down or Unreachable. service-b’s host is offline, its network interface is down, or it’s simply not responding.
    - Fix: On service-b’s host, check if the network interface is up (`ip a`) and if the service-b process is running and listening on the correct port (`sudo ss -tulnp | grep 8080`). If the host is unreachable, check its power status and network connectivity.
    - Why it works: The SYN packet is a request to establish a connection. If the destination host is down, nothing can send a SYN-ACK back. If the host is up but no process is listening on the port, the kernel normally answers with a RST instead ("connection refused"), so silence points at the host or the path rather than the process.
  - Common Cause 4: Network Address Translation (NAT) Mismatch. If NAT is involved (e.g., in cloud environments or corporate networks), the source IP address might be rewritten incorrectly, or the NAT device might not have a correct mapping for the return traffic.
    - Fix: Examine NAT configurations on your edge routers or cloud NAT gateways. Ensure that the outbound traffic from service-a is being translated with the correct public/internal IP, and that return traffic has a valid NAT mapping to reach service-a. Use tcpdump on the NAT device itself to see how it’s rewriting addresses.
    - Why it works: NAT modifies IP addresses. If this modification is wrong or incomplete, return packets won’t find their way back to the original source.
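The handshake check can also be scripted. The following is a rough sketch rather than a definitive classifier: `handshake_status` and its output labels are made up for this example, and it assumes a capture that contains a single connection attempt, read as `tcpdump -nn -r` text.

```shell
#!/bin/sh
# Sketch: classify a connection attempt from tcpdump text on stdin.
#   $1 = client IP (sends the SYN), $2 = server IP
handshake_status() {
    awk -v c="$1" -v s="$2" '
    $2 == "IP" && $6 == "Flags" {
        src = $3
        sub(/\.[0-9]+$/, "", src)                 # strip the trailing .port
        if (src == c && $7 == "[S],")  syn++      # bare SYN from the client
        if (src == s && $7 == "[S.],") synack++   # SYN-ACK from the server
        if (src == s && $7 ~ /R/)      rst++      # any RST from the server
    }
    END {
        if (!syn)        print "no-syn"       # client never tried, or wrong capture
        else if (synack) print "answered"     # handshake got its SYN-ACK
        else if (rst)    print "refused"      # host is up, port closed or rejected
        else             print "unanswered"   # dropped, filtered, or host down
    }'
}

# Hypothetical usage:
#   tcpdump -nn -r /tmp/service-a-to-b.pcap | handshake_status 10.0.1.5 10.0.2.10
```

An `unanswered` result on service-a's capture combined with `no-syn` on service-b's capture suggests the SYN was lost between the two hosts.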
- Data Transfer (PSH, ACK): If the handshake completes (SYN, SYN-ACK, ACK all present) but data doesn’t seem to arrive at service-b (check service-b-from-a.pcap for data packets, `tcp.flags.push == 1`), or responses don’t arrive at service-a (check service-a-to-b.pcap for response data), the failure is in the data-transfer phase.
  - Diagnosis: Filter Wireshark for `tcp.flags.push == 1`. Look for the actual request payload from service-a and the response payload from service-b. You can also look at `tcp.analysis.retransmission` to see if packets are being resent, indicating loss.
  - Common Cause 5: Network Congestion/Packet Loss. High traffic volumes on the network path can lead to dropped packets, especially if buffers in routers or switches are full.
    - Fix: Analyze the capture for excessive retransmissions (`tcp.analysis.retransmission` in Wireshark). If seen, this indicates packet loss. The fix involves identifying and mitigating the congestion source. This might mean increasing bandwidth, optimizing traffic flow, or implementing Quality of Service (QoS) to prioritize critical traffic. For a quick test, try sending traffic during off-peak hours.
    - Why it works: Congestion causes intermediate devices to drop packets. Retransmissions are the TCP layer’s way of recovering from this loss, but if loss is persistent, it cripples performance. Reducing load or prioritizing traffic ensures packets get through.
  - Common Cause 6: Application-Level Hang/Slowdown on service-b. service-b receives the request but is too busy, deadlocked, or waiting on another resource to process it and send a response.
    - Diagnosis: On service-b’s host, capture each direction separately: `sudo tcpdump -i eth0 src host 10.0.1.5 and dst port 8080 -w /tmp/service-b-receive.pcap` and `sudo tcpdump -i eth0 dst host 10.0.1.5 and src port 8080 -w /tmp/service-b-send.pcap`. If service-b-receive.pcap shows the incoming request packet but service-b-send.pcap shows no outgoing response data after a significant delay, the bottleneck is within service-b.
    - Fix: Investigate service-b’s performance. Use application profiling tools, check its logs for errors or long-running operations, and monitor its CPU/memory usage (`top`, `htop`). The fix is application-specific: optimizing database queries, fixing deadlocks, scaling up service-b instances, or improving its internal algorithms.
    - Why it works: Network tools can only show packets arriving and departing. If packets arrive but no response leaves, the problem is no longer network-level; it’s within the application’s processing.
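To put a number on how long service-b takes to answer, the gap between the first request payload and the first response payload can be read straight from a capture taken on service-b's host. This is a minimal sketch under stated assumptions: `response_delay` is a hypothetical helper, the input comes from `tcpdump -tt -nn -r` (the `-tt` flag prints epoch timestamps), and the capture holds one request/response exchange. Several interleaved connections would need per-flow grouping that is not shown here.

```shell
#!/bin/sh
# Sketch: seconds between the first data packet from the client and the
# first data packet sent back to it.
#   $1 = client IP; stdin = `tcpdump -tt -nn -r FILE` text
response_delay() {
    awk -v c="$1" '
    $2 == "IP" && $6 == "Flags" {
        src = $3
        sub(/\.[0-9]+$/, "", src)            # strip the trailing .port
        len = 0
        for (i = 7; i < NF; i++)             # "length N" is not always last
            if ($i == "length") len = $(i + 1) + 0
        if (len == 0) next                   # skip handshake and pure ACKs
        if (src == c && !have_req) { t_req = $1; have_req = 1 }
        else if (src != c && have_req && !have_resp) { t_resp = $1; have_resp = 1 }
    }
    END {
        if (!have_req)       print "no-request"
        else if (!have_resp) print "no-response"
        else                 printf "%.3f\n", t_resp - t_req
    }'
}

# Hypothetical usage on service-b's host capture:
#   tcpdump -tt -nn -r /tmp/service-b-from-a.pcap | response_delay 10.0.1.5
```

A `no-response` result, or a delay of several seconds while ACKs still flow, points at the application rather than the network.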
- FIN, RST: Clean connection termination (FIN) or abrupt closure (RST). If you see RST packets unexpectedly, it means a connection was forcibly closed.
  - Diagnosis: Filter for `tcp.flags.reset == 1`.
  - Common Cause 7: Application Crash or Process Termination. If service-b crashes or its process is killed (e.g., by the OOM killer), the kernel closes its sockets. Peers then see RST packets, either immediately (for example, when the socket still held unread data) or as soon as they send more data on the dead connection.
    - Fix: Check system logs on service-b’s host for crash messages (e.g., `sudo journalctl -k | grep -i oom`, or check /var/log/syslog, /var/log/messages). The fix involves addressing the root cause of the crash (memory leak, bug, resource exhaustion).
    - Why it works: An RST packet is a direct instruction to terminate a connection immediately. When the application dies, the OS tears down its sockets and answers any further traffic for them with RSTs.
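Knowing which side ended the connection, and how, narrows the search quickly. As a rough sketch, the hypothetical `teardown_summary` helper below tallies FIN and RST packets per sender from `tcpdump -nn -r` text. It assumes IPv4 TCP lines and makes no attempt to separate individual connections.

```shell
#!/bin/sh
# Sketch: per-sender count of FIN (clean close) and RST (abrupt close)
# packets in tcpdump text read from stdin.
teardown_summary() {
    awk '$2 == "IP" && $6 == "Flags" {
        src = $3
        sub(/\.[0-9]+$/, "", src)   # strip the trailing .port
        seen[src] = 1
        if ($7 ~ /F/) fin[src]++
        if ($7 ~ /R/) rst[src]++
    }
    END {
        for (s in seen) printf "%s FIN=%d RST=%d\n", s, fin[s], rst[s]
    }' | sort
}

# Hypothetical usage:
#   tcpdump -nn -r /tmp/service-a-to-b.pcap | teardown_summary
```

RSTs coming only from 10.0.2.10 would be consistent with service-b's process dying or rejecting the connection. RSTs from an address that belongs to neither service hint at a middlebox.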
After fixing these, the next error you’ll likely hit is a "connection timed out" if the fault actually sits with the other service, or a "5xx Server Error" if service-b is now reachable but failing to process requests correctly.