A runbook is your incident response cheat sheet, but most are so poorly written they’re useless when you’re actually in a crisis.
Let’s say a critical service is failing intermittently: users see 500 errors, and internal service-to-service calls time out. The core issue is that the user-service is failing to connect to the profile-db due to network segmentation rules.
Common Causes and Fixes for user-service to profile-db Connection Failures
- Incorrect Network Policy (Most Common): A recent change to network policies has inadvertently blocked traffic from the `user-service` pods to the `profile-db` pods.
  - Diagnosis:

        kubectl get networkpolicies -n default | grep user-service
        kubectl get networkpolicies -n default | grep profile-db

    Look for policies that might be too restrictive on egress from `user-service` or ingress to `profile-db`.
  - Fix: Modify the existing `networkpolicy` for `user-service` to explicitly allow egress to the `profile-db` pods on the database port. For example, if your policy is named `user-service-egress-policy`:

        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        metadata:
          name: user-service-egress-policy
          namespace: default
        spec:
          podSelector:
            matchLabels:
              app: user-service
          policyTypes:
          - Egress
          egress:
          - to:
            - podSelector:
                matchLabels:
                  app: profile-db
            ports:
            - protocol: TCP
              port: 5432 # Or your PostgreSQL port

  - Why it works: This explicitly allows the necessary TCP traffic on port 5432 from pods labeled `app: user-service` to pods labeled `app: profile-db` within the `default` namespace. Network policies are additive, so the allow rule applies no matter which other egress policy caused the block. Two caveats: once a pod is selected by an egress policy, any egress that no policy allows is dropped, including DNS lookups, and if an ingress policy selects `profile-db`, that policy must also allow traffic from `user-service`.
- Service Discovery Issues (DNS/CoreDNS): The `user-service` cannot resolve the DNS name for `profile-db`, so connection attempts either fail outright or go to an invalid address.
  - Diagnosis: Exec into a `user-service` pod and try to `nslookup` or `dig` the `profile-db` service:

        kubectl exec -it <user-service-pod-name> -n default -- nslookup profile-db.default.svc.cluster.local

    If this fails or returns an incorrect IP, CoreDNS might be the culprit. Check CoreDNS pod logs:

        kubectl logs -n kube-system <coredns-pod-name>

  - Fix: If CoreDNS is unresponsive or has errors, restart the CoreDNS pods:

        kubectl delete pod <coredns-pod-name-1> <coredns-pod-name-2> -n kube-system

    Kubernetes will automatically recreate them.
  - Why it works: The replacement CoreDNS pods start with empty caches and fresh connections to the Kubernetes API and their upstream resolvers, which clears transient DNS resolution failures.
- Resource Exhaustion (CPU/Memory on DB Pod): The `profile-db` pod is overwhelmed, refusing new connections or becoming unresponsive.
  - Diagnosis: Check the resource utilization of the `profile-db` pod (this requires the metrics-server add-on):

        kubectl top pod <profile-db-pod-name> -n default

    Also, check the database logs for errors related to connection limits or resource starvation.
  - Fix: Increase the CPU and memory limits for the `profile-db` deployment. For example, if the deployment is named `profile-db-deployment`:

        # ... within the deployment spec.template.spec.containers section ...
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1000m" # Increased from e.g., "500m"
            memory: "2Gi" # Increased from e.g., "1Gi"

    If `profile-db` runs as a StatefulSet rather than a Deployment, make the same change in the StatefulSet's pod template. Then ensure it has enough replicas, keeping in mind that extra database replicas only add capacity if the database is configured for replication.
  - Why it works: Providing more resources allows the database to handle incoming connections and process queries without becoming overloaded, so it keeps accepting new connections.
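To judge how close the pod is to its limits, it can help to turn the quantities printed by `kubectl top pod` and the limits in the pod spec into plain numbers. The following is a minimal sketch, not an official parser: `parse_quantity` and `utilization` are hypothetical helper names, and only the common suffixes (`m`, `k`/`M`/`G`/`T`, `Ki`/`Mi`/`Gi`/`Ti`) are handled.

```python
from decimal import Decimal

# Multipliers for a subset of Kubernetes resource quantity suffixes.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40,
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "T": Decimal(10) ** 12,
    "m": Decimal("0.001"),  # milli, e.g. "500m" CPU is half a core
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as '500m', '2Gi' or '1' to a plain number."""
    text = text.strip()
    # Check two-letter suffixes first so that 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def utilization(used: str, limit: str) -> float:
    """Return usage as a fraction of the limit (1.0 means at the limit)."""
    return float(parse_quantity(used) / parse_quantity(limit))


# Example: a pod using 1536Mi of a 2Gi memory limit is at 75%.
memory_ratio = utilization("1536Mi", "2Gi")
```

A ratio that stays near 1.0 for memory, or sustained CPU usage at the limit, is a reasonable signal that the limits above are too low for the current load.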
- Database Connection Pool Exhaustion: The `profile-db` itself has reached its maximum configured number of client connections.
  - Diagnosis: Connect to the `profile-db` (e.g., using `psql` if it’s PostgreSQL) and run:

        SHOW max_connections;
        SELECT count(*) FROM pg_stat_activity; -- For PostgreSQL

    If `count(*)` is close to `max_connections`, this is the issue.
  - Fix: Increase `max_connections` in the `profile-db`’s PostgreSQL configuration (e.g., `postgresql.conf` or via `ALTER SYSTEM SET`). You must restart the `profile-db` pods for this change to take effect, because PostgreSQL only reads `max_connections` at server start.

        -- Example for PostgreSQL, requires access to config files or ALTER SYSTEM
        ALTER SYSTEM SET max_connections = 200;
        -- Restart the pods afterwards: SELECT pg_reload_conf() does not apply this parameter.

  - Why it works: Increasing the connection limit allows more simultaneous connections from applications like `user-service`, preventing them from being rejected once the limit is reached.
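"Close to `max_connections`" can be made concrete with a small calculation. The sketch below assumes PostgreSQL, where a few slots (`superuser_reserved_connections`, 3 by default) are held back from ordinary clients. The function names and the 90% threshold are illustrative assumptions, not values taken from this runbook.

```python
def connection_headroom(active: int, max_connections: int, reserved: int = 3) -> int:
    """Return how many more ordinary client connections the server can accept."""
    usable = max_connections - reserved
    return max(usable - active, 0)


def is_near_limit(
    active: int,
    max_connections: int,
    reserved: int = 3,
    threshold: float = 0.9,  # assumed alerting threshold, tune for your setup
) -> bool:
    """Return True when active connections use at least `threshold` of usable slots."""
    usable = max_connections - reserved
    if usable <= 0:
        return True
    return active / usable >= threshold


# Example: 95 active connections against max_connections = 100.
remaining = connection_headroom(95, 100)
near_limit = is_near_limit(95, 100)
```

On recent PostgreSQL versions `pg_stat_activity` also lists background processes, so the raw `count(*)` slightly overstates client connections. Treat the result as an upper bound.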
- Firewall/Security Group Blocking: External firewalls or cloud provider security groups are blocking traffic between the Kubernetes cluster’s nodes and the `profile-db` (if it’s external) or between nodes.
  - Diagnosis: Check cloud provider firewall rules (e.g., AWS Security Groups, Azure Network Security Groups) or on-premise firewall configurations. Test connectivity from a node running `user-service` to the `profile-db` IP/port using `telnet` or `nc`.

        # From a node in your cluster
        telnet <profile-db-external-ip> 5432

  - Fix: Update firewall/security group rules to allow TCP traffic on port 5432 from the Kubernetes node IP range or the specific egress IPs of your cluster to the `profile-db`.
  - Why it works: This ensures that the network path between the services is open at the infrastructure level, allowing the TCP connection to be established.
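When `telnet` or `nc` is not installed, the same connectivity test can be scripted. The sketch below uses only Python's standard `socket` module. `probe_tcp` is a hypothetical helper name, and the outcome labels are an illustrative convention rather than anything standard.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        # The host name could not be resolved at all.
        return "dns_failure"
    except socket.timeout:
        # No answer within the timeout: packets are probably being dropped.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on that port.
        return "refused"
    except OSError:
        # Any other network error, e.g. no route to host.
        return "error"


# Usage (the address is a placeholder, as in the telnet example):
# print(probe_tcp("<profile-db-external-ip>", 5432))
```

As a rough guide, `timeout` usually points at a firewall, security group, or network policy silently dropping packets, `refused` suggests the path is open but the database is not listening, and `dns_failure` points back at service discovery.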
- Incorrect Database Credentials/Authentication: The `user-service` is using incorrect credentials to connect to `profile-db`, leading to authentication failures that might be logged as connection issues.
  - Diagnosis: Check the logs of the `user-service` for authentication errors. Verify the database credentials stored in Kubernetes Secrets.

        kubectl get secret <user-service-db-secret-name> -n default -o yaml

    The values under `data` are base64-encoded. Decode them, then compare them with the actual credentials for `profile-db`.
  - Fix: Update the Kubernetes Secret with the correct database username and password, base64-encoding the new values before saving.

        kubectl edit secret <user-service-db-secret-name> -n default

    Then, restart the `user-service` pods to pick up the updated credentials.
  - Why it works: Using the correct credentials allows the `user-service` to successfully authenticate with the `profile-db`, resolving the authentication failures.
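Because Secret values are stored base64-encoded, comparing credentials by eye is error-prone. The sketch below decodes the `data` field of a Secret exported with `-o json` instead of `-o yaml`. `decode_secret_data` is a hypothetical helper, and the key names in the usage comment (`username`, `password`) are assumptions about how the Secret is laid out.

```python
import base64
import json


def decode_secret_data(secret_json: str) -> dict[str, str]:
    """Decode the base64 values in the `data` field of a Kubernetes Secret."""
    secret = json.loads(secret_json)
    data = secret.get("data") or {}
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }


# Usage, assuming the Secret was saved to a file first:
#   kubectl get secret <user-service-db-secret-name> -n default -o json > secret.json
#
# with open("secret.json") as handle:
#     credentials = decode_secret_data(handle.read())
# print(credentials["username"])  # hypothetical key name
```

Decoded values are plaintext credentials, so avoid printing them into shared terminals or logs, and delete any exported file once the comparison is done.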
After fixing the network policy, the next problem you might hit is user-service pods never becoming ready, because a new, more aggressive readiness probe configuration causes the health check to time out.