Respond to Production Incidents Faster with a Structured Runbook (2026)

A runbook is your incident response cheat sheet, but most are so poorly written they’re useless when you’re actually in a crisis.

Suppose a critical service is failing intermittently: users see 500 errors, and internal service-to-service calls time out. In this scenario the root cause is that user-service cannot connect to profile-db because of network segmentation rules, but the same symptoms have several other possible causes, so the runbook covers each in turn.

Common Causes and Fixes for `user-service` to `profile-db` Connection Failures

Incorrect Network Policy (Most Common): A recent change to network policies has inadvertently blocked traffic from the user-service pods to the profile-db pods.
- Diagnosis:
```
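# Policy names do not always mention the service, so list everything in the namespace
kubectl get networkpolicies -n default
# Show each policy's pod selector and its ingress/egress rules
kubectl describe networkpolicies -n default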
```
  Look for policies that might be too restrictive on egress from user-service or ingress to profile-db.
- Fix: Modify the existing NetworkPolicy for user-service to explicitly allow egress to the profile-db pods on the database port. For example, if your policy is named user-service-egress-policy:
```
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: user-service-egress-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: user-service
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: profile-db
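    ports:
    - protocol: TCP
      port: 5432 # Or your PostgreSQL port
  # Also allow DNS; once Egress is restricted, lookups fail unless port 53 is permitted
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53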
```
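- Why it works: This explicitly allows TCP traffic on port 5432 from pods labeled app: user-service to pods labeled app: profile-db within the default namespace. NetworkPolicy rules are additive allow-lists, so adding the rule restores the path that the earlier change removed. If another policy restricts ingress on profile-db, it needs a matching ingress rule for user-service.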
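
To confirm that the policy change actually reopened the path, a quick TCP probe from inside a user-service pod can help. The following is a minimal sketch, assuming the container image ships Python 3; `can_connect` is a hypothetical helper, and the host and port in the example call come from the policy above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution errors
        return False


# Example call, using the service name and port assumed in this runbook:
# can_connect("profile-db.default.svc.cluster.local", 5432)
```

A False result only says the connection did not complete in time. It does not distinguish a policy block from a DNS failure or a database that is down, so the remaining checks in this list still apply.
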
Service Discovery Issues (DNS/CoreDNS): The user-service cannot resolve the DNS name for profile-db, leading to connection attempts to an invalid address.
- Diagnosis: Exec into a user-service pod and try to nslookup or dig the profile-db service:
```
kubectl exec -it <user-service-pod-name> -n default -- nslookup profile-db.default.svc.cluster.local
```
  If this fails or returns an incorrect IP, CoreDNS might be the culprit. Check CoreDNS pod logs:
```
kubectl logs -n kube-system <coredns-pod-name>
```
- Fix: If CoreDNS is unresponsive or has errors, restart the CoreDNS pods:
```
kubectl delete pod <coredns-pod-name-1> <coredns-pod-name-2> -n kube-system
```
  Kubernetes will automatically recreate them.
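- Why it works: Fresh CoreDNS pods start with an empty cache, reload their configuration, and open new connections to the Kubernetes API and upstream resolvers, which clears transient resolution failures. If the errors return, the cause is more likely the CoreDNS configuration or upstream DNS than the pods themselves.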
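
Judging whether nslookup returned an "incorrect IP" is easier when the answer is compared against the Service's ClusterIP, as shown by `kubectl get service profile-db -n default`. Below is a rough sketch of that comparison, assuming Python 3.9+ in the pod and a regular (non-headless) Service; both helper names are hypothetical.

```python
import socket


def resolve_ipv4(hostname: str) -> list[str]:
    """Return the sorted, de-duplicated IPv4 addresses for hostname, or [] on failure."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port))
    return sorted({info[4][0] for info in infos})


def matches_cluster_ip(hostname: str, cluster_ip: str) -> bool:
    """Return True if DNS resolves hostname to the expected ClusterIP."""
    return cluster_ip in resolve_ipv4(hostname)


# Example call; the ClusterIP placeholder must be filled in from kubectl:
# matches_cluster_ip("profile-db.default.svc.cluster.local", "<cluster-ip>")
```

An empty list from `resolve_ipv4` points at DNS itself, while a non-matching address suggests a stale record or a lookup that resolved in the wrong namespace.
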
Resource Exhaustion (CPU/Memory on DB Pod): The profile-db pod is overwhelmed, refusing new connections or becoming unresponsive.
- Diagnosis: Check the resource utilization of the profile-db pod:
```
kubectl top pod <profile-db-pod-name> -n default
```
  Also, check the database logs for errors related to connection limits or resource starvation.
- Fix: Increase the CPU and memory limits for the profile-db deployment. For example, if the deployment is named profile-db-deployment:
```
# ... within the deployment spec.template.spec.containers section ...
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1000m" # Increased from e.g., "500m"
    memory: "2Gi" # Increased from e.g., "1Gi"
```
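  Apply the change and let the pod restart with the new limits. Whether profile-db runs as a Deployment or a StatefulSet, adding replicas only helps if the database is configured for replication, so treat scaling out as a separate decision.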
- Why it works: Providing more resources allows the database to handle incoming connections and process queries without becoming overloaded, thus accepting new connections.
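
The raw numbers from kubectl top are easier to act on as a percentage of the configured limits. The helpers below are an illustrative sketch with hypothetical names; they handle only the millicore CPU form and the binary memory suffixes (Ki, Mi, Gi, Ti) that kubectl top typically prints, not decimal suffixes such as M or G.

```python
_MEMORY_FACTORS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "250m" or "1" into cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a binary-suffixed memory quantity such as "512Mi" into bytes."""
    for suffix, factor in _MEMORY_FACTORS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # No suffix: the value is already in bytes


def utilization_percent(used: float, limit: float) -> float:
    """Return usage as a percentage of the limit, rounded to one decimal."""
    return round(100 * used / limit, 1)


# Usage reported by kubectl top, compared with the limits in the pod spec:
cpu_pct = utilization_percent(parse_cpu("480m"), parse_cpu("500m"))   # 96.0
mem_pct = utilization_percent(parse_memory("950Mi"), parse_memory("1Gi"))  # 92.8
```

Sustained values this close to the limit are a reasonable signal that CPU throttling or an out-of-memory kill is behind the refused connections.
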
Database Connection Pool Exhaustion: The profile-db itself has reached its maximum configured number of client connections.
- Diagnosis: Connect to the profile-db (e.g., using psql if it’s PostgreSQL) and run:
```
SHOW max_connections;
SELECT count(*) FROM pg_stat_activity; -- For PostgreSQL
```
  If count(*) is close to max_connections, this is the issue.
- Fix: Increase max_connections in the profile-db’s PostgreSQL configuration (e.g., postgresql.conf or via ALTER SYSTEM SET). This parameter is only read at server start, so a configuration reload is not enough; restart the profile-db pods for the change to take effect.
```
-- Example for PostgreSQL; requires superuser access or direct access to postgresql.conf
ALTER SYSTEM SET max_connections = 200;
-- pg_reload_conf() does not apply max_connections. Restart the profile-db pods instead.
```
- Why it works: Increasing the connection limit allows more simultaneous connections from applications like user-service, preventing them from being rejected due to pool exhaustion.
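
Comparing the two query results by eye is error-prone in the middle of an incident. As a sketch, the arithmetic looks like the following; the function names and the 90% threshold are assumptions, and `reserved` defaults to 3 to mirror PostgreSQL's default for superuser_reserved_connections.

```python
def connection_headroom(active: int, max_connections: int, reserved: int = 3) -> int:
    """Return how many more ordinary client connections the server can accept."""
    return max(max_connections - reserved - active, 0)


def is_near_limit(
    active: int, max_connections: int, reserved: int = 3, threshold: float = 0.9
) -> bool:
    """Return True when active connections use at least `threshold` of usable slots."""
    usable = max_connections - reserved
    return active >= threshold * usable


# With max_connections = 100 and 95 rows in pg_stat_activity:
remaining = connection_headroom(95, 100)  # 2
near_limit = is_near_limit(95, 100)       # True
```

If headroom stays near zero even after raising the limit, the application-side pool size in user-service is worth reviewing as well.
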
Firewall/Security Group Blocking: External firewalls or cloud provider security groups are blocking traffic between the Kubernetes cluster’s nodes and the profile-db (if it’s external) or between nodes.
- Diagnosis: Check cloud provider firewall rules (e.g., AWS Security Groups, Azure Network Security Groups) or on-premise firewall configurations. Test connectivity from a node running user-service to the profile-db IP/port using telnet or nc.
```
# From a node in your cluster
telnet <profile-db-external-ip> 5432
```
- Fix: Update firewall/security group rules to allow TCP traffic on port 5432 from the Kubernetes node IP range or the specific egress IPs of your cluster to the profile-db.
- Why it works: This ensures that the network path between the services is open at the infrastructure level, allowing the TCP connection to be established.
Incorrect Database Credentials/Authentication: The user-service is using incorrect credentials to connect to profile-db, leading to authentication failures that might be logged as connection issues.
- Diagnosis: Check the logs of the user-service for authentication errors. Verify the database credentials stored in Kubernetes Secrets.
```
kubectl get secret <user-service-db-secret-name> -n default -o yaml
```
  The values under data are base64-encoded, so decode them before comparing with the actual credentials for profile-db.
- Fix: Update the Kubernetes Secret with the correct database username and password. When editing the data field directly, the values must be base64-encoded.
```
kubectl edit secret <user-service-db-secret-name> -n default
```
  Then, restart the user-service pods to pick up the updated credentials.
- Why it works: With the correct credentials, user-service authenticates successfully against profile-db, so connections stop failing at login.
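
Decoding each value by hand is tedious when a Secret holds several keys. The sketch below decodes every entry at once, assuming the input is the output of `kubectl get secret ... -o json` and that the values are UTF-8 text; `decode_secret_data` is a hypothetical helper name.

```python
import base64
import json


def decode_secret_data(secret_json: str) -> dict[str, str]:
    """Decode every base64 value in the `data` field of a Secret manifest."""
    data = json.loads(secret_json).get("data") or {}
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


# Example call with JSON captured from kubectl:
# decode_secret_data(open("secret.json").read())
```

The result contains credentials in plain text, so avoid running it in a shared terminal or pasting the output into an incident channel.
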

After fixing the network policy, you may immediately hit a second problem: user-service pods fail to become ready because a new, more aggressive readiness probe configuration times out.
