A runbook is your incident response cheat sheet, but most are so poorly written they’re useless when you’re actually in a crisis.

Let’s say a critical service is failing intermittently: users see 500 errors, and internal service-to-service calls time out. In this scenario the root cause turns out to be network segmentation rules that stop user-service from connecting to profile-db, but the runbook below covers every common cause so you can rule each one in or out.

Common Causes and Fixes for user-service to profile-db Connection Failures

  1. Incorrect Network Policy (Most Common): A recent change to network policies has inadvertently blocked traffic from the user-service pods to the profile-db pods.

    • Diagnosis:
      kubectl get networkpolicies -n default | grep user-service
      kubectl get networkpolicies -n default | grep profile-db
      
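      These commands only match policies whose names contain user-service or profile-db. A policy with a different name can still select the same pods, so also inspect each policy’s podSelector with kubectl describe networkpolicy -n default. Look for policies that are too restrictive on egress from user-service or ingress to profile-db.
    • Fix: Modify the existing NetworkPolicy for user-service to explicitly allow egress to the profile-db pods on the database port. For example, if your policy is named user-service-egress-policy: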
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: user-service-egress-policy
        namespace: default
      spec:
        podSelector:
          matchLabels:
            app: user-service
        policyTypes:
        - Egress
        egress:
        - to:
          - podSelector:
              matchLabels:
                app: profile-db
          ports:
          - protocol: TCP
            port: 5432 # Or your PostgreSQL port
      
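    • Why it works: NetworkPolicy rules are additive, so this rule allows TCP traffic on port 5432 from pods labeled app: user-service to pods labeled app: profile-db within the default namespace. Any policy that selects profile-db for ingress must also allow traffic from user-service. Keep in mind that once an Egress policy selects the user-service pods, all other egress (including DNS lookups to CoreDNS on port 53) is denied unless another rule allows it.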
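
Grepping policy names can miss a policy that selects the same pods under a different name. As a rough sketch, the helper below lists every NetworkPolicy in a namespace next to its pod selector and policy types, using kubectl's custom-columns output. The function name list_policy_selectors is made up for illustration and is not part of kubectl.

```shell
# Hypothetical helper: show which pods each NetworkPolicy selects.
# Assumes kubectl is already pointed at the affected cluster.
list_policy_selectors() (
  ns="${1:-default}"   # namespace to inspect; falls back to "default"
  kubectl get networkpolicies -n "$ns" \
    -o custom-columns='NAME:.metadata.name,POD_SELECTOR:.spec.podSelector.matchLabels,TYPES:.spec.policyTypes'
)

# Usage: list_policy_selectors default
```

Policies that select pods with matchExpressions instead of matchLabels show <none> in the POD_SELECTOR column, so inspect those individually with kubectl describe.
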
  2. Service Discovery Issues (DNS/CoreDNS): The user-service cannot resolve the DNS name for profile-db, leading to connection attempts to an invalid address.

    • Diagnosis: Exec into a user-service pod and try to nslookup or dig the profile-db service:
      kubectl exec -it <user-service-pod-name> -n default -- nslookup profile-db.default.svc.cluster.local
      
      If this fails or returns an incorrect IP, CoreDNS might be the culprit. Check CoreDNS pod logs:
      kubectl logs -n kube-system <coredns-pod-name>
      
    • Fix: If CoreDNS is unresponsive or has errors, restart the CoreDNS pods:
      kubectl delete pod <coredns-pod-name-1> <coredns-pod-name-2> -n kube-system
      
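      The CoreDNS Deployment will automatically recreate them.
    • Why it works: Fresh CoreDNS pods start with an empty cache, reload their configuration, and open new connections to their upstream resolvers, which clears transient DNS resolution failures. If the errors come back, investigate the CoreDNS configuration and upstream DNS rather than restarting again.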
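
To decide whether a lookup returned an "incorrect IP", it helps to compare it against the Service's actual ClusterIP. The sketch below does that comparison. It assumes profile-db is a regular (non-headless) Service and that the user-service image ships getent; the function name dns_matches_cluster_ip is hypothetical.

```shell
# Hypothetical helper: compare the IP a pod resolves for a Service
# with the ClusterIP that the API server reports for that Service.
# Assumes a non-headless Service and a container image that includes getent.
dns_matches_cluster_ip() (
  pod="$1"; ns="$2"; svc="$3"
  expected=$(kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.clusterIP}')
  resolved=$(kubectl exec "$pod" -n "$ns" -- getent hosts "$svc.$ns.svc.cluster.local" \
    | awk '{print $1; exit}')
  echo "expected=$expected resolved=$resolved"
  # Succeed only when both lookups returned the same, non-empty address.
  [ -n "$expected" ] && [ "$expected" = "$resolved" ]
)

# Usage: dns_matches_cluster_ip <user-service-pod-name> default profile-db
```

A non-zero exit status with an empty resolved value points at DNS; matching values suggest DNS is fine and the problem lies further down the path.
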
  3. Resource Exhaustion (CPU/Memory on DB Pod): The profile-db pod is overwhelmed, refusing new connections or becoming unresponsive.

    • Diagnosis: Check the resource utilization of the profile-db pod:
      kubectl top pod <profile-db-pod-name> -n default
      
      Also, check the database logs for errors related to connection limits or resource starvation.
    • Fix: Increase the CPU and memory limits for the profile-db deployment. For example, if the deployment is named profile-db-deployment:
      # ... within the deployment spec.template.spec.containers section ...
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "1000m" # Increased from e.g., "500m"
          memory: "2Gi" # Increased from e.g., "1Gi"
      
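      Apply the change and let the pods roll. Adding replicas only helps if profile-db is deployed with replication (for example, read replicas in a StatefulSet); a single-primary database does not gain capacity from extra pods.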
    • Why it works: Providing more resources allows the database to handle incoming connections and process queries without becoming overloaded, thus accepting new connections.
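
kubectl top only shows current usage. If the container was already killed for exceeding its memory limit, the evidence is in the pod status, where Kubernetes records the termination reason OOMKilled. A minimal sketch of that check follows; the function name was_oom_killed is hypothetical.

```shell
# Hypothetical helper: report whether any container in a pod was last
# terminated with reason "OOMKilled" (memory limit exceeded).
was_oom_killed() (
  pod="$1"; ns="${2:-default}"
  reasons=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}')
  case "$reasons" in
    *OOMKilled*) echo "OOMKilled" ;;
    *)           echo "no OOM kill recorded"; false ;;
  esac
)

# Usage: was_oom_killed <profile-db-pod-name> default
```
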
  4. Database Connection Pool Exhaustion: The profile-db itself has reached its maximum configured number of client connections.

    • Diagnosis: Connect to the profile-db (e.g., using psql if it’s PostgreSQL) and run:
      SHOW max_connections;
      SELECT count(*) FROM pg_stat_activity; -- For PostgreSQL
      
      If count(*) is close to max_connections, this is the issue.
    • Fix: Increase max_connections in the profile-db’s PostgreSQL configuration (e.g., postgresql.conf or via ALTER SYSTEM SET). PostgreSQL reads max_connections only at server start, so a configuration reload is not enough: you must restart the profile-db pods for the change to take effect.
      # Example for PostgreSQL, requires access to config files or ALTER SYSTEM
      # ALTER SYSTEM SET max_connections = 200;
      # Then restart the profile-db pods; pg_reload_conf() does not apply this setting
      
    • Why it works: Increasing the connection limit allows more simultaneous connections from applications like user-service, so new connections are no longer rejected once the limit is reached. Each connection uses server memory, so check the pod’s memory limit before raising the value significantly.
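
"Close to max_connections" is easier to act on as a percentage. The sketch below computes it from the two numbers returned by the diagnosis queries. The function name conn_usage_pct and the 80% alert threshold in the commented usage are assumptions for illustration, not values from PostgreSQL.

```shell
# Hypothetical helper: percentage of max_connections currently in use.
conn_usage_pct() {
  # $1 = current connection count, $2 = max_connections (both integers)
  # Integer arithmetic, so the result is rounded down.
  echo $(( $1 * 100 / $2 ))
}

# Example wiring, assuming psql can reach profile-db via the usual PG* environment variables:
#   current=$(psql -At -c "SELECT count(*) FROM pg_stat_activity")
#   max=$(psql -At -c "SHOW max_connections")
#   pct=$(conn_usage_pct "$current" "$max")
#   [ "$pct" -ge 80 ] && echo "profile-db connections at ${pct}% of max_connections"
```
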
  5. Firewall/Security Group Blocking: External firewalls or cloud provider security groups are blocking traffic between the Kubernetes cluster’s nodes and the profile-db (if it’s external) or between nodes.

    • Diagnosis: Check cloud provider firewall rules (e.g., AWS Security Groups, Azure Network Security Groups) or on-premise firewall configurations. Test connectivity from a node running user-service to the profile-db IP/port using telnet or nc.
      # From a node in your cluster
      telnet <profile-db-external-ip> 5432
      
    • Fix: Update firewall/security group rules to allow TCP traffic on port 5432 from the Kubernetes node IP range or the specific egress IPs of your cluster to the profile-db.
    • Why it works: This ensures that the network path between the services is open at the infrastructure level, allowing the TCP connection to be established.
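
When a firewall silently drops packets, an interactive telnet session can hang for a long time. As a sketch, the wrapper below runs the same connectivity test with a bounded wait and a usable exit status. It assumes a netcat build that supports -z (connect without sending data) and -w (timeout in seconds); the function name port_open is hypothetical.

```shell
# Hypothetical helper: test a TCP port with a bounded wait.
# Assumes a netcat build that supports -z and -w.
port_open() {
  # $1 = host, $2 = port, $3 = timeout in seconds (defaults to 3)
  nc -z -w "${3:-3}" "$1" "$2"
}

# Usage: port_open <profile-db-external-ip> 5432 && echo "reachable" || echo "blocked or down"
```

A connection that is refused immediately usually means the host is reachable but nothing accepts traffic on that port, while a timeout is more typical of a firewall or security group dropping the packets.
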
  6. Incorrect Database Credentials/Authentication: The user-service is using incorrect credentials to connect to profile-db, leading to authentication failures that might be logged as connection issues.

    • Diagnosis: Check the logs of the user-service for authentication errors. Verify the database credentials stored in Kubernetes Secrets.
      kubectl get secret <user-service-db-secret-name> -n default -o yaml
      
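      The values under data are base64-encoded, so decode them before comparing with the actual credentials for profile-db.
    • Fix: Update the Kubernetes Secret with the correct database username and password. When editing the data section directly, the new values must be base64-encoded.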
      kubectl edit secret <user-service-db-secret-name> -n default
      
      Then, restart the user-service pods to pick up the updated credentials.
    • Why it works: Using the correct credentials allows the user-service to successfully authenticate with the profile-db, resolving the authentication failures that were surfacing as connection errors.
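
Secret values are stored base64-encoded, so comparing credentials means decoding them first. The sketch below reads one key from a Secret and decodes it. The function name secret_value and the key name password in the usage line are assumptions; substitute the key your Secret actually uses. It also assumes a simple key name without dots.

```shell
# Hypothetical helper: print the decoded value of one key in a Secret.
# Assumes the key name contains no dots (those need escaping in jsonpath).
secret_value() (
  name="$1"; ns="$2"; key="$3"
  kubectl get secret "$name" -n "$ns" -o "jsonpath={.data.$key}" | base64 --decode
)

# Usage: secret_value <user-service-db-secret-name> default password
```

Decoded credentials land in your terminal scrollback, so run this only where that is acceptable during the incident.
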

After fixing the network policy, the next problem you may hit is user-service pods failing to become ready: a new, more aggressive readiness probe configuration can time out before the service has finished starting.
