Datadog’s Incident Management and Postmortems are more than just tools for tracking outages; they’re a structured way to understand system behavior under duress and learn from it.

Let’s see how this looks in practice. Imagine a critical service, user-auth, starts throwing 5xx errors and logging entries like this one:

{
  "timestamp": "2023-10-27T10:30:01Z",
  "status": 500,
  "error": "Internal Server Error",
  "message": "Database connection pool exhausted",
  "service": "user-auth",
  "host": "auth-worker-03",
  "trace_id": "a1b2c3d4e5f6"
}

Once log collection is set up for the service, Datadog ingests this entry. If we have an alert configured for status:5xx on user-auth, that alert can trigger an incident.

Datadog Incident Creation:

  • Alert Triggered: The 5xx error rate for user-auth breaches the threshold (e.g., >5% over 5 minutes).
  • Incident Created: Datadog can be configured to open an incident when an alert triggers. The incident pulls in:
    • The triggering alert(s).
    • Related logs and metrics from the time of the alert.
    • A dedicated incident channel in Slack (if integrated).
    • A placeholder for the postmortem document.
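The threshold check behind such an alert can be sketched in a few lines of Python. This is only an illustration of the arithmetic (share of 5xx responses in a 5-minute window compared against 5%), not how Datadog evaluates alerts internally. The function names are hypothetical, and the log records are assumed to have the same timestamp and status fields as the example entry above.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    # Parse timestamps shaped like "2023-10-27T10:30:01Z" into aware datetimes.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def error_rate(logs, now, window=timedelta(minutes=5)):
    """Fraction of log records in (now - window, now] with a 5xx status."""
    recent = [r for r in logs if now - window < parse_ts(r["timestamp"]) <= now]
    if not recent:
        return 0.0  # no traffic in the window, so nothing to alert on
    errors = sum(1 for r in recent if 500 <= r["status"] <= 599)
    return errors / len(recent)


def should_alert(logs, now, threshold=0.05):
    # Mirrors the ">5% over 5 minutes" example condition.
    return error_rate(logs, now) > threshold
```

With nine successful requests and one 500 inside the window, the rate is 10%, which is above the 5% threshold, so should_alert returns True.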

The Incident Dashboard:

Once an incident is created, Datadog provides a dedicated dashboard. It updates as the incident unfolds rather than showing a static snapshot.

  • Timeline: All events – alerts firing, alerts resolving, manual notes added, dashboard links, code deploys – are plotted chronologically.
  • Key Metrics: Graphs for user-auth error rates, latency, CPU/memory usage on affected hosts, and database connection counts are pinned.
  • Related Services: If user-auth depends on profile-db, Datadog will show correlated metrics for profile-db to help identify upstream or downstream impacts.
  • Team Assignment: You can assign an incident commander and responders directly within Datadog.

Investigating with Datadog:

The incident commander uses the dashboard to guide the investigation.

  1. Identify the Scope: Is it just user-auth? Are other services affected? Datadog’s service map can show dependencies and highlight unhealthy services.
  2. Pinpoint the Root Cause:
    • Logs: Search logs for user-auth around the incident start time. We see the Database connection pool exhausted message.
    • Metrics: Check the profile-db’s active connections metric. It’s maxed out at 200. The user-auth service’s database connection count is also maxed out.
    • Traces: If distributed tracing is enabled, examine traces for requests to user-auth during the incident. A trace might show that a specific type of query is holding connections open for too long.

The Mental Model:

Datadog Incident Management treats an incident not as a discrete event, but as a time-bound period of degraded performance that needs active management and a structured follow-up.

  • Alerting: The trigger. Datadog’s alerting engine, with its suppression rules and multi-alert conditions, is meant to notify you without excessive noise.
  • Incident Room: The central hub. This is the virtual war room, consolidating all relevant context – logs, metrics, traces, dashboards, and communication – into a single, chronological view. It’s designed to minimize context switching during high-pressure situations.
  • Responder Roles: Datadog allows assigning roles (e.g., Incident Commander, Comms Lead, Technical Lead) to ensure clear responsibilities.
  • Postmortem: The learning phase. Every incident must have a postmortem. Datadog provides templates and integrates with tools like Google Docs or Confluence to facilitate this. The postmortem isn’t just a recap; it’s an analysis of:
    • Timeline: What happened, when.
    • Impact: What was the user-facing effect and duration?
    • Root Cause(s): Why did it happen?
    • Resolution: How was it fixed?
    • Lessons Learned: What can be improved?
    • Action Items: Concrete tasks to prevent recurrence, each with an assigned owner and a due date.

The "Why" Behind "Database connection pool exhausted":

In our example, the user-auth service likely has a connection pool configured for its database. If a bug causes queries to take an unusually long time, or if the pool size is too small for the current load, connections can get stuck in use. When all connections are held, new requests to user-auth that require database access will fail, leading to 5xx errors. Datadog’s ability to correlate the user-auth errors with the profile-db connection saturation is key.
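The failure mode is easy to reproduce in miniature. The sketch below is a toy pool built on a semaphore, not user-auth’s actual pool. The class and function names are hypothetical, and the pool size and timeout are chosen only to keep the demo fast.

```python
import threading


class PoolExhausted(Exception):
    pass


class ConnectionPool:
    """Toy pool: each semaphore slot stands in for one database connection."""

    def __init__(self, size, timeout_s):
        self._slots = threading.BoundedSemaphore(size)
        self._timeout_s = timeout_s

    def acquire(self):
        # Wait up to timeout_s for a free connection, then give up.
        if not self._slots.acquire(timeout=self._timeout_s):
            raise PoolExhausted("Database connection pool exhausted")

    def release(self):
        self._slots.release()


def handle_request(pool):
    """Return an HTTP-like status: 500 if no connection frees up in time."""
    try:
        pool.acquire()
    except PoolExhausted:
        return 500
    try:
        return 200  # a real handler would run its query here
    finally:
        pool.release()


pool = ConnectionPool(size=2, timeout_s=0.05)
pool.acquire()
pool.acquire()                    # two slow queries now hold every connection
blocked = handle_request(pool)    # times out waiting, so this is 500
pool.release()                    # one slow query finally completes
recovered = handle_request(pool)  # a connection is free again, so this is 200
```

Nothing is wrong with the request that fails. It is simply queued behind connections that never come back in time, which is why the errors show up in user-auth while the cause sits in query duration or pool sizing.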

The One Thing Most People Don’t Know:

Datadog incidents can automatically correlate code deploy events with the start of an incident, provided those deploy events are reported to Datadog. If a new deployment of user-auth or profile-db occurred just before the 5xx errors began, Datadog will highlight this correlation on the incident timeline. This makes it much faster to judge whether a recent change is the likely culprit, without a lengthy manual search through deployment logs.
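The underlying idea is a time-window match. Here is a rough Python sketch of it, assuming deploy events are available as simple records. The function name, record fields, version labels, and the 30-minute lookback are all illustrative assumptions, not Datadog’s implementation.

```python
from datetime import datetime, timedelta


def suspect_deploys(deploys, incident_start, lookback=timedelta(minutes=30)):
    """Deploys that landed within `lookback` before the incident, newest first."""
    hits = [
        d for d in deploys
        if incident_start - lookback <= d["at"] <= incident_start
    ]
    return sorted(hits, key=lambda d: d["at"], reverse=True)


# Hypothetical deploy history for the two services in the example.
deploys = [
    {"service": "profile-db", "version": "v1", "at": datetime(2023, 10, 27, 8, 0)},
    {"service": "user-auth", "version": "v2", "at": datetime(2023, 10, 27, 10, 22)},
]
incident_start = datetime(2023, 10, 27, 10, 30)
suspects = suspect_deploys(deploys, incident_start)
# Only the user-auth deploy, 8 minutes before the incident, falls in the window.
```

A match like this is a lead, not proof. The deploy still needs to be checked against the logs and traces before a rollback is declared the fix.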

Fixing the Incident:

  1. Immediate Mitigation:
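    • Option A (Quickest): Scale up the profile-db read replicas or increase the maximum connections if the database allows it dynamically. Example for PostgreSQL: ALTER SYSTEM SET max_connections = 400; Note that PostgreSQL only applies a new max_connections value after a server restart, so there it is not a live change. Where the limit can be raised, this buys time.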
    • Option B (If a bug is suspected): Trigger an immediate rollback of the latest user-auth deployment if correlation suggests it.
  2. Root Cause Analysis & Permanent Fix:
    • Investigate Slow Queries: Use Datadog APM to analyze traces and identify which queries are taking too long. Optimize those queries or add appropriate database indexes.
    • Tune Connection Pool: Adjust user-auth’s connection pool size and timeout settings based on observed load and query performance. For example, configure the pool in user-auth’s application code or configuration file: spring.datasource.hikari.maximum-pool-size=150 and spring.datasource.hikari.connection-timeout=30000 (the timeout is in milliseconds). Keep the combined pool size across all user-auth instances below the database’s own connection limit.
    • Resource Scaling: If load is the issue, provision more database resources or route read-only queries to read replicas.
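When choosing a pool size, a useful first estimate comes from Little’s law: connections in use ≈ request rate × average time each request holds a connection. The sketch below applies it with integer arithmetic. The function name, the traffic figures, and the 20% headroom are assumptions for illustration; only the 200-connection limit comes from the example above.

```python
def required_connections(requests_per_s, avg_hold_ms, headroom_pct=20):
    """Estimate the connections needed, rounded up, with some headroom.

    Little's law: in-flight = arrival rate * time held. Integer math with
    ceiling division avoids floating-point rounding surprises.
    """
    numerator = requests_per_s * avg_hold_ms * (100 + headroom_pct)
    return (numerator + 99_999) // 100_000  # divide by 1000 ms and by 100 %


healthy = required_connections(requests_per_s=500, avg_hold_ms=20)    # 12
degraded = required_connections(requests_per_s=500, avg_hold_ms=400)  # 240
```

In this hypothetical, the same traffic that needs about 12 connections when queries take 20 ms needs about 240 when they take 400 ms, which is already past a 200-connection limit. That is why fixing the slow query usually matters more than enlarging the pool.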

The next step is to ensure your alerting is sensitive enough to catch connection pool saturation before it impacts users.
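One way to think about such an alert is as a utilization check with a warning level below the hard limit. The sketch below is a hypothetical illustration of that logic; the function name and the 80% and 95% thresholds are assumptions to be tuned against your own traffic, not Datadog defaults.

```python
def pool_status(active, max_size, warn_at=0.8, critical_at=0.95):
    """Classify pool utilization so a warning fires before exhaustion."""
    utilization = active / max_size
    if utilization >= critical_at:
        return "critical"
    if utilization >= warn_at:
        return "warn"
    return "ok"


# Against a limit of 200 connections:
normal = pool_status(100, 200)      # 50% used
early = pool_status(170, 200)       # 85% used, worth a warning
exhausted = pool_status(200, 200)   # 100% used, requests are about to fail
```

The point of the warning tier is lead time: at 85% utilization users are still being served, so responders can act before the 5xx errors start.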
