Dynatrace Site Reliability Guardian (SRG) automatically validates software releases. Its real value lies in shifting validation from a reactive, post-deployment scramble to a proactive, integrated part of the release pipeline.

Let’s see it in action. Imagine a simple microservice deployment. We’re pushing a new version of our user-service.

Here’s a snippet of what SRG might be looking at, simplified for clarity. This isn’t a command, but a representation of the configuration SRG uses to define a "good" release:

release_validation:
  name: user-service-v1.2.0
  service: user-service
  environment: production
  checks:
    - type: metric
      name: p95_latency
      threshold: 500ms
      operator: less_than
      duration: 5m
    - type: error_rate
      name: http_5xx_count
      threshold: 0
      operator: equals
      duration: 5m
    - type: throughput
      name: request_count
      threshold: 1000
      operator: greater_than
      duration: 5m

When a new version of user-service is deployed to production, SRG, integrated with Dynatrace, starts evaluating these metrics. If p95_latency is not below 500ms over the 5-minute evaluation window, if any HTTP 5xx errors are recorded, or if request_count fails to exceed 1000, SRG flags the release as unstable. It doesn’t just report the failure; it can be configured to automatically trigger a rollback.
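To make the pass/fail logic concrete, here is a minimal Python sketch of how checks like the ones above could be evaluated against values aggregated over the 5-minute window. It is not SRG's implementation or API: the `Check` class, the `evaluate` function, and the observed numbers are all hypothetical. The 5xx check is written as "equals 0" so that any server error fails the release.

```python
import operator
from dataclasses import dataclass

# Map the operator names from the illustrative config to Python comparisons.
OPERATORS = {
    "less_than": operator.lt,
    "greater_than": operator.gt,
    "equals": operator.eq,
}


@dataclass(frozen=True)
class Check:
    """One validation rule (hypothetical structure, not an SRG type)."""
    name: str
    threshold: float
    operator: str


def evaluate(checks, observed):
    """Return the names of checks whose observed value violates the rule."""
    failed = []
    for check in checks:
        compare = OPERATORS[check.operator]
        if not compare(observed[check.name], check.threshold):
            failed.append(check.name)
    return failed


checks = [
    Check("p95_latency", 500, "less_than"),      # milliseconds
    Check("http_5xx_count", 0, "equals"),        # any 5xx fails the release
    Check("request_count", 1000, "greater_than"),
]

# Made-up values, as if aggregated over the 5-minute window.
observed = {"p95_latency": 620.0, "http_5xx_count": 0, "request_count": 4200}

failed = evaluate(checks, observed)
print("unstable" if failed else "stable", failed)
```

With these sample values only the latency check fails, so the release would be flagged as unstable.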

The problem SRG solves is the inherent risk and manual effort in validating deployments. Traditionally, after a release, SREs would be glued to dashboards, manually checking key performance indicators (KPIs) and error logs. This is inefficient, error-prone, and often too late to prevent customer impact. SRG automates this, providing objective, data-driven validation.

Internally, SRG leverages Dynatrace’s deep observability. It queries Dynatrace for specific metrics, traces, and logs related to the service and release in question. It uses a declarative configuration (like the YAML above) to define what constitutes a successful deployment. This configuration acts as a Service Level Objective (SLO) for the release itself.

The exact levers you control are the checks. Each check has a type (metric, error rate, throughput, custom event, etc.), a name (referencing a Dynatrace metric or a custom event identifier), a threshold, an operator (less_than, greater_than, equals), and a duration over which the check is evaluated. You can define multiple checks, creating a comprehensive validation strategy. For instance, you might add a check for a specific custom event indicating a critical business transaction is completing successfully.
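Because every check carries the same handful of fields, a definition can be sanity-checked before it is ever used. The sketch below assumes the simplified field names from this article (type, name, threshold, operator, duration); the `validate_check` and `parse_duration` helpers are hypothetical and are not part of any Dynatrace tooling.

```python
REQUIRED_FIELDS = ("type", "name", "threshold", "operator", "duration")
KNOWN_OPERATORS = {"less_than", "greater_than", "equals"}
DURATION_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a duration such as '5m' into seconds."""
    if len(text) < 2:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = text[:-1], text[-1]
    if unit not in DURATION_UNITS or not value.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return int(value) * DURATION_UNITS[unit]


def validate_check(check: dict) -> list[str]:
    """Return the problems found in one check; an empty list means valid."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in check]
    if "operator" in check and check["operator"] not in KNOWN_OPERATORS:
        problems.append(f"unknown operator: {check['operator']}")
    if "duration" in check:
        try:
            parse_duration(str(check["duration"]))
        except ValueError as exc:
            problems.append(str(exc))
    return problems


good = {"type": "metric", "name": "p95_latency", "threshold": "500ms",
        "operator": "less_than", "duration": "5m"}
bad = {"type": "metric", "name": "p95_latency",
       "operator": "at_most", "duration": "5x"}

print(validate_check(good))
print(validate_check(bad))
```

Catching a misspelled operator or a malformed duration at this point is cheaper than discovering it when a release is already being evaluated.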

What most people don’t realize is that SRG can also validate negative conditions. Instead of just checking if error rates are low, you can configure it to check if a specific known failure mode (e.g., a particular exception in logs, or a specific error code from a downstream dependency) is absent. This is powerful for ensuring that a release hasn’t introduced a known, previously resolved bug.
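A negative condition boils down to "this signature must not appear." The following Python sketch illustrates the idea on a list of log lines. The log content, the exception name, and the `failure_mode_absent` helper are invented for illustration; a real setup would query logs from Dynatrace rather than use an in-memory list.

```python
def failure_mode_absent(log_lines, signature):
    """Pass (True) only if no log line contains the known failure signature."""
    return not any(signature in line for line in log_lines)


# Hypothetical log lines collected during the validation window.
logs = [
    "INFO  user-service started",
    "WARN  slow response from profile-cache",
    "ERROR NullPointerException in UserMapper.toDto",
]

# A previously fixed bug has resurfaced, so this check fails.
regression_free = failure_mode_absent(logs, "NullPointerException in UserMapper")

# This failure mode does not appear, so this check passes.
no_resets = failure_mode_absent(logs, "ConnectionResetException")

print(regression_free, no_resets)
```

The check passes on absence, so an empty log window also passes. In practice you would pair it with a throughput check to confirm the service was actually handling traffic.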

Once a release is validated by SRG, it can automatically signal downstream systems, like a CI/CD pipeline, to proceed with the rollout, or even trigger the next stage of a canary deployment. This creates a truly automated and safe release process.
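As a rough sketch of how a pipeline might act on the validation result, the Python function below maps a pass/fail outcome to the next canary step. The stage percentages, the `next_action` function, and the action names are assumptions made for this example; they do not describe how SRG or any particular CI/CD system signals results.

```python
# Hypothetical canary stages, expressed as percent of traffic on the new version.
CANARY_STAGES = [5, 25, 50, 100]


def next_action(current_percent, validation_passed):
    """Decide what the pipeline does after a validation run.

    Returns an (action, traffic_percent) tuple.
    """
    if not validation_passed:
        # A failed validation sends all traffic back to the old version.
        return ("rollback", 0)
    later_stages = [p for p in CANARY_STAGES if p > current_percent]
    if not later_stages:
        # Already at the final stage: the rollout is done.
        return ("complete", current_percent)
    return ("promote", later_stages[0])


print(next_action(5, True))
print(next_action(25, False))
print(next_action(100, True))
```

Keeping the decision in one small function means every stage of the rollout applies the same rule: promote on a pass, roll back on a failure.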

The next logical step after mastering SRG’s release validation is exploring its capabilities for automated incident response.
