The most surprising thing about Service Level Objectives (SLOs) is that they aren’t about meeting a target, but about managing risk. If you’re consistently hitting 99.99% availability when your users would be satisfied with less, you’re likely over-provisioning and wasting money.
Let’s see how this plays out with a concrete example. Imagine we’re tracking an e-commerce checkout service.
{
  "serviceId": "checkout-service-prod",
  "metric": "request_success_rate",
  "timeframe": {
    "from": "now-1h",
    "to": "now"
  },
  "threshold": {
    "op": "less_than",
    "value": 0.995
  },
  "evaluationType": "rolling",
  "warningThreshold": {
    "op": "less_than",
    "value": 0.999
  }
}
This JSON defines an SLO for our checkout-service-prod. It tracks the request_success_rate over the last hour. If the success rate drops below 99.5%, the SLO is violated. A warning is triggered if it drops below 99.9%. The evaluationType: "rolling" means the SLO is continuously evaluated over the defined timeframe, not just at specific intervals.
The Problem SLOs Solve
Before SLOs, we often had vague "availability" goals. "We need to be up 99.9% of the time." This is a Service Level Agreement (SLA) – a promise to a customer. But what does that mean for the engineering team? How do we translate that customer promise into actionable engineering targets? Without a clear, measurable objective tied to a specific metric, teams end up either over-engineering to be safe (and expensive) or under-engineering and risking SLA breaches. SLOs bridge this gap by defining measurable, achievable targets for internal teams, directly supporting the broader SLA.
How SLOs Work Internally (Dynatrace)
Dynatrace collects a firehose of metrics from your services. For our checkout service, it’s tracking every request, its success or failure, and the latency. When you define an SLO, Dynatrace doesn’t just store these metrics; it actively evaluates them against your defined thresholds.
- Data Ingestion: Dynatrace OneAgent instances continuously send metrics (like request counts, error codes, response times) to the Dynatrace platform.
- Metric Aggregation: For the request_success_rate, Dynatrace calculates (total_requests - error_requests) / total_requests over the specified timeframe.
- Threshold Evaluation: The platform compares the calculated request_success_rate against the threshold (0.995) and warningThreshold (0.999).
- Status Update: The SLO status (OK, WARNING, CRITICAL) is updated in near real-time. If the rolling 1-hour success rate dips below 0.995, the SLO enters a CRITICAL state.
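The aggregation and threshold steps above can be sketched in a few lines of Python. This is only an illustration of the logic, not Dynatrace's implementation: the names `SloStatus`, `success_rate`, and `classify` are hypothetical, and treating a window with no traffic as fully successful is an assumption.

```python
from enum import Enum


class SloStatus(Enum):
    OK = "OK"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"


def success_rate(total_requests: int, error_requests: int) -> float:
    """(total_requests - error_requests) / total_requests for one window."""
    if total_requests == 0:
        # Assumption: a window with no traffic has no failures.
        return 1.0
    return (total_requests - error_requests) / total_requests


def classify(rate: float,
             threshold: float = 0.995,
             warning_threshold: float = 0.999) -> SloStatus:
    """Map a success rate to a status using the two thresholds from the example."""
    if rate < threshold:
        return SloStatus.CRITICAL
    if rate < warning_threshold:
        return SloStatus.WARNING
    return SloStatus.OK


if __name__ == "__main__":
    # 4 failures in 1000 requests is below the warning line but above the SLO.
    print(classify(success_rate(1000, 4)).value)
```

Checking the main threshold first ensures a rate that is below both thresholds is reported as CRITICAL rather than WARNING.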
Levers You Control
The key levers for managing SLOs in Dynatrace are:
- The Metric: What are you actually measuring? Is it request success rate, latency, throughput, or something else? Choose metrics that directly reflect user experience and business impact. For example, measuring "API calls processed" might be less useful than measuring "successful checkout completions."
- The Timeframe: How long is your evaluation window? A short window (e.g., 5 minutes) reacts quickly and recovers quickly after an incident, but it is noisy and less representative of long-term reliability. A long window (e.g., 30 days) is more stable but slower to react to transient issues. now-1h, now-7d, and now-30d are common.
- The Threshold: What is the acceptable level of "badness"? This is the core of your risk management. A 99.5% success rate target tolerates up to 5 failed requests in every 1000. Is that acceptable for your checkout service?
- The Warning Threshold: This provides an early alert before you breach the main SLO, giving your team time to investigate and remediate before it impacts users significantly.
- Evaluation Type: rolling (continuous evaluation over the timeframe) vs. periodic (evaluation at fixed intervals). Rolling is generally preferred for real-time operational awareness.
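To make the timeframe and rolling-evaluation levers concrete, here is a minimal sketch of a rolling window in Python. It assumes events arrive in timestamp order and that an empty window counts as fully successful. The class name `RollingSuccessRate` and its methods are hypothetical, not part of any Dynatrace API.

```python
from collections import deque


class RollingSuccessRate:
    """Success rate over the most recent `window_seconds` of requests."""

    def __init__(self, window_seconds: float = 3600.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, succeeded), oldest first
        self._failures = 0

    def record(self, timestamp: float, succeeded: bool) -> None:
        # Assumption: timestamps are non-decreasing.
        self._events.append((timestamp, succeeded))
        if not succeeded:
            self._failures += 1

    def rate(self, now: float) -> float:
        # Drop events that have slid out of the window before computing.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            _, succeeded = self._events.popleft()
            if not succeeded:
                self._failures -= 1
        total = len(self._events)
        if total == 0:
            return 1.0  # assumption: no traffic means no failures
        return (total - self._failures) / total
```

Because old events are evicted on every call to `rate`, a failure stops affecting the result as soon as it is older than the window. A periodic evaluation would instead compute the rate once per fixed interval and hold that value until the next one.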
The most effective way to define a good SLO is to start with your SLI (Service Level Indicator) – the raw metric. If your SLI is request_success_rate, with a request counted as failed when http_status_code >= 400 for a specific service endpoint, your SLO might be request_success_rate(over_last_7_days) >= 99.9%. This moves from a raw measurement to a statement of acceptable performance.
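As a rough illustration of the step from SLI to SLO, the sketch below derives request_success_rate from a batch of HTTP status codes and compares it with a target. The function names are hypothetical, and counting every status code of 400 or above as a failure follows the definition used here; some teams count only 5xx responses.

```python
from typing import Iterable


def request_success_rate(status_codes: Iterable[int]) -> float:
    """SLI: fraction of requests whose HTTP status code is below 400."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed in the window")
    failures = sum(1 for code in codes if code >= 400)
    return (len(codes) - failures) / len(codes)


def meets_slo(sli: float, target: float = 0.999) -> bool:
    """SLO: the SLI must be at or above the target."""
    return sli >= target


if __name__ == "__main__":
    # Hypothetical sample: 998 successes and 2 failures.
    sample = [200] * 998 + [500, 404]
    sli = request_success_rate(sample)
    print(sli, meets_slo(sli))
```

Keeping the SLI calculation separate from the SLO check lets the same measurement be tested against different targets, such as a stricter one for checkout than for browsing.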
The next concept you’ll grapple with is how to automate remediation when an SLO is breached.