Dynatrace Cloud Automation isn’t just another dashboard; it’s a system that uses AI to detect problems, pinpoint their root cause, and trigger fixes, often before you even notice them.
Let’s say you have a service, user-api, that’s suddenly experiencing increased latency. Normally, you’d get an alert, dive into logs, trace requests, and manually scale up instances. With Cloud Automation, Dynatrace’s AI, Davis, detects the increased latency, correlates it with higher request volume, and automatically triggers a pre-defined remediation action. This action might be to scale up the user-api deployment by two replicas. The entire process, from detection to resolution, happens in minutes, often before any users are impacted.
Here’s how it works under the hood. Dynatrace continuously monitors your environment, collecting metrics, traces, and logs. Its AI engine, Davis, analyzes this data in real time to establish a baseline of normal behavior. When deviations occur, Davis identifies the root cause. Cloud Automation then uses this root-cause analysis to execute automated workflows. These workflows are defined using a declarative, Kubernetes-native approach, often expressed in YAML.
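To make the baseline-and-deviation idea concrete, here is a minimal Python sketch. It is not Davis's actual algorithm, which isn't described here; it assumes a simple "mean plus k standard deviations" rule as a stand-in, and the function name `is_anomalous` and the sample data are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, value, k=3.0):
    """Return True if value is more than k standard deviations above the baseline.

    history: recent samples treated as "normal" behavior (assumed input).
    This is an illustrative stand-in, not how Davis computes baselines.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    # Guard against a perfectly flat history, where stdev is 0.
    threshold = baseline + k * max(spread, 1e-9)
    return value > threshold


latencies_ms = [100, 102, 98, 101, 99, 103, 97]
print(is_anomalous(latencies_ms, 104))  # small wobble within the baseline
print(is_anomalous(latencies_ms, 180))  # clear deviation
```

A real engine would also account for seasonality and traffic volume; the point is only that "deviation" is measured against learned behavior rather than a fixed number.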
Consider a common scenario: a microservice, payment-service, starts failing health checks due to increased error rates.
- Detection: Dynatrace observes a spike in payment-service error rates (e.g., from 0.1% to 5%) and failing health checks.
- Root Cause Analysis: Davis identifies that the increased errors are correlated with a new deployment of payment-service that introduced a bug, and that this bug is causing it to exhaust its database connection pool.
- Automated Workflow Trigger: Cloud Automation, recognizing this specific pattern of "high error rate + connection pool exhaustion + recent deployment," triggers a pre-configured workflow.
- Remediation: The workflow might execute a kubectl rollout undo command to revert payment-service to its previous stable version.
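The pattern-matching step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Problem` class, the signal names, and `plan_remediation` are hypothetical, and the sketch assumes the Kubernetes deployment carries the same name as the service. It builds the `kubectl rollout undo` command but does not run it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    service: str
    signals: frozenset  # e.g. {"high_error_rate", "recent_deployment"}


# Signals that must all be present before a rollback is considered (assumed names).
ROLLBACK_PATTERN = frozenset(
    {"high_error_rate", "connection_pool_exhaustion", "recent_deployment"}
)


def plan_remediation(problem):
    """Return a kubectl argv list for a rollback, or None if the pattern does not match."""
    if ROLLBACK_PATTERN <= problem.signals:  # subset test: every required signal is present
        return ["kubectl", "rollout", "undo", f"deployment/{problem.service}"]
    return None


problem = Problem("payment-service", ROLLBACK_PATTERN)
print(plan_remediation(problem))
```

Requiring all three signals together is what keeps the rollback from firing on an error spike that has nothing to do with a recent deployment.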
The power of Cloud Automation lies in its ability to bridge the gap between observability and action. It moves beyond simply telling you what is wrong to automatically doing something about it.
Let’s look at a concrete example of a remediation workflow. Suppose you want to automatically scale up a Kubernetes deployment named frontend-app when its CPU utilization exceeds 80% for 5 minutes.
Your workflow might look something like this, defined in a workflow.yaml file:
    apiVersion: automation.dynatrace.com/v1
    kind: Workflow
    metadata:
      name: scale-frontend-app
    spec:
      trigger:
        type: metric
        metric:
          name: kubernetes.cpu.usage.total
          condition: "> 80"
          query: "dtql:builtin:kubernetes.cpu.usage.total:merge(0):avg:auto.group(k8s.cluster.name,k8s.node.name,k8s.pod.name,k8s.container.name):filter(eq(k8s.pod.name, 'frontend-app-*'))"
          duration: 5m
      actions:
        - type: kubernetes
          kubernetes:
            command: scale
            deployment: frontend-app
            replicas: 3 # Scales up to 3 replicas
In this example:
- The trigger section defines that this workflow should run when the kubernetes.cpu.usage.total metric for pods matching frontend-app-* exceeds 80% for 5 minutes.
- The actions section specifies a kubernetes action.
- The command: scale setting tells it to adjust the number of replicas.
- deployment: frontend-app targets the specific deployment.
- replicas: 3 sets the desired number of replicas. If the current number is less than 3, it will scale up.
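What "exceeds 80% for 5 minutes" means in practice is easy to get subtly wrong, so here is a small Python sketch of the evaluation logic. It is a simulation, not Dynatrace code: `breached_for` and `desired_replicas` are hypothetical helpers, the samples are made up, and the choice never to scale down an already larger deployment is an assumption of this sketch.

```python
def breached_for(samples, threshold, duration_s):
    """Check whether the metric stayed above threshold for the whole trailing window.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    Returns False if the samples do not yet cover duration_s.
    """
    if not samples:
        return False
    end = samples[-1][0]
    start = end - duration_s
    if samples[0][0] > start:
        return False  # not enough history to cover the window
    window = [value for ts, value in samples if ts >= start]
    return all(value > threshold for value in window)


def desired_replicas(current, target=3):
    """Scale up to target, leaving a deployment that is already larger untouched."""
    return max(current, target)


# One CPU sample per minute, all at 85%, covering a 5-minute window.
cpu = [(t, 85.0) for t in range(0, 301, 60)]
if breached_for(cpu, threshold=80, duration_s=300):
    print(desired_replicas(current=2))
```

A single dip below the threshold inside the window resets the condition, which is what keeps a brief CPU spike from triggering a scale-up.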
This is not about setting up simple if-then rules. Cloud Automation integrates deeply with Dynatrace’s AI, allowing for much more sophisticated triggers based on complex AI-driven insights, not just raw metric thresholds. For instance, you could trigger a workflow when Davis detects a "performance degradation" event for a specific service, regardless of the exact metric that Davis identified as the root cause.
The real magic happens when you combine different Dynatrace capabilities. Imagine Davis identifies a "potential memory leak" in your auth-service. A Cloud Automation workflow could then:
- Trigger a diagnostic snapshot collection for the auth-service pod.
- Send a notification to the on-call SRE team via Slack, including the snapshot link and Davis’s root cause analysis.
- If the memory usage continues to climb beyond a critical threshold, automatically restart the auth-service pod as a last resort.
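The escalation order in that list can be sketched as a small decision function. Everything here is illustrative: the step names, the `plan_actions` helper, and the 90% critical threshold are assumptions, and the steps are returned as labels rather than executed.

```python
def plan_actions(leak_suspected, memory_pct, critical_pct=90.0):
    """Return the ordered steps for the current observation.

    Diagnostics and notification always come first; the restart is appended
    only once memory usage has climbed beyond the critical threshold.
    """
    if not leak_suspected:
        return []
    steps = ["collect_snapshot", "notify_on_call"]
    if memory_pct > critical_pct:
        steps.append("restart_pod")  # last resort
    return steps


print(plan_actions(leak_suspected=True, memory_pct=70.0))
print(plan_actions(leak_suspected=True, memory_pct=95.0))
```

Keeping the disruptive step behind a second, stricter condition means the on-call team gets the snapshot and context before anything is restarted.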
This layered approach ensures that automated actions are intelligent and context-aware, minimizing the risk of unintended consequences. You’re not just automating tasks; you’re automating intelligent responses.
One aspect that often surprises people is how granularly you can define triggers. It’s not just about high-level service metrics. You can trigger workflows based on specific Problem events identified by Davis, which encapsulate a root cause and its impact across multiple services. This means you can react to sophisticated, multi-faceted issues. For example, if Davis flags a "database connection saturation" problem affecting user-api, order-service, and payment-service, a single workflow can be triggered to investigate the database, scale related services, or even initiate a database failover, all coordinated through Cloud Automation.
The next step after mastering automated remediation is exploring automated change validation, where Dynatrace verifies that your automated fixes or manual deployments have actually resolved the underlying issue before closing the loop.