Chaos engineering isn’t about breaking things randomly; it’s about proactively discovering vulnerabilities in your system by introducing controlled failures.
A distributed tracing system like Jaeger can visualize the effect of an experiment such as the simulated network latency test below.
Here’s a simplified Go application that makes a call to another service:
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/process", handler)
	// ListenAndServe only returns on failure (for example, the port is already in use).
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Call the dependency, giving up after 5 seconds.
	client := http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://localhost:8081/dependency")
	if err != nil {
		http.Error(w, "Dependency unavailable", http.StatusInternalServerError)
		return
	}
	defer resp.Body.Close()
	fmt.Fprint(w, "Processed successfully!")
}
And its dependency:
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/dependency", dependencyHandler)
	// ListenAndServe only returns on failure (for example, the port is already in use).
	log.Fatal(http.ListenAndServe(":8081", nil))
}

func dependencyHandler(w http.ResponseWriter, r *http.Request) {
	time.Sleep(3 * time.Second) // Simulate work
	fmt.Fprint(w, "Dependency data")
}
If you run these and hit http://localhost:8080/process, it works. Now, let’s introduce chaos. We’ll use tc, a Linux command-line utility for traffic control, to add latency to the network connection between our two services.
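# Add 2 seconds of latency to every packet leaving the interface
sudo tc qdisc add dev eth0 root netem delay 2000ms
(Replace eth0 with the interface that carries traffic between the two services. When both run on one machine and talk over localhost, as in this example, that interface is lo. netem delays every packet leaving the interface, so on a shared interface it affects requests, responses, and unrelated traffic alike. Remove the rule afterward with sudo tc qdisc del dev eth0 root.)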
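Now, when you hit http://localhost:8080/process again, and treating the injected delay as a single 2-second addition to the round trip, the request will take at least 3 seconds (dependency’s sleep) + 2 seconds (added latency) = 5 seconds to complete. Crucially, the client http.Client has a Timeout of 5 seconds, so the request lands right on the deadline with no margin: any extra overhead makes it fail, and even when it succeeds the user experience is significantly degraded. In practice the delay applies to each packet, including the TCP handshake on a new connection, so the measured total can be higher than this estimate. With 2.1 seconds of latency, even the idealized total of 5.1 seconds exceeds the timeout, and the request fails.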
This experiment reveals that our system’s resilience is directly tied to the client’s timeout configuration and the maximum acceptable latency of its dependencies. The "weakness" isn’t in the dependency service itself, but in the contract between services, specifically how they handle delays.
The core problem chaos engineering solves is the "unknown unknowns" – scenarios that are too complex or improbable to have considered during design. By simulating failures like network partitions, service outages, increased latency, or resource exhaustion (CPU, memory), you can observe how your system behaves under duress and identify single points of failure, race conditions, and inadequate error handling.
A common chaos engineering tool is Gremlin. You choose a target (e.g., the hosts or containers running service A) and a fault to inject (e.g., "add 500ms latency to requests between service A and service B"). Gremlin then orchestrates these experiments and can integrate with your monitoring and alerting systems. The key is to start small, target specific components, and have a clear hypothesis about what you expect to happen.
For instance, a hypothesis might be: "Increasing latency to the database by 1 second will cause the user authentication service to degrade gracefully, returning a 'try again later' message, rather than crashing." If the experiment shows the service crashing, you’ve found a weakness. The fix might involve implementing circuit breakers, improving connection pooling, or optimizing database queries.
The true power comes from integrating this into your CI/CD pipeline. Imagine a canary deployment that, before fully rolling out, runs a brief chaos experiment. If the experiment fails, the canary is automatically rolled back. This prevents faulty code or configuration from reaching your entire user base.
One aspect that’s often overlooked is the impact of cascading failures. A small, isolated issue in a non-critical service can, under load or with specific timing, bring down an entire critical path. Chaos experiments that simulate upstream service failures or introduce load spikes can expose these brittle dependencies. For example, injecting CPU saturation into a background worker pool might seem harmless, but if that pool also handles critical health checks or rate limiting, the entire application could become unresponsive. The fix often involves better resource isolation (e.g., separate Kubernetes nodes or pods with strict resource limits) and ensuring that non-essential background tasks don’t consume resources needed by core functionalities.
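Inside a single process, one way to approximate this isolation is the bulkhead pattern: give each class of work its own bounded pool so that saturating one cannot starve the other. The sketch below uses a hypothetical Bulkhead type built on a buffered channel. It limits concurrency only; it does not cap CPU or memory, which still requires OS-level or container-level limits.

```go
package main

import (
	"fmt"
	"sync"
)

// Bulkhead bounds how many tasks of one class may run at once.
type Bulkhead struct {
	slots chan struct{}
}

func NewBulkhead(n int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, n)}
}

// TryRun runs fn if a slot is free and reports whether it ran.
// It never blocks waiting for a slot.
func (b *Bulkhead) TryRun(fn func()) bool {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }()
		fn()
		return true
	default:
		return false
	}
}

func main() {
	background := NewBulkhead(2) // batch jobs
	health := NewBulkhead(1)     // health checks get their own pool

	release := make(chan struct{})
	var started, done sync.WaitGroup
	for i := 0; i < 2; i++ {
		started.Add(1)
		done.Add(1)
		go func() {
			defer done.Done()
			background.TryRun(func() {
				started.Done()
				<-release // simulate a stuck job holding its slot
			})
		}()
	}
	started.Wait() // both background slots are now occupied

	fmt.Println("background accepts more work:", background.TryRun(func() {})) // false
	fmt.Println("health check still runs:", health.TryRun(func() {}))          // true

	close(release)
	done.Wait()
}
```

A chaos experiment that saturates the background pool should then leave health checks responsive; if it does not, the two are sharing a resource that the bulkhead does not cover.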
After successfully running latency experiments and ensuring your services handle it gracefully, the next logical step is to experiment with injecting actual service failures, like stopping a dependent service entirely.