Test Distributed Systems: Unit, Integration, and Chaos Testing Strategies (2026)

Testing distributed systems is fundamentally about building confidence that independent components, each of which can fail in its own way, still collectively achieve the desired outcome.

Imagine you’re building a simple e-commerce checkout flow. You have a CatalogService, an OrderService, and a PaymentService.

Here’s a snippet of how these might interact:

# catalog_service.py
class CatalogService:
    def get_item_details(self, item_id):
        # In a real system, this would query a database or another service
        if item_id == "XYZ123":
            return {"name": "Super Widget", "price": 19.99, "stock": 10}
        return None

# order_service.py
from catalog_service import CatalogService
from payment_service import PaymentService # Assume this exists

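class OrderService:
    def __init__(self, catalog=None, payment=None):
        # Dependencies can be injected (e.g. mocks in tests); default to the real services.
        self.catalog = catalog if catalog is not None else CatalogService()
        self.payment = payment if payment is not None else PaymentService()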

    def create_order(self, user_id, item_id, quantity):
        item_details = self.catalog.get_item_details(item_id)
        if not item_details or item_details["stock"] < quantity:
            raise ValueError("Item out of stock or not found.")

        total_price = item_details["price"] * quantity
        payment_successful = self.payment.process_payment(user_id, total_price)

        if payment_successful:
            # In a real system, this would update stock, create order in DB, etc.
            print(f"Order created for {user_id}, item {item_id}.")
            return {"order_id": "ORD98765", "status": "completed"}
        else:
            raise ConnectionError("Payment processing failed.")

# payment_service.py (simplified)
class PaymentService:
    def process_payment(self, user_id, amount):
        # Simulate external payment gateway
        print(f"Attempting to process payment of ${amount} for {user_id}...")
        # In a real scenario, this could fail due to network, auth, etc.
        return True # For now, assume success

This looks simple enough, right? But what happens when CatalogService is slow, PaymentService times out, or a network partition occurs between them? Unit, integration, and chaos testing are your tools to find out before your customers do.

Unit Testing: The Isolated View

Unit tests focus on a single, isolated component. For our OrderService, a unit test would verify its logic without actually calling the real CatalogService or PaymentService. We use mocks or stubs to simulate their behavior.

# test_order_service.py
import unittest
from unittest.mock import MagicMock
from order_service import OrderService

class TestOrderService(unittest.TestCase):
    def test_create_order_successful(self):
        # Mock dependencies
        mock_catalog = MagicMock()
        mock_payment = MagicMock()

        # Configure mock behavior
        mock_catalog.get_item_details.return_value = {"name": "Widget", "price": 10.0, "stock": 5}
        mock_payment.process_payment.return_value = True

        # Instantiate service with mocks
        order_service = OrderService()
        order_service.catalog = mock_catalog
        order_service.payment = mock_payment

        # Act
        result = order_service.create_order("user1", "WIDGET1", 2)

        # Assert
        self.assertEqual(result["status"], "completed")
        mock_catalog.get_item_details.assert_called_once_with("WIDGET1")
        mock_payment.process_payment.assert_called_once_with("user1", 20.0)

    def test_create_order_out_of_stock(self):
        mock_catalog = MagicMock()
        mock_payment = MagicMock()
        mock_catalog.get_item_details.return_value = {"name": "Widget", "price": 10.0, "stock": 1}
        order_service = OrderService()
        order_service.catalog = mock_catalog
        order_service.payment = mock_payment

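        with self.assertRaisesRegex(ValueError, "Item out of stock"):
            order_service.create_order("user1", "WIDGET1", 2)

        # The customer must not be charged for an order that was rejected
        mock_payment.process_payment.assert_not_called()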

if __name__ == '__main__':
    unittest.main()

These tests verify that OrderService charges the right amount on the happy path and rejects an order when stock is insufficient. Because no real service or network is involved, they run quickly and deterministically, which makes them well suited to covering individual logic paths.
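The same mocking approach covers the failure paths raised earlier, such as a declined payment or a PaymentService call that times out. The following is a minimal sketch, not the article's exact code: it uses a condensed stand-in for OrderService that takes its dependencies through the constructor (an assumption made so the file runs on its own), and the helper make_service is hypothetical.

```python
import unittest
from unittest.mock import MagicMock


class OrderService:
    """Condensed stand-in for the article's OrderService (assumption: constructor injection)."""

    def __init__(self, catalog, payment):
        self.catalog = catalog
        self.payment = payment

    def create_order(self, user_id, item_id, quantity):
        item = self.catalog.get_item_details(item_id)
        if not item or item["stock"] < quantity:
            raise ValueError("Item out of stock or not found.")
        if self.payment.process_payment(user_id, item["price"] * quantity):
            return {"order_id": "ORD98765", "status": "completed"}
        raise ConnectionError("Payment processing failed.")


def make_service(payment_mock):
    """Hypothetical helper: an OrderService whose catalog always has stock."""
    catalog = MagicMock()
    catalog.get_item_details.return_value = {"name": "Widget", "price": 10.0, "stock": 5}
    return OrderService(catalog, payment_mock)


class TestOrderServicePaymentFailures(unittest.TestCase):
    def test_payment_declined_raises(self):
        payment = MagicMock()
        payment.process_payment.return_value = False  # gateway says "no"
        with self.assertRaisesRegex(ConnectionError, "Payment processing failed"):
            make_service(payment).create_order("user1", "WIDGET1", 2)

    def test_payment_timeout_propagates(self):
        payment = MagicMock()
        # side_effect makes the mock raise instead of returning a value
        payment.process_payment.side_effect = TimeoutError("gateway timed out")
        with self.assertRaises(TimeoutError):
            make_service(payment).create_order("user1", "WIDGET1", 2)
        payment.process_payment.assert_called_once_with("user1", 20.0)
```

Run it with python -m unittest. The second test documents current behavior: the timeout propagates to the caller unchanged. Whether it should instead be translated into a domain-specific error is a design decision the test makes visible.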

Integration Testing: The Handshake

Integration tests check how components work together. Here, we’d want to see if OrderService can successfully interact with a real (or a test instance of) CatalogService and PaymentService. This is where you catch issues like incorrect API calls, data format mismatches, or unexpected latency.

Let’s say we have our services running on different ports or as separate processes. A typical integration test might involve:

Setting up a test environment: This could be spinning up Docker containers for each service.
Making an API call: For example, sending a POST request to OrderService’s /create_order endpoint.
Asserting the outcome: Checking the response code, verifying that CatalogService was queried, and that PaymentService was indeed called.

# Consider this a conceptual example using a testing framework like pytest
# and a tool like docker-compose for service orchestration.

# test_integration.py
import requests
import pytest

ORDER_SERVICE_URL = "http://localhost:8000" # Assuming OrderService runs here
CATALOG_SERVICE_URL = "http://localhost:8001" # Assuming CatalogService runs here
PAYMENT_SERVICE_URL = "http://localhost:8002" # Assuming PaymentService runs here

@pytest.fixture(scope="module", autouse=True)
def setup_test_services():
    # This would typically involve starting Docker containers
    # for catalog, order, and payment services.
    print("Setting up test services...")
    # For demonstration, assume they are already running.
    yield
    print("Tearing down test services...")

def test_order_creation_integration():
    # Mock external dependencies for Catalog and Payment if they are complex
    # Or ensure their test instances are available.
    # For this example, we assume they are accessible and functional.

    payload = {
        "user_id": "user2",
        "item_id": "XYZ123",
        "quantity": 1
    }
    # Set a timeout so a hung service fails the test instead of blocking it forever
    response = requests.post(f"{ORDER_SERVICE_URL}/create_order", json=payload, timeout=10)

    assert response.status_code == 200
    data = response.json()
    assert data["status"] == "completed"
    assert "order_id" in data

    # Further assertions could check logs or database states
    # to ensure CatalogService.get_item_details and PaymentService.process_payment were called.

This test requires all three services to be running and communicating correctly. It is slower and more brittle than a unit test, because any of the services or the network between them can make it fail, but it provides higher confidence that the system works as a whole.
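A common source of that brittleness is a test that starts before the services are ready to accept requests. One way to reduce it is to poll each service during setup. The sketch below uses only the standard library; the names wait_until and http_ok are hypothetical, and it assumes each service exposes some URL that returns a 2xx status once it is up.

```python
import time
import urllib.request


def http_ok(url, timeout_s=2.0):
    """Return True if `url` answers with a 2xx status, False on any network or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers connection refused, timeouts, and HTTP 4xx/5xx
        # (urllib's URLError and HTTPError both derive from OSError).
        return False


def wait_until(probe, timeout_s=30.0, interval_s=0.5):
    """Call `probe()` repeatedly until it returns True, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while not probe():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(interval_s)
```

Inside the setup_test_services fixture this could be called as wait_until(lambda: http_ok(ORDER_SERVICE_URL + "/health")) before the yield, where /health is an assumed endpoint rather than one defined in this article.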

Chaos Testing: Embracing Failure

Distributed systems are designed to be resilient, but resilience is often only proven when things go wrong. Chaos testing (or chaos engineering) involves deliberately injecting failures into your system to see how it reacts and to uncover weaknesses. The question is not whether your system will fail, but how it fails and how gracefully it recovers.

Tools like Gremlin, Chaos Mesh, or even custom scripts can be used to simulate:

Network Latency/Packet Loss: tc qdisc add dev eth0 root netem delay 100ms loss 1% (Linux command that adds 100 ms of delay and 1% packet loss to outgoing traffic on the eth0 interface; substitute your own interface name).
Service Restarts/Crashes: Killing a CatalogService pod in Kubernetes.
Resource Exhaustion: Saturating a service's CPU or memory.
Unavailability: Blocking network access to a dependency.
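For the "custom scripts" option, the simplest form of fault injection works at the application level: wrap a dependency call so that it is delayed or fails on demand. The sketch below is one possible shape under stated assumptions; FaultInjector and its parameters are hypothetical names, not part of Gremlin, Chaos Mesh, or any other tool.

```python
import random
import time


class FaultInjector:
    """Wraps a callable and injects latency and/or failures before delegating to it."""

    def __init__(self, target, delay_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
        self.target = target
        self.delay_s = delay_s            # fixed latency added to every call
        self.failure_rate = failure_rate  # probability in [0, 1] that a call fails
        self.rng = rng if rng is not None else random.Random()
        self.sleep = sleep                # injectable so tests need not really wait

    def __call__(self, *args, **kwargs):
        if self.delay_s > 0:
            self.sleep(self.delay_s)
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault: dependency unavailable")
        return self.target(*args, **kwargs)
```

To slow down payments in a test environment, the real method could be replaced with payment.process_payment = FaultInjector(payment.process_payment, delay_s=5.0). This only simulates faults inside one process; network-level tools are still needed to exercise real sockets, timeouts, and connection pools.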

A chaos experiment might look like this:

Experiment Name: Payment Gateway Timeout During Checkout

Hypothesis: The OrderService will gracefully handle a timeout from the PaymentService, returning an informative error to the user and not leaving the order in an inconsistent state.

Environment: A production-like staging environment or a dedicated chaos testing environment.

Steps:

Baseline: Ensure the checkout flow is working normally.
Inject Failure: Use a chaos tool to add 5 seconds of network latency to traffic from OrderService to PaymentService.
Execute Scenario: Trigger a checkout transaction.
Observe: Monitor logs, metrics, and user-facing errors.
Verify Hypothesis:
- Did OrderService eventually time out the PaymentService call?
- Did it return a specific error like "Payment processing delayed"?
- Was the order state consistent (e.g., not marked as completed if payment failed)?
- Did the system recover automatically once the injected failure was removed?
Rollback: Remove the injected failure.
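Before running this against real infrastructure, the hypothesis can be rehearsed in-process. The sketch below is a simplified stand-in rather than the article's OrderService: Checkout, SlowPaymentService, PaymentTimeout, run_experiment, and the "payment_timeout" status are all hypothetical names, and a thread pool with a result timeout takes the place of a real network client timeout.

```python
import concurrent.futures
import time


class PaymentTimeout(Exception):
    """Raised when the payment dependency does not answer in time."""


class SlowPaymentService:
    """Stand-in for PaymentService with injected latency (the fault)."""

    def __init__(self, delay_s):
        self.delay_s = delay_s

    def process_payment(self, user_id, amount):
        time.sleep(self.delay_s)
        return True


class Checkout:
    def __init__(self, payment, timeout_s):
        self.payment = payment
        self.timeout_s = timeout_s
        self.orders = {}  # order_id -> status
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def create_order(self, order_id, user_id, amount):
        self.orders[order_id] = "pending"
        future = self._pool.submit(self.payment.process_payment, user_id, amount)
        try:
            paid = future.result(timeout=self.timeout_s)
        except concurrent.futures.TimeoutError:
            # The payment outcome is unknown here; a real system would need
            # reconciliation or an idempotency key before retrying.
            self.orders[order_id] = "payment_timeout"
            raise PaymentTimeout("Payment processing delayed") from None
        self.orders[order_id] = "completed" if paid else "payment_failed"
        return self.orders[order_id]


def run_experiment(delay_s, timeout_s):
    """Execute one checkout under the injected delay and report what was observed."""
    checkout = Checkout(SlowPaymentService(delay_s), timeout_s)
    started = time.monotonic()
    error = None
    try:
        checkout.create_order("ORD1", "user1", 19.99)
    except PaymentTimeout as exc:
        error = str(exc)
    return {
        "error": error,
        "status": checkout.orders["ORD1"],
        "elapsed_s": time.monotonic() - started,
    }
```

Calling run_experiment(delay_s=5.0, timeout_s=1.0) mirrors the injected 5-second delay, while run_experiment(delay_s=0.0, timeout_s=1.0) serves as the baseline. Note that the timed-out payment call keeps running in the background, which is exactly the kind of ambiguity the "Verify Hypothesis" questions are meant to surface.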

This kind of testing forces you to think about fallback mechanisms, retry strategies (with backoff!), circuit breakers, and how to provide meaningful feedback to users when downstream services misbehave. Of the three approaches, it is the only one that exercises the system under actual failure conditions, which makes it crucial for building truly robust distributed systems.
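Two of those mechanisms, retries with backoff and circuit breakers, are small enough to sketch. The code below is illustrative only: retry_with_backoff, CircuitBreaker, CircuitOpenError, and their parameters are hypothetical names, ConnectionError stands in for whatever transient error your client raises, and production code would usually add jitter to the delays and rely on a maintained library.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Call func(); on ConnectionError, retry with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open state)
        try:
            result = func()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

The two compose naturally: breaker.call(lambda: retry_with_backoff(charge)) retries transient errors a few times, and once failures keep accumulating the breaker stops calling the dependency at all until the cool-down has passed. Here charge is a placeholder for the actual payment call.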

The next challenge you’ll face is managing the complexity of testing across many microservices, leading into strategies for end-to-end testing and service virtualization.