Service Discovery Explained: From DNS to Consul

Service discovery is how services find each other in a distributed system. Its real value is that services no longer need to know each other's network locations explicitly.

Let’s see it in action. Imagine a simple web application with a frontend service and a backend API.

Frontend Service (Node.js)

const express = require('express');
const axios = require('axios');
const app = express();
const port = 3000;

// Service discovery client (e.g., Consul client)
const consul = require('consul')({ promisify: true });

app.get('/items', async (req, res) => {
    try {
        // Ask Consul for the 'backend-api' instances whose health checks are passing
        const entries = await consul.health.service({ service: 'backend-api', passing: true });
        if (!entries || entries.length === 0) {
            return res.status(503).send('Backend API not available');
        }

        // Pick the first healthy instance (in a real scenario, you'd do load balancing)
        const { Node, Service } = entries[0];
        // Service.Address is empty if the instance registered without one; fall back to the node's address
        const backendUrl = `http://${Service.Address || Node.Address}:${Service.Port}`;

        // Make the request to the backend
        const backendResponse = await axios.get(`${backendUrl}/api/items`);
        res.json(backendResponse.data);
    } catch (error) {
        console.error('Error calling backend API:', error.message);
        res.status(500).send('Error retrieving items');
    }
});

app.listen(port, () => {
    console.log(`Frontend service listening on port ${port}`);
});

Backend API Service (Node.js)

const express = require('express');
const app = express();
const port = 4000;
const consul = require('consul')({ promisify: true });

app.get('/api/items', (req, res) => {
    res.json([{ id: 1, name: 'Widget' }, { id: 2, name: 'Gadget' }]);
});

app.listen(port, () => {
    console.log(`Backend API service listening on port ${port}`);

    // Register this service with Consul
    consul.agent.service.register({
        name: 'backend-api',
        id: 'backend-api-instance-1', // Unique ID for this instance
        port: port,
        address: '192.168.1.100', // The IP address other services use to reach this instance
        check: {
            http: `http://192.168.1.100:${port}/health`,
            interval: '10s',
            timeout: '1s'
        }
    }).catch(err => {
        console.error('Error registering service with Consul:', err);
    });
});

// Health check endpoint for Consul
app.get('/health', (req, res) => {
    res.sendStatus(200);
});

In this example, the frontend-service doesn’t hardcode the IP address or port of the backend-api. Instead, it asks a central registry (Consul) for a list of available backend-api instances. Consul maintains this list by having each service instance register itself upon startup. It also performs health checks to ensure only healthy instances are discoverable.

The core problem service discovery solves is dynamic infrastructure. In a microservices architecture, services are constantly starting, stopping, scaling up, and scaling down. If you hardcode network locations, your system breaks every time an instance changes. Service discovery decouples services by providing an abstraction layer for network endpoints. The frontend service needs to know how to ask for the backend, but not where the backend lives.

The mental model involves three key components:

Service Registry: A database that stores information about available service instances (name, IP address, port, health status). Consul, etcd, and ZooKeeper are common examples. Kubernetes’ internal DNS is another form.
Service Provider: The service that wants to be discovered. It registers itself with the registry and typically provides health check information.
Service Consumer: The service that needs to find another service. It queries the registry to get the network address of an available provider.
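
To make the three roles concrete, here is a minimal in-memory sketch. The InMemoryRegistry class and its method names are hypothetical and exist only for illustration; a real registry such as Consul adds replication, persistence, and its own health checking.

```javascript
// Hypothetical in-memory registry, for illustration only.
class InMemoryRegistry {
    constructor() {
        // Map of instance id -> { id, name, address, port, healthy }
        this.instances = new Map();
    }

    // Called by a service provider on startup.
    register({ id, name, address, port }) {
        this.instances.set(id, { id, name, address, port, healthy: true });
    }

    // Called by a provider on shutdown, or by the registry after repeated failed checks.
    deregister(id) {
        this.instances.delete(id);
    }

    // Records the outcome of a health check for one instance.
    setHealth(id, healthy) {
        const instance = this.instances.get(id);
        if (instance) {
            instance.healthy = healthy;
        }
    }

    // Called by a service consumer: returns only healthy instances of the named service.
    lookup(name) {
        return [...this.instances.values()].filter(
            (instance) => instance.name === name && instance.healthy
        );
    }
}

const registry = new InMemoryRegistry();
registry.register({ id: 'backend-api-instance-1', name: 'backend-api', address: '192.168.1.100', port: 4000 });
registry.register({ id: 'backend-api-instance-2', name: 'backend-api', address: '192.168.1.101', port: 4000 });
registry.setHealth('backend-api-instance-2', false);

// Only the healthy instance is visible to consumers.
console.log(registry.lookup('backend-api').map((i) => `${i.address}:${i.port}`));
```

The provider only calls register and deregister, the consumer only calls lookup, and neither needs to know about the other.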

How this works internally depends on the registry. Consul replicates its service catalog across its server nodes with Raft, and uses a gossip protocol for cluster membership and failure detection. etcd also uses Raft for strong consistency. Kubernetes stores its Service and endpoint objects in etcd and exposes them through a cluster DNS server. The consumer can either poll the registry periodically or use watches/long-polling to be notified of changes.
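
Whether a consumer polls or watches, it ends up holding a new snapshot of the instance list and has to work out what changed. A rough sketch of that comparison follows; diffInstances is a hypothetical helper, and it assumes each instance carries a stable id.

```javascript
// Hypothetical helper: compares two snapshots of a service's instance list.
// Each snapshot is an array of { id, address, port } objects.
// Instances are matched by id only, so an address change under the same id is not reported.
function diffInstances(previous, current) {
    const previousIds = new Set(previous.map((instance) => instance.id));
    const currentIds = new Set(current.map((instance) => instance.id));
    return {
        added: current.filter((instance) => !previousIds.has(instance.id)),
        removed: previous.filter((instance) => !currentIds.has(instance.id)),
    };
}

const before = [
    { id: 'backend-api-instance-1', address: '192.168.1.100', port: 4000 },
    { id: 'backend-api-instance-2', address: '192.168.1.101', port: 4000 },
];
const after = [
    { id: 'backend-api-instance-2', address: '192.168.1.101', port: 4000 },
    { id: 'backend-api-instance-3', address: '192.168.1.102', port: 4000 },
];

const changes = diffInstances(before, after);
console.log('added:', changes.added.map((i) => i.id));
console.log('removed:', changes.removed.map((i) => i.id));
```

A consumer could use the removed list to close pooled connections and the added list to start routing to new instances.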

The exact levers you control are primarily in how services register and how consumers query. For registration:

name: The logical name of the service (e.g., user-service).
id: A unique identifier for a specific instance of a service (e.g., user-service-abc123).
address: The IP address the service is listening on.
port: The port the service is listening on.
tags: Metadata for filtering (e.g., production, region:us-east-1).
check: The health check mechanism (HTTP endpoint, TCP port, script execution).
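
As a sketch of how these fields fit together, the hypothetical buildRegistration helper below assembles a registration object from them. The instanceSuffix parameter and the /health path are assumptions made for this example.

```javascript
// Hypothetical helper that assembles a registration object from the fields listed above.
function buildRegistration({ name, instanceSuffix, address, port, tags = [] }) {
    if (!name || !instanceSuffix) {
        throw new Error('name and instanceSuffix are required');
    }
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
        throw new Error(`invalid port: ${port}`);
    }
    return {
        name,
        id: `${name}-${instanceSuffix}`, // must be unique per instance
        address,
        port,
        tags,
        check: {
            http: `http://${address}:${port}/health`, // assumes the service exposes /health
            interval: '10s',
            timeout: '1s',
        },
    };
}

const registration = buildRegistration({
    name: 'user-service',
    instanceSuffix: 'abc123',
    address: '192.168.1.100',
    port: 4000,
    tags: ['production', 'region:us-east-1'],
});
console.log(JSON.stringify(registration, null, 2));
```

The result has the same shape as the object passed to consul.agent.service.register in the backend example, with tags added.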

For consumption:

Discovery Method: Client-side discovery (like the example above, where the client queries the registry) vs. Server-side discovery (where a load balancer or proxy handles the discovery and routing).
Load Balancing Strategy: Round-robin, least connections, random. The consumer or proxy decides which healthy instance to send traffic to.
Health Check Configuration: How often to check, what constitutes a failure, how long to wait before marking an instance as unhealthy.
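
For client-side discovery, the load balancing strategy can be very small. Below is a minimal sketch of round-robin and random selection over a list of healthy instances; createRoundRobin and pickRandom are hypothetical names, and instances can be whatever objects the registry returns.

```javascript
// Hypothetical round-robin selector over a list of healthy instances.
function createRoundRobin() {
    let next = 0;
    return function pick(instances) {
        if (!instances || instances.length === 0) {
            return null; // nothing healthy to route to
        }
        // The modulo keeps the index valid even if the list shrank since the last call.
        const instance = instances[next % instances.length];
        next = (next + 1) % instances.length;
        return instance;
    };
}

// Random selection, for comparison. It needs no state between calls.
function pickRandom(instances) {
    if (!instances || instances.length === 0) {
        return null;
    }
    return instances[Math.floor(Math.random() * instances.length)];
}

const pick = createRoundRobin();
const instances = ['192.168.1.100:4000', '192.168.1.101:4000', '192.168.1.102:4000'];
for (let i = 0; i < 4; i++) {
    console.log(pick(instances));
}
```

Round-robin spreads requests evenly from a single client but needs a counter per service. Random selection is stateless and tends to even out across many clients.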

When using client-side discovery with a registry like Consul, the consumer often makes an API call like consul.health.service({ service: 'my-service', passing: true }). This returns the instances of the service whose health checks are passing. The client then needs to select one, often using a simple round-robin or random selection. The crucial detail is that the client must handle the case where the returned list is empty, indicating no healthy instances are available.
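
One way to make the empty-list case hard to forget is to put it behind a single function that throws a dedicated error. The sketch below assumes each entry carries Node and Service objects, as Consul's health endpoint returns them; resolveUrl and NoHealthyInstancesError are hypothetical names.

```javascript
// Hypothetical error type so callers can map "no instances" to a 503 response.
class NoHealthyInstancesError extends Error {
    constructor(serviceName) {
        super(`No healthy instances of ${serviceName}`);
        this.name = 'NoHealthyInstancesError';
        this.serviceName = serviceName;
    }
}

// Picks a random entry and turns it into a base URL.
function resolveUrl(serviceName, entries) {
    if (!Array.isArray(entries) || entries.length === 0) {
        throw new NoHealthyInstancesError(serviceName);
    }
    const { Node, Service } = entries[Math.floor(Math.random() * entries.length)];
    // Service.Address may be empty; fall back to the node's address.
    const host = Service.Address || Node.Address;
    return `http://${host}:${Service.Port}`;
}

const sampleEntries = [
    { Node: { Address: '192.168.1.100' }, Service: { Address: '', Port: 4000 } },
];
console.log(resolveUrl('backend-api', sampleEntries));

try {
    resolveUrl('backend-api', []);
} catch (error) {
    console.log(error.name, '-', error.message);
}
```

In a request handler, catching NoHealthyInstancesError separately lets you return 503 for "nothing to route to" and keep other status codes for failures of the backend call itself.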

The next concept to grapple with is how to handle failures in the service discovery mechanism itself, and what happens when a service instance becomes unhealthy after being registered.