Build a Custom Envoy Control Plane with xDS APIs (2026)

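Envoy does nothing on its own: every listener, route, and cluster it uses comes from configuration. That configuration can be a static bootstrap file, but the more flexible option is a control plane that delivers it at runtime over the xDS APIs.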

Let’s see Envoy in action. Imagine we have a simple setup: two services, service-a and service-b, running on different ports, and we want to route traffic from a client to service-b.

First, we need an Envoy configuration that tells it to listen for traffic and connect to a control plane. This envoy.yaml is pretty basic:

node:
  id: test-domain # Must match the node key the control plane uses for its snapshot
  cluster: demo-cluster

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          # Fetch the route table named local_route from the control plane (RDS over ADS).
          rds:
            route_config_name: local_route
            config_source:
              resource_api_version: V3
              ads: {}
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  # Define the xDS cluster that Envoy will use to connect to the control plane.
  # It is the only static cluster: Envoy needs it to bootstrap the connection,
  # and every other cluster (such as service_b) arrives dynamically over CDS.
  clusters:
  - name: xds_cluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    # xDS runs over gRPC, so the connection to the control plane must use HTTP/2.
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    # This is where Envoy *tries* to connect to your control plane.
    # You'll need to replace this with your actual control plane's address.
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: "127.0.0.1" # Replace with your control plane IP/hostname
                port_value: 50001     # Replace with your control plane gRPC port
    # This example uses plaintext gRPC because the sample control plane serves without TLS.
    # TLS is highly recommended outside local testing: add a transport_socket with an
    # UpstreamTlsContext here (sni, trusted CA) and matching TLS credentials on the server.

dynamic_resources:
  ads_config: # This is the crucial part for dynamically updating
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        cluster_name: xds_cluster # Refers to the static xds_cluster defined above
  # Clusters are defined by the control plane and delivered over the ADS stream.
  cds_config:
    resource_api_version: V3
    ads: {}

Now, let’s imagine a very simple Go control plane that serves the AggregatedDiscoveryService (ADS) gRPC API and hands out one cluster and one route configuration.

package main

import (
	"context"
	"log"
	"net"
	"time"

	// xDS API imports
	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"

	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	"github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	"github.com/envoyproxy/go-control-plane/pkg/resource/v3"
	"github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
	"google.golang.org/protobuf/types/known/durationpb"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

const (
	// The port where the control plane gRPC server will listen.
	// Envoy's `xds_cluster` in `envoy.yaml` should point to this.
	controlPlanePort = "50001"

	// The cache stores one snapshot per node ID, so this value must equal
	// `node.id` in Envoy's bootstrap configuration.
	nodeID = "test-domain"
)

var (
	// This is the core of the control plane: a cache that holds the desired state.
	// Envoy will watch this cache for updates.
	snapshotCache cache.SnapshotCache
)

// createSnapshots builds the initial desired state for Envoy: a cluster for
// `service-b` and a route configuration that directs all traffic to it.
// It assumes a recent go-control-plane release, where NewSnapshot takes a map
// of resources keyed by type URL and returns (*cache.Snapshot, error).
func createSnapshots() (*cache.Snapshot, error) {
	// Define the cluster that Envoy should send traffic to.
	// This is the actual backend service.
	serviceB := &clusterv3.Cluster{
		Name:                 "service_b", // Must match the cluster name used by the route below
		ConnectTimeout:       durationpb.New(1 * time.Second),
		ClusterDiscoveryType: &clusterv3.Cluster_Type{Type: clusterv3.Cluster_STRICT_DNS},
		DnsLookupFamily:      clusterv3.Cluster_V4_ONLY, // Or V6_ONLY, AUTO
		LbPolicy:             clusterv3.Cluster_ROUND_ROBIN,
		LoadAssignment: &endpointv3.ClusterLoadAssignment{
			ClusterName: "service_b",
			Endpoints: []*endpointv3.LocalityLbEndpoints{{
				LbEndpoints: []*endpointv3.LbEndpoint{{
					// The actual address of service-b. In a real system this
					// would come from a service discovery mechanism.
					HostIdentifier: &endpointv3.LbEndpoint_Endpoint{
						Endpoint: &endpointv3.Endpoint{
							Address: &corev3.Address{
								Address: &corev3.Address_SocketAddress{
									SocketAddress: &corev3.SocketAddress{
										Address:       "127.0.0.1",                                      // Replace with actual service-b IP
										PortSpecifier: &corev3.SocketAddress_PortValue{PortValue: 8080}, // Replace with actual service-b port
									},
								},
							},
						},
					},
				}},
			}},
		},
		// Basic active health checking: Envoy periodically sends HTTP GET
		// requests to /healthz. Adjust the interval and timeout as needed.
		HealthChecks: []*corev3.HealthCheck{{
			Interval:              durationpb.New(5 * time.Second),
			Timeout:               durationpb.New(1 * time.Second),
			UnhealthyThreshold:    wrapperspb.UInt32(3),
			HealthyThreshold:      wrapperspb.UInt32(2),
			NoTrafficInterval:     durationpb.New(1 * time.Second), // Interval while the cluster receives no traffic
			IntervalJitterPercent: 10,                              // 10% jitter
			HealthChecker: &corev3.HealthCheck_HttpHealthCheck_{
				HttpHealthCheck: &corev3.HealthCheck_HttpHealthCheck{Path: "/healthz"},
			},
		}},
	}

	// Define a route that matches all incoming requests and forwards them to `service_b`.
	route := &routev3.Route{
		Match: &routev3.RouteMatch{
			PathSpecifier: &routev3.RouteMatch_Prefix{Prefix: "/"},
		},
		Action: &routev3.Route_Route{
			Route: &routev3.RouteAction{
				ClusterSpecifier: &routev3.RouteAction_Cluster{Cluster: "service_b"},
				// Retry up to 3 times with exponential backoff (100ms base, 1s cap) on
				// gateway errors, connection failures, refused streams, and the
				// status codes listed in RetriableStatusCodes.
				RetryPolicy: &routev3.RetryPolicy{
					RetryOn:       "gateway-error,connect-failure,refused-stream,retriable-status-codes",
					NumRetries:    wrapperspb.UInt32(3),
					PerTryTimeout: durationpb.New(5 * time.Second),
					RetryBackOff: &routev3.RetryPolicy_RetryBackOff{
						BaseInterval: durationpb.New(100 * time.Millisecond),
						MaxInterval:  durationpb.New(1 * time.Second),
					},
					// 503 (Service Unavailable) and 504 (Gateway Timeout) are common choices.
					RetriableStatusCodes: []uint32{503, 504},
				},
			},
		},
	}

	// Create a VirtualHost that groups routes.
	virtualHost := &routev3.VirtualHost{
		Name:    "local_service",
		Domains: []string{"*"}, // This virtual host will match any domain
		Routes:  []*routev3.Route{route},
	}

	// Create a RouteConfiguration that contains the VirtualHost.
	routeConfig := &routev3.RouteConfiguration{
		Name:         "local_route", // Must match the route configuration name Envoy requests
		VirtualHosts: []*routev3.VirtualHost{virtualHost},
	}

	// Create a Snapshot containing the cluster and the route configuration.
	return cache.NewSnapshot("v1", map[resource.Type][]types.Resource{
		resource.ClusterType: {serviceB},
		resource.RouteType:   {routeConfig},
	})
}

// This is the main entry point for our control plane.
func main() {
	ctx := context.Background()

	// Initialize the snapshot cache in ADS mode, keyed by Envoy node ID.
	// The logger argument is optional, so nil is passed here.
	snapshotCache = cache.NewSnapshotCache(true, cache.IDHash{}, nil)

	// Create the initial snapshot.
	snap, err := createSnapshots()
	if err != nil {
		log.Fatalf("Failed to create snapshot: %v", err)
	}

	// Set the initial snapshot in the cache. Envoy will fetch this.
	// The cache uses the node ID to decide which snapshot goes to which Envoy instance.
	if err := snapshotCache.SetSnapshot(ctx, nodeID, snap); err != nil {
		log.Fatalf("Failed to set snapshot: %v", err)
	}

	// Create a gRPC server (plaintext here; add TLS credentials for production).
	grpcServer := grpc.NewServer()

	// Create an xDS server that will handle discovery requests.
	// It ties the cache to the gRPC streams: it receives each DiscoveryRequest from
	// Envoy and answers with a DiscoveryResponse built from the cached snapshot.
	xdsServer := server.NewServer(ctx, snapshotCache, nil) // No callbacks for simplicity

	// Register the AggregatedDiscoveryService with the gRPC server.
	// This makes our control plane available to receive xDS API calls from Envoy.
	discoveryv3.RegisterAggregatedDiscoveryServiceServer(grpcServer, xdsServer)

	// Start listening for gRPC connections on the specified port.
	lis, err := net.Listen("tcp", ":"+controlPlanePort)
	if err != nil {
		log.Fatalf("Failed to listen: %v", err)
	}
	log.Printf("Control plane listening on port %s", controlPlanePort)

	// Start serving gRPC requests.
	if err := grpcServer.Serve(lis); err != nil {
		log.Fatalf("Failed to serve: %v", err)
	}
}

When Envoy starts with envoy.yaml, it connects to 127.0.0.1:50001 (your control plane) and opens a single ADS gRPC stream. Over that stream it sends DiscoveryRequest messages for the type URLs type.googleapis.com/envoy.config.cluster.v3.Cluster and type.googleapis.com/envoy.config.route.v3.RouteConfiguration. Your Go control plane, acting as the AggregatedDiscoveryService, looks up the matching resources in its snapshotCache and returns them in DiscoveryResponse messages. Envoy then adds the service_b cluster and applies the route configuration to its listener.

The mental model here is that Envoy is a powerful, programmable proxy, but it’s fundamentally a "dumb" client until configured. The xDS APIs (LDS, RDS, CDS, EDS, SDS) are the language it speaks to receive this configuration. A control plane is essentially a gRPC server implementing these APIs, translating your desired state (e.g., "route all /api traffic to service-x") into the Envoy-specific xDS resource types.
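As a rough illustration of that translation step, the sketch below turns a high-level routing intent into a simplified route table. The Intent, Route, VirtualHost, and RouteConfig types are stand-ins invented for this example; a real control plane would build the corresponding go-control-plane protobuf messages instead.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Intent is a hypothetical operator-facing description of desired routing.
type Intent struct {
	Prefix  string // path prefix to match, e.g. "/api"
	Service string // target cluster name, e.g. "service_x"
}

// Route, VirtualHost and RouteConfig are simplified stand-ins for the
// xDS route messages; they are not the real protobuf types.
type Route struct {
	Prefix  string `json:"prefix"`
	Cluster string `json:"cluster"`
}

type VirtualHost struct {
	Name    string   `json:"name"`
	Domains []string `json:"domains"`
	Routes  []Route  `json:"routes"`
}

type RouteConfig struct {
	Name         string        `json:"name"`
	VirtualHosts []VirtualHost `json:"virtual_hosts"`
}

// translate converts intents into one route configuration. Longer prefixes
// are placed first because Envoy evaluates routes in order and uses the
// first match, so "/" must come after "/api".
func translate(name string, intents []Intent) RouteConfig {
	routes := make([]Route, 0, len(intents))
	for _, in := range intents {
		routes = append(routes, Route{Prefix: in.Prefix, Cluster: in.Service})
	}
	sort.SliceStable(routes, func(i, j int) bool {
		return len(routes[i].Prefix) > len(routes[j].Prefix)
	})
	return RouteConfig{
		Name: name,
		VirtualHosts: []VirtualHost{{
			Name:    "local_service",
			Domains: []string{"*"},
			Routes:  routes,
		}},
	}
}

func main() {
	rc := translate("local_route", []Intent{
		{Prefix: "/", Service: "service_b"},
		{Prefix: "/api", Service: "service_x"},
	})
	out, err := json.MarshalIndent(rc, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping the intent model separate from the xDS model is a design choice: operators edit the small Intent list, and only translate needs to know how Envoy orders and matches routes.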

The cache.SnapshotCache is the heart of the control plane. It’s where you define the desired state of your clusters, routes, and endpoints, stored as one versioned snapshot per Envoy node ID. Each DiscoveryRequest from Envoy carries the version it last accepted; the cache compares that against the current snapshot and responds only when it has something newer. The server.NewServer function ties the cache to the gRPC server, handling the complexities of the xDS protocol, including request/response management and stream lifetimes.
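To make the version handshake more concrete, here is a minimal sketch of how a state-of-the-world xDS server might classify an incoming request. The Request struct mirrors three fields of DiscoveryRequest (version_info, response_nonce, error_detail), but the type and the classify and shouldRespond helpers are hypothetical simplifications, not go-control-plane's actual logic.

```go
package main

import "fmt"

// Request is a simplified stand-in for an xDS DiscoveryRequest.
type Request struct {
	VersionInfo   string // last version the proxy accepted ("" if none yet)
	ResponseNonce string // nonce of the response being answered ("" on a new stream)
	ErrorDetail   string // non-empty when the proxy rejected the last response
}

// Kind describes what a request means to the control plane.
type Kind string

const (
	Initial Kind = "initial" // first request for this type on the stream
	ACK     Kind = "ack"     // proxy applied the last response
	NACK    Kind = "nack"    // proxy rejected the last response
	Stale   Kind = "stale"   // refers to a response that has been superseded
)

// classify interprets a request relative to the nonce of the last response
// the server sent for this resource type.
func classify(req Request, lastNonce string) Kind {
	switch {
	case req.ResponseNonce == "":
		return Initial
	case req.ResponseNonce != lastNonce:
		return Stale
	case req.ErrorDetail != "":
		return NACK
	default:
		return ACK
	}
}

// shouldRespond reports whether the server should send the current snapshot
// version now. After a NACK it waits for a new snapshot instead of resending
// the configuration the proxy just rejected.
func shouldRespond(req Request, lastNonce, currentVersion string) bool {
	switch classify(req, lastNonce) {
	case Stale, NACK:
		return false
	default:
		return req.VersionInfo != currentVersion
	}
}

func main() {
	const lastNonce, current = "n1", "v2"
	requests := []Request{
		{}, // first request on a new stream
		{VersionInfo: "v1", ResponseNonce: "n1"},                           // ACK of v1 while v2 is pending
		{VersionInfo: "v1", ResponseNonce: "n1", ErrorDetail: "bad route"}, // NACK
		{VersionInfo: "v2", ResponseNonce: "n0"},                           // stale nonce
	}
	for _, r := range requests {
		fmt.Printf("%-8s respond=%v\n", classify(r, lastNonce), shouldRespond(r, lastNonce, current))
	}
}
```

When shouldRespond returns false for an ACK, a real server keeps the request open as a watch and answers it once a newer snapshot is set.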

The most surprising thing about xDS is how it decouples Envoy’s data plane from its configuration plane. You can update routes, add new services, or change load balancing strategies without restarting Envoy. Envoy watches the xDS streams, and as soon as a new snapshot is available from the control plane, it applies the changes dynamically. This allows for zero-downtime updates and highly elastic infrastructure.
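The push behaviour described above can be sketched with a toy cache whose watches fire when a new snapshot is set. This is loosely inspired by the watch idea in snapshot caches, but the Snapshot and Cache types and their methods here are invented for illustration and use only the standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot is a toy stand-in for a versioned set of xDS resources.
type Snapshot struct {
	Version string
	Routes  map[string]string // path prefix -> cluster name
}

// Cache holds the current snapshot and the watches waiting for a newer one.
type Cache struct {
	mu       sync.Mutex
	current  Snapshot
	watchers []chan Snapshot
}

// Watch returns a channel that delivers exactly one snapshot: immediately if
// the cache already holds a version other than knownVersion, otherwise as
// soon as SetSnapshot is called.
func (c *Cache) Watch(knownVersion string) <-chan Snapshot {
	c.mu.Lock()
	defer c.mu.Unlock()
	ch := make(chan Snapshot, 1) // buffered so senders never block
	if c.current.Version != "" && c.current.Version != knownVersion {
		ch <- c.current
		return ch
	}
	c.watchers = append(c.watchers, ch)
	return ch
}

// SetSnapshot stores a new snapshot and wakes every open watch. The proxy
// process keeps running; only its configuration changes.
func (c *Cache) SetSnapshot(s Snapshot) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.current = s
	for _, ch := range c.watchers {
		ch <- s
	}
	c.watchers = nil
}

func main() {
	c := &Cache{}
	c.SetSnapshot(Snapshot{Version: "v1", Routes: map[string]string{"/": "service_b"}})

	// The proxy already runs v1, so this watch stays open.
	w := c.Watch("v1")

	// An operator publishes v2; the open watch fires without any restart.
	c.SetSnapshot(Snapshot{Version: "v2", Routes: map[string]string{"/": "service_b", "/api": "service_x"}})

	update := <-w
	fmt.Println("proxy received version", update.Version)
}
```

Each watch is one-shot by design: after receiving an update the proxy acknowledges it and opens a new watch with the version it now holds.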

The ads_config section in envoy.yaml is what enables the "all-in-one" dynamic configuration. Instead of a separate gRPC stream for each API, such as Cluster Discovery Service (CDS) and Route Discovery Service (RDS), ADS multiplexes every resource type over a single gRPC stream, which is what AggregatedDiscoveryServiceServer handles. A single stream also lets the control plane sequence updates, for example sending a new cluster before the route that references it, which makes managing configuration much simpler.
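One practical benefit of a single stream is that the control plane can push updates in a safe order. The sketch below sorts pending updates into the make-before-break sequence commonly recommended for xDS (clusters, then endpoints, then listeners, then routes). The Update type and orderForADS helper are hypothetical; only the type URLs are real.

```go
package main

import (
	"fmt"
	"sort"
)

// Type URLs for the four core xDS resource types.
const (
	clusterType  = "type.googleapis.com/envoy.config.cluster.v3.Cluster"
	endpointType = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"
	listenerType = "type.googleapis.com/envoy.config.listener.v3.Listener"
	routeType    = "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"
)

// pushOrder ranks types so that a resource is always sent after the
// resources it refers to: routes name clusters, listeners name routes.
var pushOrder = map[string]int{
	clusterType:  0,
	endpointType: 1,
	listenerType: 2,
	routeType:    3,
}

// Update is a hypothetical pending push for one resource type.
type Update struct {
	TypeURL string
	Version string
}

// rank returns the position of a type in the push order; unknown types go last.
func rank(typeURL string) int {
	if r, ok := pushOrder[typeURL]; ok {
		return r
	}
	return len(pushOrder)
}

// orderForADS returns a copy of pending sorted into make-before-break order.
func orderForADS(pending []Update) []Update {
	out := append([]Update(nil), pending...)
	sort.SliceStable(out, func(i, j int) bool {
		return rank(out[i].TypeURL) < rank(out[j].TypeURL)
	})
	return out
}

func main() {
	pending := []Update{
		{TypeURL: routeType, Version: "v2"},
		{TypeURL: clusterType, Version: "v2"},
	}
	for _, u := range orderForADS(pending) {
		fmt.Println(u.TypeURL, u.Version)
	}
}
```

With separate streams per API there is no such guarantee, so a route could briefly point at a cluster Envoy has not learned about yet.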

The next concept you’ll likely grapple with is implementing more sophisticated routing logic, like traffic splitting for canary deployments, request/response manipulation using HTTP filters, or managing TLS certificates dynamically.