An Envoy that is set up for dynamic configuration doesn’t do much until a control plane tells it what to do via its xDS APIs.
Let’s see Envoy in action. Imagine we have a simple setup: two services, service-a and service-b, running on different ports, and we want to route traffic from a client to service-b.
First, we need an Envoy configuration that tells it to listen for traffic and connect to a control plane. This envoy.yaml is pretty basic:
# Identifies this Envoy to the control plane. The id must match the key under
# which the control plane stores its snapshot ("test-domain" in the Go code);
# the cluster value is an arbitrary label.
node:
  id: test-domain
  cluster: test-cluster

dynamic_resources:
  # ADS: one gRPC stream to the control plane carries every resource type.
  # This is the crucial part for dynamic updates.
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        cluster_name: xds_cluster # Tells Envoy to use the xds_cluster defined below
  # Clusters are defined by the control plane and fetched over the ADS stream.
  cds_config:
    resource_api_version: V3
    ads: {}

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          # The route table is fetched from the control plane by name (RDS)
          # rather than being inlined here.
          rds:
            route_config_name: local_route
            config_source:
              resource_api_version: V3
              ads: {}
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  # Define the xDS cluster that Envoy will use to connect to the control plane.
  # This cluster has to be static: Envoy needs it to bootstrap its connection
  # to the control plane before any dynamic configuration exists.
  clusters:
  - name: xds_cluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    # xDS runs over gRPC, so the connection must speak HTTP/2.
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    # This is where Envoy *tries* to connect to your control plane.
    # You'll need to replace this with your actual control plane's address.
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: "127.0.0.1" # Replace with your control plane IP/hostname
                port_value: 50001 # Replace with your control plane gRPC port
    # TLS to the control plane is highly recommended for security. The sample
    # Go control plane in this article serves plaintext gRPC, so the block is
    # left commented out; enable it once the control plane serves TLS.
    # If your control plane uses a self-signed cert or a specific CA,
    # you'll also need to configure common_tls_context with a trusted CA.
    # transport_socket:
    #   name: envoy.transport_sockets.tls
    #   typed_config:
    #     "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    #     sni: "control-plane.example.com"
Now, let’s imagine a very simple Go control plane that implements the AggregatedDiscoveryService gRPC API and serves up a route and a cluster.
package main

import (
	"context"
	"log"
	"net"
	"time"

	// xDS API imports
	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"

	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	"github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	"github.com/envoyproxy/go-control-plane/pkg/resource/v3"
	"github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
	"google.golang.org/protobuf/types/known/durationpb"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

const (
	// The port where the control plane gRPC server will listen.
	// Envoy's `xds_cluster` in `envoy.yaml` should point to this.
	controlPlanePort = "50001"

	// Key under which the snapshot is stored. With cache.IDHash this has to
	// equal the node id that Envoy reports in its bootstrap configuration.
	nodeID = "test-domain"
)

// createSnapshot sets up the initial desired state for Envoy.
// It defines a cluster for `service-b` and a route that directs traffic to it.
// The NewSnapshot and SetSnapshot signatures used in this file are those of
// recent go-control-plane releases; older releases differ.
func createSnapshot() (*cache.Snapshot, error) {
	// Define the cluster that Envoy should send traffic to.
	// This is the actual backend service.
	cluster := &clusterv3.Cluster{
		Name:                 "service_b", // Must match the cluster named in the route below
		ConnectTimeout:       durationpb.New(250 * time.Millisecond),
		ClusterDiscoveryType: &clusterv3.Cluster_Type{Type: clusterv3.Cluster_STRICT_DNS},
		LbPolicy:             clusterv3.Cluster_ROUND_ROBIN,
		// Envoy resolves the endpoint address below via DNS (a literal IP is used as is).
		// You'd typically use a service discovery mechanism here.
		DnsLookupFamily: clusterv3.Cluster_V4_ONLY, // Or V6_ONLY, AUTO
		LoadAssignment: &endpointv3.ClusterLoadAssignment{
			ClusterName: "service_b",
			Endpoints: []*endpointv3.LocalityLbEndpoints{{
				LbEndpoints: []*endpointv3.LbEndpoint{{
					HostIdentifier: &endpointv3.LbEndpoint_Endpoint{
						// The actual address of service-b.
						// Again, in a real system, this would be dynamic.
						Endpoint: &endpointv3.Endpoint{
							Address: &corev3.Address{
								Address: &corev3.Address_SocketAddress{
									SocketAddress: &corev3.SocketAddress{
										Protocol: corev3.SocketAddress_TCP,
										Address:  "127.0.0.1", // Replace with actual service-b IP
										PortSpecifier: &corev3.SocketAddress_PortValue{
											PortValue: 8080, // Replace with actual service-b port
										},
									},
								},
							},
						},
					},
				}},
			}},
		},
		// Basic health checking configuration.
		// Envoy will periodically send HTTP GET requests to check if service-b is alive.
		// Adjust the interval and timeout as needed.
		HealthChecks: []*corev3.HealthCheck{{
			Interval:              durationpb.New(5 * time.Second),
			Timeout:               durationpb.New(1 * time.Second),
			UnhealthyThreshold:    wrapperspb.UInt32(3),
			HealthyThreshold:      wrapperspb.UInt32(2),
			NoTrafficInterval:     durationpb.New(1 * time.Second), // Interval while the cluster receives no traffic
			IntervalJitterPercent: 10,                              // 10% jitter
			HealthChecker: &corev3.HealthCheck_HttpHealthCheck_{
				HttpHealthCheck: &corev3.HealthCheck_HttpHealthCheck{Path: "/healthz"},
			},
		}},
	}

	// Define a route that matches all incoming requests and forwards them to `service_b`.
	route := &routev3.Route{
		Match: &routev3.RouteMatch{
			PathSpecifier: &routev3.RouteMatch_Prefix{Prefix: "/"},
		},
		Action: &routev3.Route_Route{
			Route: &routev3.RouteAction{
				ClusterSpecifier: &routev3.RouteAction_Cluster{Cluster: "service_b"},
				// Up to 3 retries with exponential backoff starting at 100ms and
				// capped at 1s, on gateway errors, connection failures, refused
				// streams, and the status codes listed below.
				RetryPolicy: &routev3.RetryPolicy{
					RetryOn:       "gateway-error,connect-failure,refused-stream,retriable-status-codes",
					NumRetries:    wrapperspb.UInt32(3),
					PerTryTimeout: durationpb.New(5 * time.Second),
					RetryBackOff: &routev3.RetryPolicy_RetryBackOff{
						BaseInterval: durationpb.New(100 * time.Millisecond),
						MaxInterval:  durationpb.New(1 * time.Second),
					},
					// 503 (Service Unavailable) and 504 (Gateway Timeout) are common.
					RetriableStatusCodes: []uint32{503, 504},
				},
			},
		},
	}

	// A RouteConfiguration contains VirtualHosts, which group routes.
	routeConfig := &routev3.RouteConfiguration{
		Name: "local_route", // Must match the route configuration name referenced in envoy.yaml
		VirtualHosts: []*routev3.VirtualHost{{
			Name:    "local_service",
			Domains: []string{"*"}, // This virtual host will match any domain
			Routes:  []*routev3.Route{route},
		}},
	}

	// Create a Snapshot containing the defined cluster and route configuration.
	// Listeners, endpoints, secrets, and runtime are left out.
	return cache.NewSnapshot("v1", map[resource.Type][]types.Resource{
		resource.ClusterType: {cluster},
		resource.RouteType:   {routeConfig},
	})
}

// This is the main entry point for our control plane.
func main() {
	ctx := context.Background()

	// This is the core of the control plane: a cache that holds the desired state.
	// Envoy will watch this cache for updates. `true` enables ADS mode, and
	// cache.IDHash keys snapshots by the node id that Envoy reports.
	snapshotCache := cache.NewSnapshotCache(true, cache.IDHash{}, nil)

	// Create the initial snapshot and store it in the cache. Envoy will fetch this.
	snap, err := createSnapshot()
	if err != nil {
		log.Fatalf("Failed to create snapshot: %v", err)
	}
	if err := snapshotCache.SetSnapshot(ctx, nodeID, snap); err != nil {
		log.Fatalf("Failed to set snapshot: %v", err)
	}

	// Create a gRPC server (plaintext; add TLS credentials for real deployments).
	grpcServer := grpc.NewServer()

	// Create an xDS server that will handle discovery requests.
	// It receives requests from Envoy (DiscoveryRequest) and sends back
	// responses (DiscoveryResponse) based on the cached configuration.
	xdsServer := server.NewServer(ctx, snapshotCache, nil) // No callbacks for simplicity

	// Register the AggregatedDiscoveryService with the gRPC server.
	// This makes our control plane available to receive xDS API calls from Envoy.
	discoveryv3.RegisterAggregatedDiscoveryServiceServer(grpcServer, xdsServer)

	// Start listening for gRPC connections on the specified port.
	lis, err := net.Listen("tcp", ":"+controlPlanePort)
	if err != nil {
		log.Fatalf("Failed to listen: %v", err)
	}
	log.Printf("Control plane listening on port %s", controlPlanePort)

	// Start serving gRPC requests.
	if err := grpcServer.Serve(lis); err != nil {
		log.Fatalf("Failed to serve: %v", err)
	}
}
When Envoy starts with envoy.yaml, it will attempt to connect to 127.0.0.1:50001 (your control plane) and open a single ADS gRPC stream. Over that stream it sends discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster and type.googleapis.com/envoy.config.route.v3.RouteConfiguration. Your Go control plane, acting as the AggregatedDiscoveryService, will receive these requests, look up the corresponding resources in its snapshotCache, and send them back to Envoy. The lookup is per node, so the node ID that Envoy reports has to match the key passed to SetSnapshot. Envoy will then configure its clusters and routes accordingly.
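The request and response flow described above can be modelled without any xDS dependency. The following is a minimal sketch using only the standard library, not the go-control-plane implementation: Request, Response, Snapshot, and serveStream are hypothetical names, and resources are reduced to their names. Only the two type URLs come from the real API.

```go
package main

import "fmt"

// Type URLs Envoy uses to name the resource kinds it is asking for.
const (
	ClusterType = "type.googleapis.com/envoy.config.cluster.v3.Cluster"
	RouteType   = "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"
)

// Request and Response are simplified, hypothetical stand-ins for the
// DiscoveryRequest and DiscoveryResponse messages.
type Request struct {
	NodeID  string
	TypeURL string
}

type Response struct {
	TypeURL   string
	Version   string
	Resources []string // resource names only, to keep the sketch small
}

// Snapshot holds one version of the configuration, grouped by type URL.
type Snapshot struct {
	Version   string
	Resources map[string][]string
}

// serveStream answers every request arriving on one stream by looking up
// the requesting node's snapshot and then the requested type URL.
func serveStream(snapshots map[string]Snapshot, reqs <-chan Request, resps chan<- Response) {
	defer close(resps)
	for req := range reqs {
		snap, ok := snapshots[req.NodeID]
		if !ok {
			continue // no snapshot for this node, so nothing to send
		}
		resps <- Response{
			TypeURL:   req.TypeURL,
			Version:   snap.Version,
			Resources: snap.Resources[req.TypeURL],
		}
	}
}

func main() {
	snapshots := map[string]Snapshot{
		"test-domain": {Version: "v1", Resources: map[string][]string{
			ClusterType: {"service_b"},
			RouteType:   {"local_route"},
		}},
	}

	// Buffered channels stand in for the two directions of one gRPC stream.
	reqs := make(chan Request, 2)
	resps := make(chan Response, 2)
	go serveStream(snapshots, reqs, resps)

	reqs <- Request{NodeID: "test-domain", TypeURL: ClusterType}
	reqs <- Request{NodeID: "test-domain", TypeURL: RouteType}
	close(reqs)

	for resp := range resps {
		fmt.Printf("%s %s %v\n", resp.Version, resp.TypeURL, resp.Resources)
	}
}
```

A node whose ID has no snapshot simply gets no response in this sketch, which mirrors why a mismatched node ID leaves Envoy waiting for configuration.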
The mental model here is that Envoy is a powerful, programmable proxy, but it’s fundamentally a "dumb" client until configured. The xDS APIs (LDS, RDS, CDS, EDS, SDS) are the language it speaks to receive this configuration. A control plane is essentially a gRPC server implementing these APIs, translating your desired state (e.g., "route all /api traffic to service-x") into the Envoy-specific xDS resource types.
The cache.SnapshotCache is the heart of the control plane. It’s where you define the desired state of your services, routes, and endpoints, stored per node. Each request from Envoy carries the version it last accepted, and the cache answers with the current snapshot whenever that version differs. The server.NewServer function ties the cache to the gRPC server, handling the complexities of the xDS protocol, including request/response management and stream lifetimes.
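To make the version handling concrete, here is a small stdlib-only sketch of a versioned, per-node store. It illustrates the idea and is not how cache.SnapshotCache is implemented; SnapshotStore, Set, and Fetch are hypothetical names, and a resource is reduced to a name and an address string.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot is a simplified stand-in for an xDS snapshot: a version string
// plus named resources. (Hypothetical type, not the go-control-plane one.)
type Snapshot struct {
	Version   string
	Resources map[string]string
}

// SnapshotStore maps a node ID to the snapshot that node should receive.
type SnapshotStore struct {
	mu     sync.RWMutex
	byNode map[string]Snapshot
}

func NewSnapshotStore() *SnapshotStore {
	return &SnapshotStore{byNode: make(map[string]Snapshot)}
}

// Set publishes a snapshot for a node, replacing any earlier one.
func (s *SnapshotStore) Set(nodeID string, snap Snapshot) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.byNode[nodeID] = snap
}

// Fetch returns the node's snapshot only when its version differs from the
// one the client already holds; ok is false when there is nothing to send.
func (s *SnapshotStore) Fetch(nodeID, haveVersion string) (snap Snapshot, ok bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	current, found := s.byNode[nodeID]
	if !found || current.Version == haveVersion {
		return Snapshot{}, false
	}
	return current, true
}

func main() {
	store := NewSnapshotStore()
	store.Set("test-domain", Snapshot{
		Version:   "v1",
		Resources: map[string]string{"service_b": "127.0.0.1:8080"},
	})

	// First request: the client has no version yet, so it receives v1.
	if snap, ok := store.Fetch("test-domain", ""); ok {
		fmt.Println("send", snap.Version)
	}
	// Second request: the client already holds v1, so nothing is sent.
	if _, ok := store.Fetch("test-domain", "v1"); !ok {
		fmt.Println("up to date")
	}
}
```

In the real protocol a request for an up-to-date version is held open until a newer snapshot arrives; the sketch returns immediately to stay short.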
The most surprising thing about xDS is how it decouples Envoy’s data plane from its configuration plane. You can update routes, add new services, or change load balancing strategies without restarting Envoy. Envoy watches the xDS streams, and as soon as a new snapshot is available from the control plane, it applies the changes dynamically. This allows for zero-downtime updates and highly elastic infrastructure.
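The effect of applying a new snapshot without a restart can be sketched with a toy proxy whose route table is swapped while it keeps answering lookups. This is an illustration under stated assumptions, not Envoy's internals: Proxy, Route, Apply, and Pick are hypothetical names, service_x is a made-up second backend, and matching is simplified to first-prefix-wins.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Route maps a path prefix to a cluster name.
type Route struct {
	Prefix  string
	Cluster string
}

// Proxy keeps the currently active route table behind a lock so it can be
// replaced while lookups continue.
type Proxy struct {
	mu      sync.RWMutex
	version string
	routes  []Route
}

// Apply swaps in a new route table, as a proxy would on receiving an update.
func (p *Proxy) Apply(version string, routes []Route) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.version, p.routes = version, routes
}

// Pick returns the cluster of the first route whose prefix matches the path.
func (p *Proxy) Pick(path string) (string, bool) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	for _, r := range p.routes {
		if strings.HasPrefix(path, r.Prefix) {
			return r.Cluster, true
		}
	}
	return "", false
}

func main() {
	p := &Proxy{}

	// v1: everything goes to service_b.
	p.Apply("v1", []Route{{Prefix: "/", Cluster: "service_b"}})
	cluster, _ := p.Pick("/api/users")
	fmt.Println("v1:", cluster)

	// v2: /api traffic moves to service_x; the same Proxy value keeps running.
	p.Apply("v2", []Route{
		{Prefix: "/api", Cluster: "service_x"},
		{Prefix: "/", Cluster: "service_b"},
	})
	cluster, _ = p.Pick("/api/users")
	fmt.Println("v2:", cluster)
}
```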
The ads_config section in envoy.yaml is what enables the "all-in-one" dynamic configuration. Instead of separate gRPC streams for Cluster Discovery Service (CDS), Route Discovery Service (RDS), etc., ADS consolidates them into a single gRPC stream, which is what AggregatedDiscoveryServiceServer handles. Because everything flows over one stream to one server, the control plane can also sequence updates, for example sending a new cluster before the route that references it.
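One practical use of a single stream is ordering. Envoy's xDS documentation recommends a make-before-break sequence in which clusters (CDS) are pushed first, then endpoints (EDS), then listeners (LDS), then routes (RDS), so a route never points at a cluster Envoy has not seen yet. The sketch below sorts a batch of pending updates into that order. Update, pushOrder, and orderForADS are hypothetical names for illustration, and unknown kinds are placed last as an arbitrary choice.

```go
package main

import (
	"fmt"
	"sort"
)

// Update is a pending change of one resource kind.
type Update struct {
	Kind string // "CDS", "EDS", "LDS" or "RDS"
	Name string
}

// pushOrder ranks resource kinds in make-before-break order.
var pushOrder = map[string]int{"CDS": 0, "EDS": 1, "LDS": 2, "RDS": 3}

// rank returns the position of a kind, placing unknown kinds last.
func rank(kind string) int {
	if r, ok := pushOrder[kind]; ok {
		return r
	}
	return len(pushOrder)
}

// orderForADS returns a copy of updates sorted so that dependencies are
// sent before the resources that reference them. The sort is stable, so
// updates of the same kind keep their original relative order.
func orderForADS(updates []Update) []Update {
	ordered := append([]Update(nil), updates...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return rank(ordered[i].Kind) < rank(ordered[j].Kind)
	})
	return ordered
}

func main() {
	pending := []Update{
		{Kind: "RDS", Name: "local_route"},
		{Kind: "CDS", Name: "service_b"},
	}
	for _, u := range orderForADS(pending) {
		fmt.Println(u.Kind, u.Name)
	}
}
```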
The next concept you’ll likely grapple with is implementing more sophisticated routing logic, like traffic splitting for canary deployments, request/response manipulation using HTTP filters, or managing TLS certificates dynamically.