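Envoy’s hot restart lets you upgrade the binary or reload its configuration without ever refusing new connections: a new process takes over the listen sockets while the old process drains the connections it already holds.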
Let’s see this in action. Imagine we have a running Envoy process serving traffic. We’ll simulate an upgrade by starting a new Envoy binary with a different configuration, and then gracefully shutting down the old one.
Here’s a basic envoy.yaml configuration we’ll use:
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]  # required by Envoy; "*" matches any Host header
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: some_service
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: some_service
    connect_timeout: 0.25s
    type: LOGICAL_DNS
    lb_policy: ROUND_ROBIN
    dns_lookup_family: V4_ONLY
    load_assignment:
      cluster_name: some_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 8080
We’ll start this initial Envoy:
/usr/local/bin/envoy -c /etc/envoy/envoy.yaml --service-cluster initial-envoy --service-node initial-node --restart-epoch 0
The key here is --restart-epoch 0, which marks this process as the first generation (0 is also the default). Envoy processes that share a base ID (--base-id, default 0) talk to each other over Unix domain sockets, and a later process started with the next epoch uses those sockets to reach this one.
Now, let’s say we have a new configuration, envoy_v2.yaml, which might change the upstream cluster or add a new filter. For this example, we’ll just change the upstream port to 8081:
# envoy_v2.yaml
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]  # required by Envoy; "*" matches any Host header
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: some_service_v2 # Note: cluster name changed for clarity, but not strictly necessary for hot-restart
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: some_service_v2 # Note: cluster name changed
    connect_timeout: 0.25s
    type: LOGICAL_DNS
    lb_policy: ROUND_ROBIN
    dns_lookup_family: V4_ONLY
    load_assignment:
      cluster_name: some_service_v2
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 8081 # Changed port
To perform the upgrade, we start the new Envoy binary with the same base ID and the restart epoch incremented by one:
/usr/local/bin/envoy -c /etc/envoy/envoy_v2.yaml --service-cluster initial-envoy --service-node initial-node --restart-epoch 1
A process started with a restart epoch greater than 0 (--restart-epoch) knows that an older generation is running and contacts it over the hot-restart Unix domain socket. It first initializes itself fully, then asks the old process for copies of its listen sockets and imports its statistics. Established connections are not transferred; they stay with the old process. Once the new process is listening, it tells the old one to start draining, and later tells it to exit.
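Envoy identifies each generation by its restart epoch (--restart-epoch), which has to go up by one on every hot restart, so it helps to let a small wrapper keep track of it. The following is a minimal sketch, not part of Envoy: the state file location and the next_epoch helper are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the epoch to use for the next start and
# records it in a state file. The first call prints 0, later calls 1, 2, ...
EPOCH_FILE="${EPOCH_FILE:-/tmp/envoy_restart_epoch}"   # assumed location

next_epoch() {
  local epoch=0
  if [ -s "$EPOCH_FILE" ]; then
    # A previous epoch was recorded: use the next one.
    epoch=$(( $(cat "$EPOCH_FILE") + 1 ))
  fi
  printf '%s\n' "$epoch" > "$EPOCH_FILE"
  printf '%s\n' "$epoch"
}

# Usage (commented out so the sketch can be sourced safely):
# /usr/local/bin/envoy -c /etc/envoy/envoy_v2.yaml --restart-epoch "$(next_epoch)"
```

Delete the state file once no Envoy process is running, so that the next fresh start begins at epoch 0 again.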
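You normally don’t have to stop the old process yourself: the new process tells it to shut down once --parent-shutdown-time-s has elapsed. If you want it gone sooner, for example in a test, send SIGTERM to its PID: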
# Find the PID of the old Envoy process
# Assuming it's the only envoy process running for simplicity
# (the dot is escaped because pgrep -f treats the pattern as a regular expression)
OLD_ENVOY_PID=$(pgrep -f "envoy -c /etc/envoy/envoy\.yaml")
# Send SIGTERM
kill -TERM "$OLD_ENVOY_PID"
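A deployment script usually needs to know when the old process has actually exited before it moves on. One way to sketch that is to poll the PID with a deadline; wait_for_exit is a hypothetical helper name, and the timeout is whatever suits your environment.

```shell
# Hypothetical helper: wait up to $2 seconds for PID $1 to exit.
# Returns 0 if the process is gone, 1 if it is still running at the deadline.
wait_for_exit() {
  local pid=$1 timeout=$2 waited=0
  # kill -0 sends no signal; it only checks that the PID still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage (commented out so the sketch can be sourced safely):
# wait_for_exit "$OLD_ENVOY_PID" 30 || echo "old Envoy is still running" >&2
```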
Be aware that SIGTERM makes the old process exit promptly, closing whatever connections it still holds, so send it only after the drain period has had time to run. The new Envoy process, which already owns the listen sockets, keeps accepting new connections throughout.
The core mechanism relies on a parent-child relationship between the two generations. The new process is the "child"; it contacts the "parent" (the old process) over a Unix domain socket using a simple RPC protocol. This handshake lets the parent pass file descriptors for its listen sockets to the child and send over its statistics.
When you start the new Envoy, it doesn’t immediately take over. It first loads its configuration and finishes initialization, including initial service discovery and health checking. Only then does it ask the old process, over the hot-restart socket, for copies of its listen sockets. The old process passes the file descriptors across, so the new process does not have to bind the ports itself. The new process starts accepting connections on those sockets and tells the old process to begin draining. During the drain window (--drain-time-s, 600 seconds by default) the old process tries to close its existing connections gracefully; for HTTP, for example, it adds Connection: close to HTTP/1 responses and sends GOAWAY on HTTP/2 connections. Finally, after --parent-shutdown-time-s (900 seconds by default), the new process tells the old one to exit.
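The two timers interact: if the old process is shut down before its drain window ends, the drain is cut short. A wrapper can guard against that with a simple comparison. This is a sketch under that reasoning, not a check Envoy itself performs; check_restart_timings and the example values are hypothetical, while --drain-time-s and --parent-shutdown-time-s are Envoy’s own flags.

```shell
# Hypothetical check: the old process should be shut down only after
# its drain window has ended. Arguments: drain seconds, parent shutdown seconds.
check_restart_timings() {
  local drain_s=$1 parent_shutdown_s=$2
  if [ "$parent_shutdown_s" -le "$drain_s" ]; then
    echo "parent shutdown (${parent_shutdown_s}s) would cut the drain (${drain_s}s) short" >&2
    return 1
  fi
  return 0
}

# Example values, chosen for illustration only.
DRAIN_S=30
PARENT_SHUTDOWN_S=45
check_restart_timings "$DRAIN_S" "$PARENT_SHUTDOWN_S" &&
  echo "--drain-time-s $DRAIN_S --parent-shutdown-time-s $PARENT_SHUTDOWN_S"
```

The echoed string can be appended to the command line of the new Envoy process.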
A common point of confusion is understanding what "state" is transferred. It is the listen sockets and the statistics (counters and most gauges), not the established connections. Each connection stays with the process that accepted it: connections opened before the restart are served by the old process until they close or the drain ends, and connections opened afterwards go to the new process. Hot restart therefore guarantees that new connections are never refused during the upgrade, but a connection that outlives the parent shutdown time is closed when the old process exits, so size that window for your longest-lived connections.
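It’s crucial that both processes use the same --base-id and that the new process is started with a --restart-epoch exactly one higher than the old one’s. If you override the socket location with --socket-path, both processes must use the same value and must be able to access it. If any of these do not match, the two processes cannot find each other: depending on the mismatch, the new Envoy may fail to start, or it may come up as a completely independent process that never takes over from the old one.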
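When the upgrade replaces the binary itself, the two builds also have to agree on the hot restart protocol. Envoy prints an opaque compatibility version with --hot-restart-version, which can be compared before the new process is launched. A minimal sketch, where same_hot_restart_version and the path of the new binary are hypothetical:

```shell
# Hypothetical helper: succeed only if both binaries report the same,
# non-empty hot restart compatibility version.
# Returns 1 on a mismatch and 2 if either binary cannot be queried.
same_hot_restart_version() {
  local old_version new_version
  old_version=$("$1" --hot-restart-version) || return 2
  new_version=$("$2" --hot-restart-version) || return 2
  [ -n "$old_version" ] && [ "$old_version" = "$new_version" ]
}

# Usage (commented out; /usr/local/bin/envoy-new is an assumed path):
# same_hot_restart_version /usr/local/bin/envoy /usr/local/bin/envoy-new ||
#   echo "binaries are not hot restart compatible; plan a full restart" >&2
```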
Because the new Envoy process inherits the listen sockets from the old one instead of binding the ports itself, a successful hot restart should not produce Address already in use errors. If you do see them, the new process most likely started as a fresh instance (for example with --restart-epoch 0 or a different --base-id) while the old process, or some other program, was still listening on the port.