An etcd watcher reconnecting isn’t a failure; it’s a feature designed to keep your application state synchronized even when network hiccups occur.
Let’s see it in action. Imagine we have a simple etcd key mykey that we want to watch for changes.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Watch for changes on mykey.
	rch := cli.Watch(ctx, "mykey")
	fmt.Println("Watching for changes on mykey...")

	for watchResp := range rch {
		if err := watchResp.Err(); err != nil {
			// Transient disconnects are retried inside the client and
			// normally never show up here. A non-nil Err() means the
			// watch itself was canceled (for example, the revision it
			// needed was compacted), and the channel closes afterwards.
			log.Printf("Watch canceled: %v", err)
			break
		}
		for _, ev := range watchResp.Events {
			fmt.Printf("Event received: %s %s : %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
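Now, let’s simulate a network disruption. If you stop the etcd server and then restart it, your Go application’s watcher will automatically re-establish its connection and resume watching from where it left off. The go.etcd.io/etcd/client/v3 library handles this reconnect logic for you, and a transient disconnection ordinarily does not surface in your loop at all: events simply pause and then continue. A non-nil watchResp.Err() means something different. It indicates that the watch itself was canceled, for example because the revision it needed to resume from was compacted, and the channel is closed afterwards. At that point the library will not resync this watch; your application has to create a new one.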
The core problem etcd watchers solve is maintaining a real-time, consistent view of distributed state across multiple application instances. Instead of clients polling etcd for changes, etcd pushes updates to the watchers. This is crucial for distributed coordination, leader election, and configuration management.
Internally, etcd watchers use gRPC streams. When a client establishes a watch, it opens a long-lived gRPC stream over its connection to the etcd server. The server then pushes WatchResponse messages over this stream as events occur. If the network connection breaks, the gRPC stream is interrupted. The etcd client library detects this interruption (typically as a stream or transport error) and initiates a reconnect sequence. During the reconnect, it re-establishes the gRPC stream and, importantly, resumes the watch from the revision just after the last one it received. No events are missed as long as that revision is still in the server’s history; if it has been compacted away in the meantime, the watch is canceled with a compaction error instead.
The key levers you control are primarily within the clientv3.Config. The DialTimeout (e.g., 5 * time.Second) dictates how long the client will wait to establish an initial connection. While not directly related to reconnects, a sensible timeout prevents your application from hanging indefinitely if etcd is unreachable on startup. More relevant to watcher resilience are DialKeepAliveTime and DialKeepAliveTimeout, which control how often the client pings the server and how long it waits for a reply before treating the connection as dead. You don’t typically need to implement retry loops for transient stream failures; the library’s internal mechanisms are designed to keep the stream alive. You do need to handle watchResp.Err() in your application loop, because it signals that the watch has ended: log the error, then decide whether to create a new watch (possibly after reloading the current state) or to shut down.
The etcd client library manages the complexity of reconnecting and resynchronizing the watch stream. When a connection is lost, the client automatically attempts to re-establish it. Upon successful reconnection, it re-creates the watch with a start revision one past the last revision it received, which is the same thing you would request explicitly with the clientv3.WithRev option. The server then replays every event from that revision onward, so the client receives all events that occurred during the disconnection period, provided that revision has not been compacted.
The next thing you’ll likely encounter is handling application-level timeouts for operations that depend on the watched state.