Distributed locks are a critical mechanism for ensuring that only one process or node can access a shared resource at a time. However, a common and dangerous failure mode occurs when a lock holder crashes or becomes unresponsive after acquiring the lock but before releasing it. This leaves the lock "stale" or "held" indefinitely, preventing any other process from accessing the resource, effectively causing a denial-of-service. Fencing tokens are the standard solution to prevent stale distributed lock holders from causing damage.
Let’s see this in action with a hypothetical distributed lock service that uses Redis. Imagine we have a service that manages a critical configuration file. Multiple application instances might try to update this file, but only one should succeed at a time.
Here’s a simplified view of how a client might interact with a distributed lock service, and where fencing tokens come into play.
import redis
import time
import uuid
# Assume this is your distributed lock client
class DistributedLock:
def __init__(self, redis_client, lock_name):
self.redis = redis_client
self.lock_name = lock_name
self.lock_key = f"lock:{lock_name}"
self.token_key = f"lock:{lock_name}:token"
self.owner_id = None # Unique ID for this lock holder instance
def acquire(self, ttl_seconds=10):
self.owner_id = str(uuid.uuid4())
# Try to acquire the lock. SETNX with EX is atomic.
# We also store our unique owner_id and a fencing token.
# The fencing token is initialized to 0 and incremented on each lock acquisition.
# This initial acquisition will get token 0 if successful.
acquired = self.redis.set(
self.lock_key,
f"{self.owner_id}:0", # Store owner_id and initial fencing token
nx=True,
ex=ttl_seconds
)
if acquired:
print(f"[{self.owner_id}] Acquired lock '{self.lock_name}' with token 0.")
return True
print(f"[{self.owner_id}] Failed to acquire lock '{self.lock_name}'.")
return False
def release(self):
if not self.owner_id:
return False
# Retrieve the current owner and token
current_lock_value = self.redis.get(self.lock_key)
if current_lock_value:
current_owner, current_token_str = current_lock_value.decode().split(':')
current_token = int(current_token_str)
# Only release if we are the current owner
if current_owner == self.owner_id:
# We don't need to increment the token on release,
# the increment happens on acquisition.
self.redis.delete(self.lock_key)
print(f"[{self.owner_id}] Released lock '{self.lock_name}'.")
self.owner_id = None
return True
print(f"[{self.owner_id}] Lock '{self.lock_name}' not held by me or already expired/stale.")
return False
def is_stale(self):
# This is a simplified check. A real system would have a more robust way
# to check if the lock is truly stale and needs to be forcibly released.
# For demonstration, we'll just check if the lock key exists.
return self.redis.exists(self.lock_key) == 0
# --- Simulation ---
r = redis.Redis(decode_responses=True)
lock_name = "config_file_lock"
# Instance 1 acquires the lock
instance1 = DistributedLock(r, lock_name)
instance1.acquire(ttl_seconds=5) # Lock expires in 5 seconds
# Simulate instance 1 crashing *after* acquiring the lock but *before* releasing it.
# The lock will remain in Redis for 5 seconds.
# Let's say instance 1 experiences a network partition and its process continues to run,
# thinking it still holds the lock. It might try to write to the config file.
# Meanwhile, another instance, instance 2, tries to acquire the lock.
# It will fail because instance 1's lock is still active.
instance2 = DistributedLock(r, lock_name)
print("\nInstance 2 trying to acquire lock...")
if not instance2.acquire(ttl_seconds=5):
print("Instance 2 waiting for lock to be released or expire...")
time.sleep(6) # Wait for instance 1's lock to expire
print("Instance 2 trying to acquire lock again after expiration...")
if instance2.acquire(ttl_seconds=5):
print("Instance 2 acquired the lock.")
# Instance 2 can now safely update the config file.
# But what if instance 1 *wasn't* actually crashed and *did* eventually release the lock?
# Or worse, what if instance 1 thought it lost the lock, but the lock *didn't* expire
# (e.g., due to clock skew or a bug in the TTL mechanism)?
# This is where fencing tokens save us.
# Let's rewind and introduce the fencing token mechanism properly.
# The lock value will be "owner_id:fencing_token".
# The fencing token is an ever-increasing integer.
def acquire_with_fencing(redis_client, lock_name, ttl_seconds=10):
owner_id = str(uuid.uuid4())
lock_key = f"lock:{lock_name}"
token_key = f"lock:{lock_name}:token" # A separate key to store the current token
# Get the current fencing token, initialize to 0 if it doesn't exist
current_fencing_token = redis_client.get(token_key)
if current_fencing_token is None:
# Use a Lua script for atomic get-or-set-to-0
lua_script = """
if redis.call('exists', KEYS[1]) == 0 then
redis.call('set', KEYS[1], '0')
return 0
else
return tonumber(redis.call('get', KEYS[1]))
end
"""
current_fencing_token = redis_client.eval(lua_script, 1, token_key)
else:
current_fencing_token = int(current_fencing_token)
# Prepare the value to set: owner_id and the *next* fencing token
new_fencing_token = current_fencing_token + 1
lock_value = f"{owner_id}:{new_fencing_token}"
# Try to acquire the lock using SET NX EX
acquired = redis_client.set(lock_key, lock_value, nx=True, ex=ttl_seconds)
if acquired:
# If lock acquired, update the fencing token key to the new token
redis_client.set(token_key, new_fencing_token)
print(f"[{owner_id}] Acquired lock '{lock_name}' with fencing token {new_fencing_token}.")
return owner_id, new_fencing_token
else:
print(f"[{owner_id}] Failed to acquire lock '{lock_name}'.")
return None, None
def release_with_fencing(redis_client, lock_name, owner_id):
lock_key = f"lock:{lock_name}"
current_lock_value = redis_client.get(lock_key)
if current_lock_value:
current_owner, _ = current_lock_value.decode().split(':')
if current_owner == owner_id:
redis_client.delete(lock_key)
print(f"[{owner_id}] Released lock '{lock_name}'.")
return True
print(f"[{owner_id}] Lock '{lock_name}' not held by me or already stale.")
return False
def write_to_resource(redis_client, lock_name, fencing_token):
# This function simulates writing to the protected resource.
# It MUST check the fencing token provided by the lock service.
resource_key = f"resource:{lock_name}"
stored_token = redis_client.get(f"{lock_name}:last_write_token")
if stored_token is None or fencing_token > int(stored_token):
print(f"Writing to resource '{lock_name}' with fencing token {fencing_token}...")
redis_client.set(resource_key, f"data_written_by_token_{fencing_token}")
redis_client.set(f"{lock_name}:last_write_token", fencing_token) # Store the token used for this write
print("Write successful.")
return True
else:
print(f"Stale write attempt to resource '{lock_name}' blocked. Provided token {fencing_token} <= stored token {stored_token}.")
return False
# --- Simulation with Fencing Tokens ---
r = redis.Redis(decode_responses=True)
lock_name = "critical_data"
# Instance A acquires the lock
owner_a, token_a = acquire_with_fencing(r, lock_name, ttl_seconds=5)
# Simulate Instance A crashing or becoming unresponsive *after* acquiring the lock.
# The lock will expire in Redis after 5 seconds.
# BUT, imagine Instance A's network connection drops, it doesn't realize the lock expired,
# and it proceeds to "write" to the critical data.
# Let's force a delay and then simulate Instance A trying to write *after* its lock has expired.
print("\nSimulating Instance A continuing to operate after its lock expired...")
time.sleep(6) # Wait for lock to expire
# Instance A tries to write to the resource, using the token it *thought* it had.
# This is the dangerous part: it's operating with stale information.
print("Instance A attempting to write to resource...")
write_to_resource(r, lock_name, token_a) # This write should ideally be rejected if the system is fencing-aware.
# Now, Instance B acquires the lock *after* Instance A's lock expired.
print("\nInstance B trying to acquire the lock...")
owner_b, token_b = acquire_with_fencing(r, lock_name, ttl_seconds=5)
# Instance B successfully acquires the lock and gets a *new, higher* fencing token.
# Now, Instance B can safely write to the resource.
print("Instance B attempting to write to resource...")
write_to_resource(r, lock_name, token_b)
# What if Instance A *didn't* actually crash but just had a temporary network blip,
# and now it *thinks* it still holds the lock (because its internal state says so)
# and tries to write *again*?
# If the lock service's `write_to_resource` function is fencing-aware, it will prevent this.
# The `write_to_resource` function above checks `fencing_token > stored_token`.
# Instance A's `token_a` will be less than `token_b`, so its write will be rejected.
# The core idea is that every operation that modifies the shared resource must be
# passed the fencing token that was current *at the time the lock was acquired*.
# The resource itself (or a layer protecting it) is responsible for verifying that
# this token is greater than any token previously used for a write operation.
# The lock service is responsible for generating these ever-increasing tokens.
# The lock service itself (e.g., ZooKeeper, etcd, or a Redis-based implementation)
# needs to support atomic operations for acquiring locks and managing the fencing token.
# In ZooKeeper, this is achieved via sequential ephemeral znodes. The sequence number
# of the znode serves as the fencing token. When a client acquires a lock, it creates
# a node like `/locks/mylock/lock-`, and ZooKeeper appends a sequence number, e.g.,
# `/locks/mylock/lock-0000000001`. This sequence number is the fencing token.
# If the client disconnects, its ephemeral node is deleted. A new acquisition will
# get a higher sequence number. The client writing to the resource checks if its
# sequence number is still the highest one seen for that resource.
# In etcd, the `ModRevision` of the key associated with the lock can act as a fencing token.
# When a client acquires a lock, it gets the current `ModRevision`. If the client
# crashes and comes back, it can check if the `ModRevision` of the lock key is still
# the one it recorded. If the lock expired and was re-acquired by someone else, the
# `ModRevision` will be higher.
# The most critical part is that the *client* trying to perform an action must
# communicate its fencing token to the *service* that performs the action.
# The service then uses this token to ensure it's not performing an action based on
# stale information. This prevents a crashed process from overwriting data written
# by a newly elected leader.
# The next problem you'll likely encounter is ensuring the fencing token generation
# itself is robust and doesn't suffer from race conditions, especially if you're
# implementing it yourself without a dedicated distributed coordination service.