CockroachDB is failing because its nodes' clocks disagree by more than the cluster tolerates. CockroachDB relies on loosely synchronized clocks to order transactions consistently across nodes.
Cause 1: NTP Daemon Not Running
Diagnosis:
Check the status of the chronyd or ntpd service on each node.
sudo systemctl status chronyd
# or
sudo systemctl status ntpd
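Fix: If the service is inactive, start and enable it. Run only one of the two daemons on a node, since both try to control the clock and bind UDP port 123.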
sudo systemctl start chronyd
sudo systemctl enable chronyd
# or
sudo systemctl start ntpd
sudo systemctl enable ntpd
This ensures that the system’s clock is actively synchronized with external NTP servers.
Cause 2: NTP Server Misconfiguration
Diagnosis:
Inspect the NTP client configuration file for incorrect or unreachable server entries.
For chronyd: /etc/chrony/chrony.conf (Debian/Ubuntu) or /etc/chrony.conf (RHEL/CentOS/Fedora)
For ntpd: /etc/ntp.conf
Example chrony.conf snippet to check:
# Look for lines like this, ensure servers are reachable and valid
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
Fix: Replace invalid or unreachable servers with known good ones, ideally geographically diverse. Ensure your firewall allows outbound UDP traffic on port 123.
# Example modification in /etc/chrony/chrony.conf
server ntp.ubuntu.com iburst
server pool.ntp.org iburst
After editing, restart the NTP service:
sudo systemctl restart chronyd
This forces the client to use reliable time sources, improving synchronization accuracy.
Cause 3: Firewall Blocking NTP Traffic
Diagnosis:
Verify that UDP port 123 (NTP) is open for outbound connections from your CockroachDB nodes to your NTP servers.
Use tcpdump on a node to see if NTP packets are leaving.
sudo tcpdump -i any udp port 123 -n
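If requests leave the node and replies come back from port 123, the firewall is not the issue. If requests leave but no replies arrive, something between the node and the NTP server is dropping the traffic. If no packets appear at all, either the daemon is not running (Cause 1) or a local outbound rule is dropping them before they reach the interface.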
Fix:
Configure your firewall (e.g., iptables, firewalld, or cloud provider security groups) to allow outbound UDP traffic on port 123.
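For firewalld (outbound traffic is allowed by default; the ntp service rule opens inbound UDP 123, which matters only if the node also serves time to other hosts):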
sudo firewall-cmd --add-service=ntp --permanent
sudo firewall-cmd --reload
For iptables:
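sudo iptables -A OUTPUT -p udp --dport 123 -j ACCEPT
# Replies must also be accepted on INPUT, typically via an existing ESTABLISHED,RELATED conntrack rule
# Save rules if necessary, e.g., with iptables-persistent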
Allowing NTP traffic ensures that time synchronization packets can reach the external servers and return.
Cause 4: Insufficient NTP Server Reachability/Quorum
Diagnosis:
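Check the output of chronyc sources or ntpq -p for the status of your configured NTP servers. In chronyc output the first character of each row is the source mode (^ for a server) and the second is its state: * marks the source currently selected for synchronization, + a source combined with it, - a usable source left out of the combination, ? a source that is unreachable or not yet measured, x a falseticker (its time disagrees with the majority), and ~ a source whose time is too variable. If no source shows * and most are x or ?, you have a problem.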
chronyc sources
Example of bad output:
210 Number of sources = 4
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? 192.168.1.1                   2   6     0     -     +0ns[   +0ns] +/-    0ns
^? 192.168.1.2                   2   6     0     -     +0ns[   +0ns] +/-    0ns
^x 1.pool.ntp.org                2   6     3     6   -250ms[ -250ms] +/-   20ms
^x 2.pool.ntp.org                2   6     3     8   -300ms[ -300ms] +/-   25ms
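To make this check scriptable, one option is to count the sources chronyd currently treats as usable. The following is a minimal sketch rather than anything shipped with chrony: the function name count_usable_sources is hypothetical, and it assumes each source row begins with the two-character mode/state prefix that chronyc sources prints.

```shell
# count_usable_sources: reads `chronyc sources` output on stdin and prints
# how many sources are selected (*) or combined (+).
count_usable_sources() {
    awk '
        # A source row starts with a mode character (^, = or #)
        # immediately followed by a one-character state.
        /^[=#^][*+]/ { usable++ }
        END { print usable + 0 }
    '
}

# Usage on a live node (requires a running chronyd):
#   chronyc -n sources | count_usable_sources
```

A result of 0 means the node has no source it can synchronize to, which matches the bad output shown above.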
Fix:
Configure at least three or four NTP servers so the client can outvote a single bad source and still have a fallback when one becomes unreachable. Use a mix of local (if available and reliable) and geographically diverse public NTP servers from pools like pool.ntp.org.
# In /etc/chrony/chrony.conf
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
server 3.pool.ntp.org iburst
# The local clock can act as a last-resort fallback, but it is usually not recommended for critical sync
# local stratum 10   # chrony syntax; "server 127.127.1.0" is the ntpd equivalent
Restart the NTP service after changes. Having more reliable time sources increases the chance of achieving a stable and accurate time synchronization.
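A quick way to confirm that a node really has enough sources configured is to count the active server and pool lines in its config file. This is only a sketch under stated assumptions: count_time_sources is a hypothetical helper name, the config path varies by distribution, and a single pool line is counted once even though chronyd expands it into several servers.

```shell
# count_time_sources FILE: prints the number of uncommented "server" or
# "pool" directives in a chrony-style configuration file.
count_time_sources() {
    awk '$1 == "server" || $1 == "pool" { n++ } END { print n + 0 }' "$1"
}

# Usage (the path is an assumption; adjust it for your distribution):
#   count_time_sources /etc/chrony/chrony.conf
```

If the count is below three and none of the entries is a pool, the node cannot outvote a misbehaving source.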
Cause 5: NTP Daemon Over-Reliance on Specific Servers
Diagnosis:
Examine the chronyc sources or ntpq -p output. Exactly one source carries ^* at any time, which is normal. The risk is a single ^* source while every other source shows ? or x: if that one server fails or drifts, the node has nothing to fall back on.
Fix:
chronyd already switches sources on its own when the selected one becomes unreachable or starts disagreeing with the rest, so the primary fix is ensuring multiple good sources (Cause 4). If you need to tune selection, chrony's minsources directive sets how many selectable sources are required before the clock is updated, and maxdistance sets the largest root distance a source may have and still be selectable.
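In chrony.conf, also check the makestep directive, which controls when chronyd steps the clock instead of slewing it gradually:
# Step the clock if the offset exceeds 1 second, but only during the first 3 updates
makestep 1.0 3
Restart the NTP service. With this setting a large offset at startup is corrected immediately rather than slewed away over a long period.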
Cause 6: System Clock Drift Exceeding CockroachDB’s Tolerance
Diagnosis:
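CockroachDB's default maximum clock offset (--max-offset) is 500ms. A node shuts itself down when its clock differs from at least half of the other nodes by more than 80% of that value (400ms by default). For a rough check of the difference between nodes, use date; chronyc tracking on each node reports its offset from the NTP source more precisely.
# On node 1
date +%s.%N
# On node 2
date +%s.%N
# Calculate the difference
If the difference consistently approaches or exceeds 400ms, allowing for the delay between running the two commands, NTP is not keeping the clocks close enough.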
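The subtraction can be scripted rather than done by hand. The sketch below assumes the two timestamps were captured with date +%s.%N as shown above; skew_ms and skew_exceeds are hypothetical helper names, and the limit is passed in explicitly so you can check against whatever tolerance your cluster uses.

```shell
# skew_ms T1 T2: prints the absolute difference between two epoch
# timestamps (seconds with a fractional part) rounded to whole milliseconds.
skew_ms() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        d = (a - b) * 1000
        if (d < 0) d = -d
        printf "%d\n", d + 0.5
    }'
}

# skew_exceeds T1 T2 LIMIT_MS: exits 0 when the skew is above LIMIT_MS.
skew_exceeds() {
    [ "$(skew_ms "$1" "$2")" -gt "$3" ]
}

# Usage with timestamps collected from two nodes:
#   skew_exceeds "$t_node1" "$t_node2" 400 && echo "clock skew too high"
```

Timestamps gathered over SSH include network and command latency, so treat the result as an upper bound rather than an exact measurement.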
Fix:
Ensure your NTP client is configured to synchronize frequently. For chronyd, the default poll interval is usually acceptable, but you can explicitly set it lower for critical systems if needed.
# In /etc/chrony/chrony.conf
# minpoll and maxpoll are powers of two in seconds: minpoll 4 = 16 s, maxpoll 10 = 1024 s
# chrony's defaults are minpoll 6 (64 s) and maxpoll 10 (1024 s)
# Adjust these carefully; polling too often loads the servers and is discouraged for public pools
# Consider tuning only if drift is persistent.
# server ntp.example.com iburst minpoll 4 maxpoll 10
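Because minpoll and maxpoll are exponents rather than seconds, it can help to convert them before editing the config. This is a small illustrative helper, with poll_seconds as a hypothetical name; it assumes a non-negative exponent, although chrony also accepts negative values for sub-second polling.

```shell
# poll_seconds N: prints the polling interval in seconds for poll
# exponent N (the interval is 2 to the power N).
poll_seconds() {
    echo $((1 << $1))
}

# Usage:
#   poll_seconds 4    # interval for "minpoll 4"
#   poll_seconds 10   # interval for "maxpoll 10"
```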
The most common fix is ensuring robust NTP server configuration (Causes 2-4) rather than aggressive client polling. A well-synced system clock will remain within CockroachDB’s acceptable skew limits.
Once clock skew is fixed, the next error you may encounter is a transaction retry error caused by contention between concurrent transactions.