A CockroachDB sequence is failing because the underlying Raft consensus group for its table is experiencing high latency, preventing it from reliably issuing new unique identifiers.
Cause 1: High Network Latency/Packet Loss
Diagnosis:
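Run cockroach demo --nodes 3 --demo-locality=region=us-east1:region=us-central1:region=us-west1 to simulate a multi-region cluster (the flag takes one colon-separated locality per node). Then execute SHOW RANGES FROM TABLE system.sequence_data; and note the leaseholder and replica localities for the sequence’s range; on v23.1 and later, append WITH DETAILS to include the leaseholder columns. The Raft leader normally sits on the same node as the leaseholder. If the leaseholder is in a different region from most of the other replicas, or if it moves frequently between runs, network issues are likely. You can also run ping <node-ip> from one node to another to measure round-trip time and packet loss.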
Fix:
Ensure your nodes are in low-latency network environments. For multi-region deployments, co-locate sequence tables with their primary consumers or use a single-region cluster for sequences if possible. Note that cockroach start has no raft-election-timeout flag and CockroachDB does not read a cockroach.yaml file; Raft election timing is an internal setting, so treat changing it as a last resort to be coordinated with Cockroach Labs support rather than a routine tuning knob.
Why it works: Raft requires a quorum of nodes to agree on state changes. High network latency or packet loss between nodes slows down these acknowledgments, making it difficult to maintain a stable leader and achieve consensus, thus delaying sequence increments.
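To make the quorum argument concrete, here is a minimal Python sketch that estimates how long a leader waits for a write to commit, given round-trip times to its followers. It assumes the leader's own vote is immediate and ignores disk and processing time; the function name and the RTT figures are illustrative and are not taken from CockroachDB.

```python
def commit_wait_ms(follower_rtts_ms: list[float]) -> float:
    """Estimate how long the leader waits for a quorum of acknowledgments.

    A majority is num_replicas // 2 + 1 votes. The leader supplies one of
    them, so it waits for the (num_replicas // 2)-th fastest follower.
    """
    num_replicas = len(follower_rtts_ms) + 1  # followers plus the leader
    acks_needed = num_replicas // 2
    if acks_needed == 0:
        return 0.0  # single-replica group: nothing to wait for
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Hypothetical RTTs: one follower in the leader's region, one far away.
one_nearby = commit_wait_ms([2.0, 70.0])
# Hypothetical RTTs: both followers are in remote regions.
all_remote = commit_wait_ms([65.0, 70.0])
print(f"one nearby follower: {one_nearby} ms, all remote: {all_remote} ms")
```

With one follower close to the leader, the slow link is off the critical path; once every follower is remote, each sequence increment pays at least one cross-region round trip.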
Cause 2: Insufficient Node Resources (CPU/Memory)
Diagnosis: Monitor node CPU and memory utilization via your cluster’s observability tools (e.g., Prometheus, Grafana, or the DB Console’s "Metrics" tab). Look for sustained CPU usage above 80% or memory usage consistently near capacity.
Fix:
Scale up your nodes by increasing their CPU or RAM. For example, if using Docker, you might change docker run ... --cpus="2" --memory="4g" to --cpus="4" --memory="8g". Alternatively, add more nodes to distribute the load.
Why it works: A busy node struggles to process Raft heartbeats and log entries promptly, leading to timeouts and leader instability for any Raft group it participates in, including the sequence’s.
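The word "sustained" matters here: a brief spike rarely destabilizes a Raft group, while a long plateau can. The sketch below shows one way to tell the two apart in exported utilization samples. The 80% threshold comes from the diagnosis above; the five-sample window and the sample values are assumptions chosen for illustration.

```python
def sustained_above(samples: list[float], threshold: float = 80.0,
                    min_run: int = 5) -> bool:
    """Return True if at least min_run consecutive samples exceed threshold."""
    run = 0
    for value in samples:
        # Extend the current run, or reset it when a sample drops back down.
        run = run + 1 if value > threshold else 0
        if run >= min_run:
            return True
    return False


# Hypothetical per-minute CPU percentages for two nodes.
plateau = [55.0, 83.0, 91.0, 88.0, 95.0, 86.0, 60.0]
spikes = [40.0, 95.0, 42.0, 97.0, 41.0]
print(sustained_above(plateau), sustained_above(spikes))
```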
Cause 3: Hotspotting on the Sequence Table
Diagnosis:
In the DB Console, navigate to "Database -> system -> system.sequence_data" and examine the "Ranges" view, or open the "Hot Ranges" page (available in v22.1 and later). If a single range (or a very small number of ranges) shows disproportionately high queries per second and request latency compared to the others, you have a hotspot.
Fix:
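CockroachDB splits ranges automatically, both by size and (since v19.1) by load. If splitting isn’t keeping up, you can manually split the system.sequence_data table’s range. This is generally discouraged for system tables but can be a temporary workaround: ALTER TABLE system.sequence_data SPLIT AT VALUES (some_value);. Keep in mind that a split only helps when the load is spread over several keys; a single hot sequence is stored under one key and cannot be split further. More practically, ensure your application isn’t generating an overwhelming rate of sequence requests from a single client or region, and consider creating the sequence with the CACHE option so that each session reserves a block of values per round trip.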
Why it works: A hotspot means all sequence increments are hitting a single Raft group. Splitting the range distributes requests for different keys across multiple Raft groups, and lowering the request rate reduces the load on the one group that remains hot.
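"Disproportionately high" can be made concrete by comparing each range's share of the total request rate. The following sketch assumes you have copied per-range queries-per-second figures out of the DB Console; the range IDs, the QPS values, and the 50% threshold are all hypothetical.

```python
def hot_ranges(qps_by_range: dict[int, float],
               share_threshold: float = 0.5) -> list[int]:
    """Return IDs of ranges that serve more than share_threshold of total QPS."""
    total = sum(qps_by_range.values())
    if total == 0:
        return []  # idle cluster: nothing can be hot
    return sorted(
        range_id
        for range_id, qps in qps_by_range.items()
        if qps / total > share_threshold
    )


# Hypothetical readings: range 102 absorbs nearly all of the traffic.
skewed = {101: 12.0, 102: 950.0, 103: 8.0}
balanced = {101: 100.0, 102: 110.0, 103: 90.0}
print(hot_ranges(skewed), hot_ranges(balanced))
```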
Cause 4: Long-Running Transactions Holding Locks
Diagnosis:
In the DB Console, go to "SQL Activity -> Transactions". Filter for "Active" transactions. Look for any transactions that have been running for an unusually long time, especially those that might be interacting with system.sequence_data indirectly. Check the "Contention" metrics for high max_txns_contended or max_age_of_contended_txns.
Fix:
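Identify and terminate long-running transactions that are blocking others. Review application logic to ensure transactions are short and avoid holding locks longer than necessary. CockroachDB has no statement for cancelling a transaction by ID; instead, find the offending statement or session with SHOW CLUSTER QUERIES; or SHOW CLUSTER SESSIONS; and run CANCEL QUERY '<query_id>'; or CANCEL SESSION '<session_id>';.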
Why it works: Long-running transactions can hold locks that block writes to system.sequence_data. Blocked writers wait at the leaseholder until the lock is released, before their writes are even proposed to Raft, so sequence-dependent statements stall and time out in the same way they do when the Raft group itself is slow.
Cause 5: Disk I/O Saturation
Diagnosis:
Monitor disk I/O utilization on your nodes. Look for high disk latency, high IOPS, or disk queue lengths consistently exceeding thresholds. Tools like iostat -xz 1 on Linux can show this.
Fix: Upgrade to faster storage (e.g., SSDs instead of HDDs), provision more IOPS if using cloud storage, or distribute the load across more nodes.
Why it works: Raft relies on writing to its log before acknowledging operations. Slow disk I/O delays these writes, causing Raft heartbeats and log replication to fall behind, leading to timeouts and potential leader loss.
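Because each acknowledgment waits on a durable log write, synchronous write latency is the number to watch. The sketch below is a rough probe that times small append-and-fsync cycles on a temporary file. It is not CockroachDB's actual write path, and the write count, block size, and target directory are assumptions; point it at a directory on the same volume as the node's store for a meaningful reading.

```python
import os
import statistics
import tempfile
import time


def fsync_latencies_ms(directory: str, writes: int = 50,
                       block_size: int = 4096) -> list[float]:
    """Time `writes` small append-and-fsync cycles on a temp file in `directory`."""
    payload = b"\0" * block_size
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(writes):
            start = time.perf_counter()
            f.write(payload)
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to persist it to the device
            latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


if __name__ == "__main__":
    # gettempdir() is only a placeholder; it may be memory-backed.
    samples = fsync_latencies_ms(tempfile.gettempdir())
    print(f"median {statistics.median(samples):.2f} ms, "
          f"max {max(samples):.2f} ms")
```

If the median is a large fraction of the interval between Raft heartbeats, the disk alone can make followers appear unresponsive.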
Cause 6: Excessive Replication Factor for System Tables
Diagnosis:
Run SHOW RANGES FROM TABLE system.sequence_data; and count the entries in the replicas column for the relevant range. The cluster-wide default is 3 replicas, although the predefined zone configuration for the system database uses 5. If the count has been raised beyond what your topology needs (for example, for redundancy that is now causing issues), it could be a factor.
Fix:
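Lower the replication factor for the system.sequence_data table. This is typically done by altering the database’s replication zone: ALTER DATABASE system CONFIGURE ZONE USING num_replicas = 3;. No node restart is needed; zone configuration changes propagate on their own and the cluster removes the surplus replicas in the background. Bear in mind that fewer replicas also means the system ranges tolerate fewer simultaneous node failures.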
Why it works: A higher replication factor means more nodes must acknowledge writes, increasing the probability of timeouts and leader instability in a distributed, high-latency environment.
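The arithmetic behind this trade-off is simple majority math. The sketch below tabulates, for a few replication factors, how many votes form a quorum, how many follower acknowledgments the leader must collect, and how many replica failures the group survives. The helper names are illustrative.

```python
def quorum_size(num_replicas: int) -> int:
    """Smallest majority of a Raft group with num_replicas voters."""
    return num_replicas // 2 + 1


def follower_acks_needed(num_replicas: int) -> int:
    """Acknowledgments the leader needs from followers (its own vote counts)."""
    return quorum_size(num_replicas) - 1


def failures_tolerated(num_replicas: int) -> int:
    """Replicas that can be lost while a quorum is still reachable."""
    return num_replicas - quorum_size(num_replicas)


for n in (3, 5, 7):
    print(f"replicas={n} quorum={quorum_size(n)} "
          f"follower_acks={follower_acks_needed(n)} "
          f"failures_tolerated={failures_tolerated(n)}")
```

Going from 3 to 5 replicas doubles the follower acknowledgments on the write path from 1 to 2, which is the cost paid for surviving a second failure.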
The next error you’ll likely encounter is a context deadline exceeded or context canceled error when trying to perform operations that rely on sequence generation, such as inserting rows with auto-generated primary keys.