Transaction contention occurs when multiple transactions try to access the same data at the same time and one or more of them must wait, which slows those transactions down and, in bad cases, stalls the workload that depends on them.

Let’s see a real-world example. Imagine two transactions, T1 and T2, both trying to update the same row in a users table.

-- Transaction T1
BEGIN;
UPDATE users SET balance = balance - 10 WHERE id = 123;
-- ... other operations ...
COMMIT;

-- Transaction T2
BEGIN;
UPDATE users SET balance = balance + 10 WHERE id = 123;
-- ... other operations ...
COMMIT;

If T1 starts first and acquires a lock on the row id = 123, T2 blocks when it tries to update the same row and waits until T1 commits or rolls back. If T1 is long-running, T2 stays blocked for as long as T1 holds the lock, unless a timeout such as lock_timeout or statement_timeout cancels the waiting statement first. Depending on timing, T2 can also fail with a retryable error (SQLSTATE 40001) that the client is expected to retry. This is contention.
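
One way to keep T2 from waiting indefinitely is to bound how long a statement may wait. The sketch below assumes a CockroachDB version that supports the lock_timeout session variable; the timeout values are illustrative, not recommendations.

```sql
-- Run in T2's session before the transaction starts.
-- Give up on any single lock wait after 5 seconds (illustrative value).
SET lock_timeout = '5s';
-- Cap the total runtime of any one statement (illustrative value).
SET statement_timeout = '30s';

BEGIN;
-- If T1 still holds the row lock after 5 seconds, this statement fails with
-- an error instead of blocking; the application can roll back and retry.
UPDATE users SET balance = balance + 10 WHERE id = 123;
COMMIT;
```

A timeout does not remove the contention; it only turns an open-ended wait into an error the application can handle.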

The most common cause of transaction contention is long-running transactions holding locks.

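  • Diagnosis: Run SHOW TRANSACTIONS and SHOW SESSIONS to list active transactions and the sessions that own them, and look for ones that have been open far longer than your workload expects. On versions that provide it, crdb_internal.cluster_locks shows which locks are currently held and which transactions are waiting on them.
  • Fix: Identify and terminate long-running transactions that are not making progress. Use SHOW SESSIONS to find the owning session, then end it with CANCEL SESSION '<session_id>', or stop only the running statement with CANCEL QUERY '<query_id>'.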
  • Why it works: Long-running transactions hold locks longer, increasing the probability of other transactions needing those locks. Canceling them releases the locks promptly.
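
A minimal sketch of that workflow follows. The column names are assumed from recent CockroachDB versions and may differ in yours, and session age is only a starting point for finding a stuck transaction.

```sql
-- List sessions across the cluster, oldest first.
-- Column names are assumptions based on recent CockroachDB versions.
SELECT session_id, user_name, client_address, application_name,
       session_start, active_queries
FROM [SHOW CLUSTER SESSIONS]
ORDER BY session_start;

-- End one offending session. Its open transaction is rolled back and its
-- locks are released. Replace the placeholder with a real session_id.
CANCEL SESSION '<session_id>';
```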

Another common culprit is frequent updates to the same hot rows.

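  • Diagnosis: Query crdb_internal.cluster_contended_tables and crdb_internal.cluster_contended_indexes to see which tables and indexes accumulate the most contention events, and, on versions that provide it, crdb_internal.transaction_contention_events to see which transactions waited on which keys. Recent versions of the DB Console also surface contention on the Insights and Hot Ranges pages. As a rough manual check, time a point read such as SELECT count(*) FROM your_table WHERE id = <hot_id> in a tight loop while writers are active: latency that spikes only under write load suggests the row is contended.
  • Fix: Design your schema to distribute writes across more rows. For example, split a single hot counter or balance row into several shard rows that are summed on read. For insert-heavy tables, prefer UUID primary keys over sequential ones so that new rows do not pile up on one range.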
  • Why it works: By spreading writes across more rows or even different tables, you reduce the chances of multiple transactions targeting the exact same data.
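
The row-splitting idea can be sketched as follows. The table user_balance_shards and the shard count of 8 are hypothetical choices made for illustration, not part of the original schema.

```sql
-- Hypothetical table: each user's balance is split across 8 shard rows.
CREATE TABLE user_balance_shards (
    user_id INT NOT NULL,
    shard   INT NOT NULL,
    amount  DECIMAL NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, shard)
);

-- Write path: add 10 to one randomly chosen shard (0 through 7), creating
-- the shard row on first use. Concurrent writers usually hit different rows.
INSERT INTO user_balance_shards (user_id, shard, amount)
VALUES (123, floor(random() * 8)::INT, 10)
ON CONFLICT (user_id, shard)
DO UPDATE SET amount = user_balance_shards.amount + excluded.amount;

-- Read path: the balance is the sum over all shards.
SELECT sum(amount) AS balance
FROM user_balance_shards
WHERE user_id = 123;
```

The trade-off is that every read touches several rows, and invariants such as "balance never goes negative" become harder to enforce, so this fits counters better than strict ledgers.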

A subtle but common issue is implicit transactions that run longer than expected.

  • Diagnosis: Check SHOW SESSIONS for sessions whose user_name and client_address identify an application server and whose active statement or open transaction has been running unexpectedly long. Look for transactions that a driver or ORM opened on your behalf (for example, with autocommit disabled) and that sit idle between statements while still holding locks.
  • Fix: Explicitly manage transactions using BEGIN, COMMIT, and ROLLBACK. Ensure all operations within a transaction are necessary and complete quickly.
  • Why it works: A true implicit transaction commits as soon as its single statement finishes, so its locks are held only for that statement's duration. The trouble comes from transactions you did not realize were open: if a slow statement or a network delay sits between one statement and the next, every lock taken so far stays held. Explicit transaction boundaries make that window visible and let you keep it short.
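
As a hedged illustration, the transaction below does only database work between BEGIN and COMMIT, and a session-level guard cleans up if the client stalls mid-transaction. The 10-second value is an arbitrary example.

```sql
-- Terminate this session if it sits idle inside an open transaction
-- for more than 10 seconds (illustrative value).
SET idle_in_transaction_session_timeout = '10s';

BEGIN;
-- RETURNING hands back the new balance in the same statement, so no
-- separate SELECT (and no extra round trip) is needed while locks are held.
UPDATE users SET balance = balance - 10 WHERE id = 123 RETURNING balance;
COMMIT;
```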

Deadlocks, a specific type of contention, are also a major pain point.

  • Diagnosis: CockroachDB automatically detects and resolves deadlocks by aborting one of the transactions involved. Look in your application logs for retry errors (SQLSTATE 40001), typically with a message that mentions TransactionAbortedError. On versions that provide it, crdb_internal.transaction_contention_events shows which transactions blocked which.
  • Fix: Ensure transactions acquire locks in a consistent order. For example, if two transactions update records A and B, both should always update A then B, or always B then A.
  • Why it works: Consistent lock ordering prevents circular dependencies where T1 waits for T2, and T2 waits for T1.
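
A minimal sketch of consistent ordering for a transfer between two rows. The second row, id = 456, is a hypothetical example; the point is that every transaction locks rows in ascending id order regardless of the direction of the transfer.

```sql
BEGIN;
-- Lock both rows up front, always in ascending id order.
SELECT id FROM users WHERE id IN (123, 456) ORDER BY id FOR UPDATE;

-- With both locks held, the order of the updates no longer matters.
UPDATE users SET balance = balance - 10 WHERE id = 123;
UPDATE users SET balance = balance + 10 WHERE id = 456;
COMMIT;
```

A transfer in the opposite direction would use the same SELECT ... FOR UPDATE statement and only swap the signs in the two updates.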

High read load combined with writes can exacerbate contention.

  • Diagnosis: Monitor read and write throughput and latency in the DB Console (or in crdb_internal.node_metrics), and watch for a growing number of unresolved write intents. High values, especially when correlated with application-level latency spikes, point to this.
  • Fix: Optimize read queries to be as fast as possible. Use appropriate indexes, avoid full table scans, and consider caching frequently accessed, rarely changing data outside the database.
  • Why it works: Reads can conflict with writes directly. A read that encounters an uncommitted write has to wait for the writing transaction to finish, and a read at a newer timestamp can force a concurrent writer to move to a later timestamp and possibly retry. Reads also add to the overall load, which delays writes. Faster reads shrink both effects.
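
The sketch below illustrates two of these options. The email column and the index name are hypothetical, and the historical read is an extra CockroachDB feature that only suits data where slightly stale results are acceptable.

```sql
-- Hypothetical secondary index so lookups by email avoid a full table scan.
-- STORING (balance) lets the query be answered from the index alone.
CREATE INDEX IF NOT EXISTS users_email_idx ON users (email) STORING (balance);

-- Confirm the plan uses the index rather than a full scan.
EXPLAIN SELECT balance FROM users WHERE email = 'a@example.com';

-- Historical read: returns the row as it was 10 seconds ago, so the read
-- does not wait on, or interfere with, in-flight writes to that row.
SELECT balance FROM users AS OF SYSTEM TIME '-10s' WHERE id = 123;
```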

Insufficiently sized or configured nodes can lead to general performance degradation that manifests as contention.

  • Diagnosis: Check node resource utilization (CPU, memory, network, disk I/O) using the DB Console's Hardware dashboard, crdb_internal.node_metrics, and your infrastructure monitoring tools. Look for sustained high CPU or disk latency.
  • Fix: Scale out by adding nodes, scale up by increasing per-node resources, or improve the disk configuration (e.g., faster SSDs).
  • Why it works: When nodes are overloaded, all operations, including lock acquisition and release, take longer, increasing the window for contention to occur.
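
For a quick look from a SQL shell, the node's own metrics can be queried directly. This is a sketch only: the metric name prefixes below are assumptions that vary by version, and the table reports values for the node you are connected to.

```sql
-- CPU and memory metrics for the node this session is connected to.
-- The 'sys.cpu' and 'sys.rss' prefixes are assumed; list all names with
-- SELECT DISTINCT name FROM crdb_internal.node_metrics if they differ.
SELECT name, value
FROM crdb_internal.node_metrics
WHERE name LIKE 'sys.cpu%' OR name LIKE 'sys.rss%'
ORDER BY name;
```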

Finally, network latency between nodes affects how quickly writes are replicated and acknowledged, and therefore how long locks are held.

  • Diagnosis: Use tools like ping and traceroute between your CockroachDB nodes to check for high latency or packet loss. Monitor network traffic on your nodes.
  • Fix: Improve network connectivity between nodes. Ensure nodes are in the same low-latency network zone if possible.
  • Why it works: Increased latency means it takes longer for lock requests to reach their destination and for acknowledgments to return, effectively slowing down transaction processing and increasing the chance of conflicts.

After resolving transaction contention, you might encounter connection refused errors if your application is not properly configured to handle transient network issues or retries.
