Transaction contention occurs when multiple transactions try to access the same data at the same time and one or more of them must wait, which slows those transactions down and, in bad cases, stalls the workload that depends on them.
Let’s see a real-world example. Imagine two transactions, T1 and T2, both trying to update the same row in a users table.
```sql
-- Transaction T1
BEGIN;
UPDATE users SET balance = balance - 10 WHERE id = 123;
-- ... other operations ...
COMMIT;

-- Transaction T2
BEGIN;
UPDATE users SET balance = balance + 10 WHERE id = 123;
-- ... other operations ...
COMMIT;
```
If T1 starts first and acquires a lock on the row `id = 123`, T2 blocks when it tries to update the same row and waits until T1 commits or rolls back. If T1 is long-running, T2 stays blocked for as long as T1 holds the lock. It fails only if something cuts the wait short, such as a statement or lock timeout configured on the session, or a retry error that the client must handle. This is contention.
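
The blocking behavior described above can be imitated in a few lines of Python. This is only a toy model under stated assumptions: a `threading.Lock` stands in for the row lock on `id = 123`, leaving the `with` block stands in for `COMMIT`, and the function name `simulate_contention` is made up for illustration. It does not talk to a database.

```python
import threading
import time


def simulate_contention(hold_seconds=0.2):
    """Return how long a second 'transaction' waits for the first one's row lock."""
    row_lock = threading.Lock()        # stands in for the lock on row id = 123
    t1_has_lock = threading.Event()    # lets T2 start only after T1 holds the lock
    waited = {}

    def t1():
        with row_lock:                 # T1's UPDATE acquires the row lock
            t1_has_lock.set()
            time.sleep(hold_seconds)   # "... other operations ..."
        # Leaving the block models COMMIT, which releases the lock.

    def t2():
        t1_has_lock.wait()             # guarantee that T1 got there first
        started = time.monotonic()
        with row_lock:                 # T2's UPDATE blocks here until T1 "commits"
            waited["t2"] = time.monotonic() - started

    threads = [threading.Thread(target=t1), threading.Thread(target=t2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return waited["t2"]
```

Calling `simulate_contention(0.2)` returns a wait close to the time T1 held the lock, which is the point: T2's latency is set by T1's duration, not by its own work.
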
The most common cause of transaction contention is long-running transactions holding locks.
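- Diagnosis: Use `SHOW TRANSACTIONS` (or `SHOW CLUSTER TRANSACTIONS` for all nodes) to list active transactions, and `SHOW SESSIONS` to see which session and client each belongs to. Look for transactions that have been open far longer than the rest of the workload.
- Fix: Identify long-running transactions that are not making progress and end them. CockroachDB cancels work by query or by session rather than by transaction ID: `CANCEL QUERY '<query_id>'` stops a running statement, and `CANCEL SESSION '<session_id>'` ends the session and rolls back its open transaction.
- Why it works: Long-running transactions hold locks longer, increasing the probability that other transactions will need those locks. Canceling them releases the locks promptly.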
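
The "find the long-running ones" step can be sketched as a small filter over rows you have already fetched. The row shape here (dicts with an `id` and a `start` timestamp) and the function name `long_running` are assumptions made for illustration; adapt them to whatever your monitoring query actually returns.

```python
from datetime import datetime, timedelta


def long_running(txns, now: datetime, threshold=timedelta(seconds=30)):
    """Return (age, txn) pairs for transactions older than threshold, oldest first.

    Each txn is assumed to be a dict with a 'start' datetime (hypothetical shape).
    """
    old = [(now - txn["start"], txn) for txn in txns if now - txn["start"] > threshold]
    return sorted(old, key=lambda pair: pair[0], reverse=True)
```

Reviewing the oldest candidates first keeps the cancellation decision manual, which is usually safer than ending sessions automatically.
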
Another frequent culprit is frequent updates to the same hot rows.
- Diagnosis: Look for specific rows or ranges that receive a disproportionate share of reads and writes. The DB Console's Hot Ranges page and, depending on your version, tables such as `crdb_internal.cluster_contended_keys` and `crdb_internal.transaction_contention_events` show where transactions are waiting. Timing a simple `SELECT count(*) FROM your_table WHERE id = <hot_id>` in a tight loop while the write workload runs can also reveal elevated latency on that row.
- Fix: Design your schema to distribute writes across more rows. For example, split a single hot row into several rows keyed by a shard number, or use UUID primary keys instead of sequential IDs so that writes do not all land in the same place.
- Why it works: By spreading writes across more rows or even different tables, you reduce the chances of multiple transactions targeting the exact same data.
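
One way to picture the "spread writes across more rows" fix is a sharded counter. The sketch below is an in-memory model, not database code: each list slot stands in for one row, and the class name `ShardedBalance` and its parameters are invented for this example.

```python
import random


class ShardedBalance:
    """Model of one logical balance stored as several shard rows."""

    def __init__(self, num_shards=8, rng=None):
        self.shards = [0] * num_shards        # one "row" per shard
        self.rng = rng or random.Random()

    def add(self, amount):
        """Apply a delta to one randomly chosen shard and return its index."""
        shard = self.rng.randrange(len(self.shards))
        self.shards[shard] += amount
        return shard

    def total(self):
        """Reading the balance means summing every shard."""
        return sum(self.shards)
```

The trade-off is visible in `total()`: writes rarely collide, but every read has to touch all shards, so this suits values that are written far more often than they are read.
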
A subtle but common issue is transactions that are opened implicitly and run longer than expected.
- Diagnosis: Check `SHOW SESSIONS` for sessions whose `client_address` belongs to an application server and that have had a transaction open for an unexpectedly long time, often while idle between statements. Also look for slow single statements: a statement run outside an explicit transaction is its own implicit transaction and holds its locks until it finishes.
- Fix: Explicitly manage transaction boundaries with `BEGIN`, `COMMIT`, and `ROLLBACK` instead of relying on driver or ORM defaults. Ensure all operations within a transaction are necessary and complete quickly.
- Why it works: A single-statement implicit transaction commits as soon as its statement finishes, so its locks last only as long as that statement runs. A transaction that a driver or ORM opens on your behalf stays open until the application commits or rolls back, so network delays and application work between statements extend the time locks are held. Explicit transactions give you control over that window.
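
Explicit transaction boundaries can be sketched with a small context manager. To keep the example runnable anywhere it uses Python's built-in `sqlite3` as a stand-in; SQLite's locking is not CockroachDB's, and the helper names `transaction`, `make_db`, and `debit` are made up for illustration. The pattern is what matters: do slow work first, then open the transaction as late as possible and close it as early as possible.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Issue BEGIN, then COMMIT on success or ROLLBACK on any exception."""
    conn.execute("BEGIN")
    try:
        yield conn
    except Exception:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


def make_db():
    # isolation_level=None puts sqlite3 in autocommit mode, so the only
    # transactions are the ones we open explicitly with BEGIN.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO users VALUES (123, 100)")
    return conn


def debit(conn, user_id, amount):
    # Validation and other slow work happen before the transaction opens.
    if amount <= 0:
        raise ValueError("amount must be positive")
    with transaction(conn):
        conn.execute(
            "UPDATE users SET balance = balance - ? WHERE id = ?",
            (amount, user_id),
        )
```
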
Deadlocks, a specific type of contention, are also a major pain point.
- Diagnosis: CockroachDB automatically detects and resolves deadlocks by aborting one of the transactions involved. Look for "transaction aborted" errors in your application logs; they reach the client as retry errors with SQLSTATE `40001`. On versions that provide it, `crdb_internal.transaction_contention_events` records which transactions blocked which.
- Fix: Ensure transactions acquire locks in a consistent order. For example, if two transactions update records A and B, both should always update A then B, or always B then A.
- Why it works: Consistent lock ordering prevents circular dependencies where T1 waits for T2, and T2 waits for T1.
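
Consistent lock ordering can be demonstrated with ordinary thread locks. In this sketch each `threading.Lock` stands in for a row lock, and the `Accounts` class and `run_opposing_transfers` helper are hypothetical names. Sorting the two keys before locking means a transfer from A to B and a transfer from B to A both lock A first, so neither can hold one lock while waiting on the other.

```python
import threading


class Accounts:
    def __init__(self, balances):
        self.balances = dict(balances)
        self.locks = {key: threading.Lock() for key in balances}

    def transfer(self, src, dst, amount):
        if src == dst:
            return                              # nothing to do, and avoids locking one key twice
        first, second = sorted((src, dst))      # always lock the smaller key first
        with self.locks[first], self.locks[second]:
            self.balances[src] -= amount
            self.balances[dst] += amount


def run_opposing_transfers(rounds=500):
    """Run A->B and B->A transfers concurrently; return (balances, finished)."""
    accounts = Accounts({"A": 100, "B": 100})

    def worker(src, dst):
        for _ in range(rounds):
            accounts.transfer(src, dst, 1)

    threads = [
        threading.Thread(target=worker, args=("A", "B"), daemon=True),
        threading.Thread(target=worker, args=("B", "A"), daemon=True),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join(timeout=10)
    finished = not any(thread.is_alive() for thread in threads)
    return accounts.balances, finished
```

If `transfer` locked `src` before `dst` without sorting, the two workers could each take one lock and wait forever on the other, which is the circular wait that the ordering rule removes.
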
High read load combined with writes can exacerbate contention.
- Diagnosis: Monitor read and write throughput per node, for example in the DB Console or `crdb_internal.node_metrics`, and watch for a growing number of unresolved write intents. High values, especially when correlated with application-level latency spikes, point to this.
- Fix: Optimize read queries to be as fast as possible. Use appropriate indexes, avoid full table scans, and consider caching frequently accessed, rarely changing data outside the database.
- Why it works: Reads that do not conflict directly still compete with writes for CPU, disk, and network, so every transaction takes longer and holds its locks longer. A read that touches a row with an uncommitted write may also have to wait for the writing transaction to finish.
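
The caching suggestion can be sketched as a small time-to-live cache in front of a loader function. Everything here is illustrative: `TTLCache`, its `loader` callback (which would run the real query), and the injectable `clock` are assumptions, and a production cache would also need size limits and invalidation.

```python
import time


class TTLCache:
    """Cache loader(key) results for ttl_seconds to avoid repeated reads."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self.loader = loader          # function that fetches the value, e.g. from the database
        self.ttl = ttl_seconds
        self.clock = clock            # injectable so expiry can be tested without sleeping
        self._entries = {}            # key -> (expires_at, value)

    def get(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]           # still fresh: no database read
        value = self.loader(key)      # missing or expired: read through
        self._entries[key] = (now + self.ttl, value)
        return value
```

A short TTL bounds how stale a value can be, which is why this fits data that changes rarely; it is a poor fit for balances or anything that must be read at its latest committed value.
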
Insufficiently sized or configured nodes can lead to general performance degradation that manifests as contention.
- Diagnosis: Check node resource utilization (CPU, memory, network, disk I/O) using the DB Console's hardware metrics and your infrastructure monitoring tools. Look for sustained high CPU or disk latency.
- Fix: Scale out by adding nodes, scale up by giving each node more resources, or improve the disk configuration (e.g., faster SSDs).
- Why it works: When nodes are overloaded, all operations, including lock acquisition and release, take longer, increasing the window for contention to occur.
Finally, network latency between nodes can impact how quickly locks are propagated and acknowledged.
- Diagnosis: Use tools like `ping` and `traceroute` between your CockroachDB nodes to check for high latency or packet loss. Monitor network traffic on your nodes.
- Fix: Improve network connectivity between nodes. Ensure nodes are in the same low-latency network zone if possible.
- Why it works: Increased latency means it takes longer for lock requests to reach their destination and for acknowledgments to return, effectively slowing down transaction processing and increasing the chance of conflicts.
After resolving transaction contention, you might encounter connection refused errors if your application is not properly configured to handle transient network issues or retries.
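
Retry handling for transient failures can be sketched as a loop with exponential backoff and jitter. The `TransientError` class is a hypothetical stand-in for whatever your driver raises for retryable conditions (for CockroachDB, typically errors with SQLSTATE `40001`, or a refused connection); `run_with_retry` and its parameters are likewise made up for this example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable driver error."""


def run_with_retry(op, max_attempts=5, base_delay=0.05,
                   sleep=time.sleep, rng=random.random):
    """Call op() until it succeeds, retrying TransientError with backoff.

    The final failure is re-raised so the caller can see it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt; jitter keeps clients from retrying in lockstep.
            sleep(base_delay * (2 ** (attempt - 1)) * (0.5 + rng()))
```

The operation passed in should contain the whole transaction, from `BEGIN` to `COMMIT`, so that a retry starts from a clean state rather than resuming halfway through.
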