CockroachDB’s crdb_internal schema is a goldmine for understanding what’s happening under the hood, but it’s not always obvious how to use it effectively.
Let’s say you’re trying to figure out why a specific query is slow, or why a certain transaction is getting stalled. You can dive into crdb_internal.cluster_transactions to see active transactions, or crdb_internal.node_liveness to check if nodes are healthy.
Here’s a quick look at crdb_internal.node_liveness in action. Imagine you run this on one of your nodes:
SELECT node_id, num_heartbeats, last_heartbeat, is_available
FROM crdb_internal.node_liveness
WHERE node_id = 1;
You might see output like this:
 node_id | num_heartbeats |      last_heartbeat       | is_available
---------+----------------+---------------------------+--------------
       1 |           1234 | 2023-10-27 10:30:00+00:00 | t
(1 row)
This tells you node 1 is alive and kicking. If is_available were f (false), that would be your first clue that something is wrong with that specific node.
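Now, let’s say you’re investigating a slow query. You can use crdb_internal.statement_statistics to get aggregated performance metrics for statements. This table collects statistics about statement execution, including latency, rows read, and a sampled plan, grouped by statement fingerprint and aggregation interval.
To see the 5 slowest statement fingerprints by mean service latency over the last hour, you’d run something like this (exact column names and JSON keys can differ between CockroachDB versions):
SELECT
  metadata->>'query' AS query,
  (statistics->'statistics'->'svcLat'->>'mean')::FLOAT AS avg_latency_seconds,
  (statistics->'statistics'->>'cnt')::INT AS execution_count
FROM
  crdb_internal.statement_statistics
WHERE
  aggregated_ts >= now() - INTERVAL '1 hour'
ORDER BY
  avg_latency_seconds DESC
LIMIT 5;
This query helps pinpoint which statements are consistently taking too long. The svcLat entry inside the statistics JSONB column is the key here: its mean is the average time, in seconds, spent servicing executions of that statement fingerprint within one aggregation interval.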
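When the same statement shows up in several result rows, it can help to combine those rows client-side before ranking. The following is a minimal sketch, not part of CockroachDB or any driver: it assumes the fetched rows have already been reduced to hypothetical (query, mean_latency_seconds, execution_count) tuples, with a count of 1 for a row that represents a single execution, and it weights each mean by its execution count.

```python
from collections import defaultdict


def slowest_queries(rows, limit=5):
    """Rank queries by execution-weighted mean latency, slowest first.

    rows: iterable of (query, mean_latency_seconds, execution_count) tuples.
    This tuple shape is an assumption made for illustration.
    """
    total_time = defaultdict(float)  # query -> summed latency in seconds
    total_count = defaultdict(int)   # query -> number of executions
    for query, mean_latency, count in rows:
        total_time[query] += mean_latency * count
        total_count[query] += count

    ranked = [
        (query, total_time[query] / total_count[query], total_count[query])
        for query in total_time
        if total_count[query] > 0  # skip rows that report zero executions
    ]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical sample data: the first statement appears in two rows.
sample_rows = [
    ("SELECT * FROM orders WHERE id = _", 0.5, 10),
    ("SELECT * FROM orders WHERE id = _", 1.5, 30),
    ("UPDATE accounts SET balance = _", 2.0, 5),
]

for query, latency, count in slowest_queries(sample_rows):
    print(f"{latency:.2f}s over {count} executions: {query}")
```

Weighting by execution count keeps a handful of slow executions from outranking a statement that is slow thousands of times.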
Understanding the crdb_internal tables allows you to build a comprehensive mental model of your cluster’s behavior. For instance, crdb_internal.ranges provides information about data distribution. You can query it to see which ranges are on which nodes, their sizes, and their status.
SELECT
  r.range_id,
  r.start_key,
  r.end_key,
  r.lease_holder,
  n.address
FROM
  crdb_internal.ranges AS r
JOIN
  crdb_internal.nodes AS n ON r.lease_holder = n.node_id
WHERE
  r.database_name = 'your_database' AND r.table_name = 'your_table'
LIMIT 10;
This helps you understand data locality and potential bottlenecks related to data placement. The lease_holder column shows which node currently holds the lease for that range, making it the primary point of contact for operations on that data.
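One quick way to spot a placement bottleneck is to count how many leases each node holds. Here is a minimal sketch of that idea, assuming the query results have been reduced to hypothetical (range_id, lease_holder) pairs; the function names are illustrative and not part of any CockroachDB client library.

```python
from collections import Counter


def leaseholder_counts(range_rows):
    """Count how many ranges each node holds the lease for.

    range_rows: iterable of (range_id, lease_holder) pairs (assumed shape).
    """
    return Counter(lease_holder for _range_id, lease_holder in range_rows)


def busiest_leaseholder(range_rows):
    """Return (node_id, share_of_leases) for the node with the most leases."""
    counts = leaseholder_counts(range_rows)
    if not counts:
        return None  # no ranges were supplied
    node_id, lease_count = counts.most_common(1)[0]
    return node_id, lease_count / sum(counts.values())


# Hypothetical sample: node 1 holds three of the four leases.
sample_ranges = [(101, 1), (102, 1), (103, 1), (104, 2)]

print(leaseholder_counts(sample_ranges))
print(busiest_leaseholder(sample_ranges))
```

A node that holds a disproportionate share of the leases for a hot table is a reasonable first suspect when that table’s traffic concentrates on one machine.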
Another crucial table is crdb_internal.kv_operation_estimates. This one is a bit more advanced, showing estimates for the cost of various key-value operations. It’s less about real-time monitoring and more about understanding the potential cost of operations.
If you’re debugging issues related to transaction contention, crdb_internal.transaction_contention_events is your friend. It records contention events in which one transaction had to wait on a lock held by another.
SELECT
  waiting_txn_id,
  blocking_txn_id,
  contention_duration,
  collection_ts
FROM
  crdb_internal.transaction_contention_events
WHERE
  collection_ts >= now() - INTERVAL '15 minutes'
ORDER BY
  collection_ts DESC;
This can reveal patterns of contention, like a long-running transaction holding locks that prevent other transactions from proceeding.
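To find that kind of pattern in a long result set, you can walk the wait chains back to the transactions at their head. The sketch below is one way to do it, assuming the rows have been reduced to hypothetical (waiting transaction ID, blocking transaction ID) pairs; it treats a transaction as a root blocker if it blocks others but never appears as a waiter itself.

```python
from collections import defaultdict


def root_blockers(contention_rows):
    """Map each root blocker to every transaction stuck behind it.

    contention_rows: iterable of (waiting_txn, blocking_txn) pairs
    (an assumed shape for illustration). Transactions that only appear
    in a wait cycle have no root and are not reported.
    """
    waiters_of = defaultdict(set)  # blocker -> transactions waiting on it
    waiting = set()                # every transaction seen as a waiter
    for waiting_txn, blocking_txn in contention_rows:
        waiters_of[blocking_txn].add(waiting_txn)
        waiting.add(waiting_txn)

    result = {}
    for blocker in list(waiters_of):
        if blocker in waiting:
            continue  # this transaction is itself blocked, so not a root
        seen = set()
        stack = [blocker]
        while stack:  # depth-first walk over direct and indirect waiters
            current = stack.pop()
            for waiter in waiters_of.get(current, ()):
                if waiter not in seen:
                    seen.add(waiter)
                    stack.append(waiter)
        result[blocker] = seen
    return result


# Hypothetical sample: b and c wait on a, and d waits on c.
sample_events = [("b", "a"), ("c", "a"), ("d", "c")]

for blocker, stuck in root_blockers(sample_events).items():
    print(blocker, "is blocking", sorted(stuck))
```

A root blocker with many transactions behind it is usually the long-running transaction worth inspecting first.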
One thing many people don’t realize about crdb_internal tables is that many of them are virtual: their contents are generated on demand or aggregated from data spread across the cluster. When you query crdb_internal.statement_statistics or crdb_internal.ranges, the information doesn’t necessarily reside on the node you’re connected to; it’s gathered from across the cluster. Queries against these tables can therefore incur network and processing overhead of their own, especially on very large clusters. It’s a distributed system querying a distributed system.
Once you’ve mastered crdb_internal for debugging, you’ll likely want to explore how to use the SQL API for programmatic cluster management.