CockroachDB’s built-in dashboards are not just pretty pictures; they’re the fastest way to understand what’s actually happening inside your cluster, often revealing performance bottlenecks before they become outages.
Let’s look at a typical CockroachDB cluster serving traffic. Imagine we’re hitting an API that writes data. We can see the traffic hitting our application servers, but what’s happening within CockroachDB?
Here’s a snapshot of a CockroachDB node’s metrics, as seen in its built-in UI, the DB Console (accessible by default at http://<node-address>:8080):
-- Overview Tab --
[Screenshot of CockroachDB UI Overview tab showing key metrics like SQL QPS, KV Ops/sec, Latency, Node Status]
-- SQL Tab --
[Screenshot of CockroachDB UI SQL tab showing Latency (p99, p95, mean), Transactions, Statements]
-- KV (Key-Value) Tab --
[Screenshot of CockroachDB UI KV tab showing Reads, Writes, Latency, Bytes In/Out]
These dashboards aggregate metrics from various internal components. The "SQL" tab shows you the performance of the SQL layer, while the "KV" tab reveals the performance of the underlying distributed key-value store. The "Overview" tab gives you a high-level health check.
What often surprises people is how directly the KV layer’s performance maps to SQL query performance. If your SQL queries are slow, the cause is frequently a bottleneck in the KV layer, which is responsible for distributing and storing all your data across the cluster.
Here’s how the system is structured internally:
- SQL Layer: Receives SQL queries, parses them, generates execution plans, and then translates these into Key-Value (KV) operations.
- KV Layer: The core of CockroachDB. It handles distributed reads and writes to data stored across multiple nodes. Each piece of data is stored as a key-value pair, and these pairs are grouped into "ranges" which are replicated and distributed.
- Replication Layer (Raft): Ensures data consistency and durability by replicating ranges across multiple nodes.
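When you execute a SQL query, the SQL layer might break it down into dozens, hundreds, or even thousands of individual KV operations. The latency you experience in your application is the SQL processing time plus the time spent waiting on those KV operations. Operations that depend on each other add up one after another, while independent ones can overlap, so the total lies somewhere between the slowest single operation and the sum of all of them.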
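As a rough mental model of how KV latency feeds into what the application sees, the sketch below computes two bounds for a single statement: one where every KV operation waits for the previous one, and one where all of them are in flight at once. The function name, the parameters, and the millisecond figures are hypothetical and chosen for illustration only; real statements usually fall between the two bounds.

```python
def estimate_sql_latency_ms(sql_overhead_ms, kv_latencies_ms, parallel=False):
    """Rough latency model: SQL overhead plus time spent on KV operations.

    parallel=False assumes each KV operation waits for the previous one.
    parallel=True assumes all KV operations are in flight at the same time.
    """
    if not kv_latencies_ms:
        return sql_overhead_ms
    kv_time = max(kv_latencies_ms) if parallel else sum(kv_latencies_ms)
    return sql_overhead_ms + kv_time


# Hypothetical statement: 2 ms of SQL processing and three KV operations.
kv_ops = [3, 5, 4]
sequential = estimate_sql_latency_ms(2, kv_ops)
overlapped = estimate_sql_latency_ms(2, kv_ops, parallel=True)
print(sequential, overlapped)  # 14 7
```

Either way, one slow KV operation raises both bounds, which is why KV latency deserves attention even when SQL QPS looks healthy.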
Monitoring with Prometheus
While the built-in dashboards are great for a quick glance, for robust, long-term monitoring and alerting, you’ll want to integrate with Prometheus. Every CockroachDB node exposes its metrics in Prometheus format at the /_status/vars path of its HTTP port, which is 8080 by default (e.g., http://<node-address>:8080/_status/vars).
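Here’s an illustrative snippet of what those metrics look like. The names below are simplified for readability; the exact names and labels depend on your CockroachDB version, so check your own endpoint’s output before building queries on them: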
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 150
# HELP cockroach_sql_qps SQL queries per second.
# TYPE cockroach_sql_qps counter
cockroach_sql_qps{node_id="1",type="total"} 12345.67
# HELP cockroach_kv_ops Key-value operations per second.
# TYPE cockroach_kv_ops counter
cockroach_kv_ops{node_id="1",type="reads"} 50000.00
cockroach_kv_ops{node_id="1",type="writes"} 10000.00
# HELP cockroach_liveness_heartbeat_latency_nanoseconds Latency of liveness heartbeats.
# TYPE cockroach_liveness_heartbeat_latency_nanoseconds histogram
cockroach_liveness_heartbeat_latency_nanoseconds_bucket{node_id="1",le="1000000"} 100
cockroach_liveness_heartbeat_latency_nanoseconds_bucket{node_id="1",le="10000000"} 200
cockroach_liveness_heartbeat_latency_nanoseconds_bucket{node_id="1",le="+Inf"} 300
cockroach_liveness_heartbeat_latency_nanoseconds_sum{node_id="1"} 1500000000
cockroach_liveness_heartbeat_latency_nanoseconds_count{node_id="1"} 300
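To make the text format concrete, here is a minimal Python sketch that parses lines like the ones above into (name, labels, value) tuples. It is a simplification rather than a full parser: it skips comment lines, ignores optional timestamps, and assumes label values contain no unusual escaping. The function name parse_metrics and the sample text are hypothetical; for production use, an existing Prometheus client library is the safer choice.

```python
import re

# metric_name{label="value",...} value   (labels are optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified pattern does not cover
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


sample_text = """
# TYPE go_goroutines gauge
go_goroutines 150
cockroach_kv_ops{node_id="1",type="reads"} 50000.00
cockroach_kv_ops{node_id="1",type="writes"} 10000.00
"""

for name, labels, value in parse_metrics(sample_text):
    print(name, labels, value)
```

A parser like this is handy for quick spot checks against a single node, for example confirming that a metric you plan to alert on is actually being exported.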
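To get this into Prometheus, you’d configure your prometheus.yml file to scrape each node. CockroachDB serves metrics at /_status/vars on the HTTP port (8080 by default) rather than at Prometheus’s default /metrics path, so the job needs an explicit metrics_path:
scrape_configs:
  - job_name: 'cockroachdb'
    metrics_path: '/_status/vars'
    scheme: 'http'  # use 'https' plus a tls_config for secure clusters
    static_configs:
      - targets: ['node1:8080', 'node2:8080', 'node3:8080']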
Once Prometheus is scraping these metrics, you can build dashboards in Grafana. Here are some key metrics to focus on:
- SQL QPS (cockroach_sql_qps): Total SQL queries per second. A sudden drop can indicate an issue with the SQL layer or upstream application.
- KV Ops (cockroach_kv_ops): Total KV operations per second, broken down by reads and writes. Spikes here often correlate directly with SQL QPS.
- KV Latency (cockroach_kv_latency_nanoseconds): This is crucial. Look at _sum and _count to calculate the average, or use the histogram buckets for percentiles (e.g., p95, p99). High KV latency is the most common cause of slow SQL queries.
- Replication RTT (cockroach_raft_rtt_nanoseconds): Round-trip time for Raft heartbeats. High RTT indicates network or node issues impacting replication.
- Store Bytes (crdb_node_store_bytes): Total bytes on disk per node. Helps identify uneven data distribution or disk full issues.
Consider the cockroach_kv_latency_nanoseconds metric. It’s a histogram, so dividing _sum by _count gives you the average. However, averages can be misleading, because a small fraction of very slow operations barely moves the mean. To understand the worst-case experience for your users, you need percentiles. You can estimate these with PromQL’s histogram_quantile function, for example in a Grafana panel:
histogram_quantile(0.95, sum by (le, node_id) (rate(cockroach_kv_latency_nanoseconds_bucket[5m])))
This query shows the 95th percentile KV latency over the last 5 minutes, computed separately for each node. The result is in nanoseconds, so divide by 1e6 to read it in milliseconds. If this value is consistently high (e.g., hundreds of milliseconds), your SQL queries will be slow.
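To see why percentiles and averages can tell different stories, the following Python sketch approximates what histogram_quantile does: it finds the bucket containing the requested rank and interpolates linearly inside it. This is a simplified re-implementation for illustration, not Prometheus’s actual code, and the bucket values are hypothetical. It assumes cumulative bucket counts and a final +Inf bucket.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes the highest bucket has upper bound +Inf and counts are cumulative.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the open-ended bucket: report the
                # highest finite bound, since nothing better is known.
                return buckets[i - 1][0] if i > 0 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency histogram in nanoseconds: (upper bound, cumulative count).
buckets = [(1_000_000, 100), (10_000_000, 200), (math.inf, 300)]
average_ns = 1_500_000_000 / 300  # _sum / _count

print(average_ns)                         # 5000000.0
print(histogram_quantile(0.5, buckets))   # 5500000.0
print(histogram_quantile(0.95, buckets))  # 10000000
```

With these numbers the average is 5 ms, while the 95th percentile is reported as at least 10 ms because it lands in the open-ended bucket. Bucket boundaries therefore limit how precisely high percentiles can be estimated.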
The most counterintuitive aspect of CockroachDB’s performance is how tightly coupled the SQL and KV layers are, and how a single slow KV operation can cascade into slow SQL queries. It’s not just about the number of queries, but the cost of each underlying KV read or write.
The next step is often understanding how to optimize those KV operations, which leads to tuning your schema and SQL queries for better range access patterns.