CockroachDB changefeeds, when streaming to Kafka, aren’t just replicating data; they’re fundamentally changing how you think about distributed database state.

Let’s see it in action. Imagine you have a simple users table:

CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    username STRING UNIQUE NOT NULL,
    email STRING,
    created_at TIMESTAMPTZ DEFAULT now()
);

Now, you want to capture every change to this table and send it to a Kafka topic named user_changes. You’d set up a changefeed like this:

CREATE CHANGEFEED FOR users
INTO 'kafka://kafka-broker-1:9092,kafka-broker-2:9092'
WITH (
    topic = 'user_changes',
    key_format = 'json',
    value_format = 'json'
);

When you insert a new user:

INSERT INTO users (username, email) VALUES ('alice', 'alice@example.com');

A message appears in your user_changes Kafka topic. The key might look like this (JSON):

{
  "user_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef"
}

And the value will contain the full row data as it was after the insert:

{
  "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
  "username": "alice",
  "email": "alice@example.com",
  "created_at": "2023-10-27T10:30:00.123456Z"
}

If you update that user:

UPDATE users SET email = 'alice.smith@example.com' WHERE username = 'alice';

A new message arrives in Kafka. The key is the same, but the value reflects the new state:

{
  "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
  "username": "alice",
  "email": "alice.smith@example.com",
  "created_at": "2023-10-27T10:30:00.123456Z"
}

If you delete the user:

DELETE FROM users WHERE username = 'alice';

The value will now indicate the deletion. The value_format determines the exact structure, but typically, a before and after field is present, with after being null for deletes.

{
  "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
  "username": "alice",
  "email": "alice.smith@example.com",
  "created_at": "2023-10-27T10:30:00.123456Z"
}

(Note: The before field would contain the state before deletion; after would be null).

The core problem changefeeds solve is the challenge of reliably and efficiently capturing transactional changes from a distributed database and making them available for downstream consumption in near real-time. Traditional methods, like polling or batch ETL, introduce latency, complexity, and can put significant load on the database. Changefeeds, built into CockroachDB’s transactional logging, directly tap into the commit stream.

Internally, a changefeed works by reading from CockroachDB’s distributed transaction log. When a transaction commits, its changes are written to this log. The changefeed process, running as a set of background jobs within CockroachDB, monitors this log. It reconstructs the row-level changes from the log entries, transforms them according to the specified formats (json, protobuf, avro), and then publishes them to the configured Kafka topic. The key_format and value_format options are crucial for defining how the data is serialized. For Kafka, key_format='json' and value_format='json' are common, producing standard JSON messages. The key is typically derived from the primary key of the row, allowing downstream consumers to easily identify which record was modified.

The WITH clause offers fine-grained control. topic specifies the Kafka topic name. confluent.schema.registry can be used for Avro serialization. envelope controls whether to include metadata like timestamp, txnid, span, and resolved. resolved is particularly interesting; if enabled, it sends periodic "heartbeat" messages indicating the furthest point in the transaction log the changefeed has processed, even if no data changes occurred. This is vital for consumers to understand their progress and for coordinating distributed transactions.

What most people don’t realize is how the resolved option fundamentally changes the semantics of the changefeed for consumers. Without resolved, a consumer might see a series of data messages and assume they represent a complete set of changes up to that point. However, if the changefeed process experienced a temporary slowdown or network hiccup, the consumer could be missing intermediate transactions. By enabling resolved and processing these special messages, a consumer can guarantee exactly-once processing semantics by only committing offsets after a resolved message has been received for a particular timestamp. This is because resolved messages act as checkpoints, confirming that all data messages before that timestamp have been emitted.

The next logical step is to explore how to consume these changefeeds reliably in your applications, particularly focusing on exactly-once processing guarantees.

Want structured learning?

Take the full Cockroachdb course →