Debezium doesn’t just read database changes; it transforms them into a stream of immutable events that can be processed by other systems.

Let’s see Debezium in action with PostgreSQL. Imagine we have a simple users table:

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    username VARCHAR(50) NOT NULL UNIQUE,
    email VARCHAR(100)
);

INSERT INTO users (username, email) VALUES ('alice', 'alice@example.com');

Now, let’s set up Debezium to monitor this table. We’ll use Kafka Connect, a distributed system for streaming data between Kafka and other systems.

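First, ensure you have Kafka and Kafka Connect running, and that PostgreSQL is configured for logical decoding (wal_level = logical) with a database user that has replication privileges. Then, deploy the Debezium PostgreSQL connector. This typically involves downloading the connector plugin archive, extracting it into a directory listed in Kafka Connect’s plugin.path, and restarting the Connect worker so it picks up the plugin.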

The connector configuration looks like this:

{
  "name": "users-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "your_postgres_host",
    "database.port": "5432",
    "database.user": "debezium_user",
    "database.password": "debezium_password",
    "database.dbname": "your_database_name",
    "database.server.name": "dbserver1",
    "plugin.name": "pgoutput",
    "table.include.list": "public.users",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false"
  }
}
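Connectors are usually registered by POSTing this JSON to the Kafka Connect REST API. The following is a minimal sketch using only the Python standard library. It assumes the Connect worker listens on localhost:8083 (the default REST port); the helper names build_register_request and register are illustrative, not part of Debezium or Kafka.

```python
import json
import urllib.request

# Assumption: Kafka Connect's REST API is reachable at its default port.
CONNECT_URL = "http://localhost:8083/connectors"


def build_register_request(connector: dict, url: str = CONNECT_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request that registers a connector."""
    body = json.dumps(connector).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register(connector: dict) -> int:
    """Send the request; needs a running Connect worker. Returns the HTTP status."""
    with urllib.request.urlopen(build_register_request(connector)) as response:
        return response.status


# Trimmed copy of the configuration above; in practice pass the full "config" map.
connector = {
    "name": "users-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "table.include.list": "public.users",
    },
}

request = build_register_request(connector)
print(request.get_method(), request.full_url)
```

Calling register(connector) against a live worker sends the request. Building the request separately makes it easy to inspect the payload before anything is sent.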

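Once this connector is deployed and running in Kafka Connect, any changes to the public.users table will be published as events to a Kafka topic named dbserver1.public.users, following the pattern serverName.schemaName.tableName. (In Debezium 2.0 and later, this prefix is set with topic.prefix instead of database.server.name.)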

Let’s simulate a change:

UPDATE users SET email = 'alice.updated@example.com' WHERE username = 'alice';

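When this UPDATE occurs, Debezium captures the change from PostgreSQL’s logical replication stream and produces a Kafka message. The value of this message, inspected with a Kafka consumer, would look something like this (simplified JSON). Note that before is fully populated only when the table uses REPLICA IDENTITY FULL; with PostgreSQL’s default replica identity, the before field of an update event typically carries only primary key values or is null.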

{
  "before": {
    "id": 1,
    "username": "alice",
    "email": "alice@example.com"
  },
  "after": {
    "id": 1,
    "username": "alice",
    "email": "alice.updated@example.com"
  },
  "source": {
    "version": "1.9.7.Final",
    "connector": "postgresql",
    "name": "dbserver1",
    "ts_ms": 1678886400000,
    "snapshot": "false",
    "db": "your_database_name",
    "schema": "public",
    "table": "users",
    "txId": 1234,
    "lsn": 305419896,
    "xmin": 567
  },
  "op": "u",
  "ts_ms": 1678886401000
}

Here’s the breakdown of what this event signifies:

  • op: The operation that occurred (c for create, u for update, d for delete, r for read/snapshot).
  • before: The state of the row before the change.
  • after: The state of the row after the change.
  • source: Metadata about the origin of the change, including the database, schema, table, and transaction details.
  • ts_ms: Appears twice. source.ts_ms is when the change was recorded by the database; the top-level ts_ms is when Debezium processed it.
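To make the fields above concrete, here is a small sketch that decodes the op code and compares the two ts_ms values from the sample event. It assumes the flat envelope produced when the converters run with schemas.enable set to false, as in the configuration shown earlier. The summarize function and OP_NAMES table are illustrative names, not part of Debezium.

```python
# Human-readable names for the op codes described above.
OP_NAMES = {"c": "create", "u": "update", "d": "delete", "r": "read (snapshot)"}


def summarize(event: dict) -> dict:
    """Extract the operation, the qualified table name, and the capture lag."""
    source = event["source"]
    return {
        "operation": OP_NAMES.get(event["op"], "unknown"),
        "table": f'{source["schema"]}.{source["table"]}',
        # Top-level ts_ms (Debezium) minus source.ts_ms (database).
        "capture_lag_ms": event["ts_ms"] - source["ts_ms"],
    }


# Trimmed version of the sample update event.
event = {
    "before": {"id": 1, "username": "alice", "email": "alice@example.com"},
    "after": {"id": 1, "username": "alice", "email": "alice.updated@example.com"},
    "source": {"schema": "public", "table": "users", "ts_ms": 1678886400000},
    "op": "u",
    "ts_ms": 1678886401000,
}

summary = summarize(event)
print(summary)
# → {'operation': 'update', 'table': 'public.users', 'capture_lag_ms': 1000}
```

The difference between the two timestamps gives a rough measure of how far the connector lags behind the database.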

This event-driven model allows you to build reactive systems. For instance, you could have another Kafka consumer that listens to the dbserver1.public.users topic and, upon receiving an UPDATE event for a user, triggers an email notification to the user’s new email address. Or, you could use this stream to populate a search index, update a data warehouse, or trigger microservice workflows.
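The email-notification idea could be sketched as a pure function that inspects one message value and decides whether a notification is due. Wiring it to a real Kafka consumer is left out here, and the function name new_email_to_notify is hypothetical. The sketch assumes JSON values without embedded schemas, and it skips events whose before image is missing, because the old and new email cannot be compared in that case.

```python
import json
from typing import Optional


def new_email_to_notify(raw_value: Optional[bytes]) -> Optional[str]:
    """Return the new email address if this message is an update that changed it."""
    if raw_value is None:
        # A null value is a tombstone, which can follow a delete event.
        return None
    event = json.loads(raw_value)
    if event.get("op") != "u":
        return None
    before = event.get("before")
    after = event.get("after") or {}
    if not before:
        # Without a before image we cannot tell whether the email changed.
        return None
    new_email = after.get("email")
    if new_email and new_email != before.get("email"):
        return new_email
    return None


sample = json.dumps({
    "before": {"id": 1, "username": "alice", "email": "alice@example.com"},
    "after": {"id": 1, "username": "alice", "email": "alice.updated@example.com"},
    "op": "u",
}).encode("utf-8")

print(new_email_to_notify(sample))  # → alice.updated@example.com
```

A consumer loop would call this for each message value from dbserver1.public.users and send a notification whenever the result is not None.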

The core problem Debezium solves is bridging the gap between transactional databases and event-streaming platforms, enabling real-time data synchronization and event-driven architectures without altering your application code or requiring complex polling mechanisms. It leverages the database’s own transaction logs (like PostgreSQL’s Write-Ahead Log or WAL) for a reliable, low-latency capture.

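A crucial detail that often trips people up is the snapshot mode. When you first start a Debezium connector, it needs to establish a baseline of your existing data. This is done via a "snapshot," and the snapshot.mode configuration property controls it. The supported values vary by connector and Debezium version, so check the documentation for your release. Common ones are initial (the default: take a snapshot only if the connector has no stored offsets, then start streaming changes), never (skip the snapshot and only stream changes, resuming from the stored offset or, if none exists, from the point where the replication slot was created), when_needed (where available, take a snapshot on startup if offsets are missing or point to log positions that are no longer available), and schema_only (where available, capture table structure but no row data; newer releases call this no_data). If you’re migrating an existing database and want to ensure your downstream systems have the current state before you start processing live changes, initial is often the right choice. But be aware: a full snapshot of a large table can be resource-intensive and take a significant amount of time. Changes committed while the initial snapshot runs are not lost; the replication slot retains them in the WAL and Debezium streams them once the snapshot completes, which also means WAL usage can grow during a long snapshot.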
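Downstream consumers often need to tell snapshot rows apart from live changes, for example to treat snapshot reads as upserts. A minimal sketch follows. It relies on the op code r and on the source.snapshot marker shown in the sample event, and it assumes that any marker value other than "false" indicates a snapshot row. The helper name is_snapshot_event is illustrative.

```python
def is_snapshot_event(event: dict) -> bool:
    """Return True if the event was produced while reading existing rows."""
    marker = (event.get("source") or {}).get("snapshot")
    # Assumption: a missing marker or "false" means a streamed change.
    is_marked = marker is not None and str(marker).lower() != "false"
    return event.get("op") == "r" or is_marked


streamed = {"op": "u", "source": {"snapshot": "false"}}
snapshot_row = {"op": "r", "source": {"snapshot": "true"}}

print(is_snapshot_event(streamed), is_snapshot_event(snapshot_row))
# → False True
```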

The next concept you’ll likely encounter is handling schema changes.
