RabbitMQ clustering for high availability isn’t about replicating queues; it’s about distributing the management of those queues across multiple nodes, so if one node falls over, the others can still see and manage everything.

Let’s see it in action. Imagine two RabbitMQ nodes, rabbit1 and rabbit2.

On rabbit1, we start it normally:

sudo systemctl start rabbitmq-server@rabbit1

And on rabbit2, we’ll do the same, but we need to tell it how to join rabbit1’s cluster. The key is the cluster_formation.peer_discovery_backend setting, together with the cluster_formation.k8s.* settings that configure the chosen backend, in the RabbitMQ configuration file (/etc/rabbitmq/rabbitmq.conf).

For a Kubernetes environment, we’d typically use k8s discovery:

/etc/rabbitmq/rabbitmq.conf on rabbit2:

# Key names follow the rabbitmq_peer_discovery_k8s plugin as shipped with
# RabbitMQ 3.x; check the documentation for the version you run.
cluster_formation.peer_discovery_backend = k8s
cluster_formation.k8s.host = <your_k8s_api_server_host>
cluster_formation.k8s.port = 443
cluster_formation.k8s.scheme = https
# The service account token and namespace are read from files, not set inline
cluster_formation.k8s.token_path = /var/run/secrets/kubernetes.io/serviceaccount/token
cluster_formation.k8s.namespace_path = /var/run/secrets/kubernetes.io/serviceaccount/namespace
cluster_formation.k8s.cert_path = /path/to/ca.crt
cluster_formation.k8s.service_name = rabbitmq-internal

Then start rabbit2:

sudo systemctl start rabbitmq-server@rabbit2

RabbitMQ will use the Kubernetes API to discover other RabbitMQ pods in the same namespace and service. Once rabbit2 sees rabbit1 (and vice-versa), they form a cluster.
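One way to confirm that the nodes really formed a cluster is to ask the management HTTP API for its node list. The sketch below assumes the rabbitmq_management plugin is enabled and reachable at http://localhost:15672 with the default guest/guest credentials; the function names are made up for illustration and are not part of any RabbitMQ client library.

```python
import base64
import json
import urllib.request


def summarize_nodes(nodes):
    """Split an /api/nodes payload into running and stopped node names."""
    running = sorted(n["name"] for n in nodes if n.get("running"))
    stopped = sorted(n["name"] for n in nodes if not n.get("running"))
    return running, stopped


def fetch_nodes(base_url="http://localhost:15672", user="guest", password="guest"):
    """GET /api/nodes from the management plugin.

    The URL and credentials are assumptions for this sketch; adjust them
    for your deployment.
    """
    request = urllib.request.Request(f"{base_url}/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)
```

Calling summarize_nodes(fetch_nodes()) against a healthy two-node cluster should report both node names as running. A node that appears in the stopped list is still a cluster member but is not currently up.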

The core problem this solves is single points of failure. If you run a single RabbitMQ node and it dies, every queue, exchange, and binding becomes unreachable until that node comes back. In a cluster, each node holds a copy of the metadata that describes the cluster topology. When you declare a queue, the declaration is propagated to the cluster and every node updates its own view.

Internally, RabbitMQ has traditionally used a distributed database called Mnesia to store this metadata (recent releases can use Khepri, a Raft-based store, instead). When nodes form a cluster, they sync their metadata stores. This sync ensures that every node knows about every other node and all the declared entities.

Crucially, queue contents are not replicated by default. A queue is assigned to a specific node, which becomes its "owner." Clients can publish through any node in the cluster, and RabbitMQ routes the messages to the owner. For high availability of the messages themselves, you need a replicated queue type. Quorum queues are the modern, recommended approach. They use the Raft consensus algorithm to replicate messages across multiple nodes. Classic mirrored queues were the older alternative, but they were deprecated and then removed in RabbitMQ 4.0.

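Let’s say you declare a quorum queue named my-ha-queue with x-queue-type set to quorum and x-quorum-initial-group-size set to 3 (meaning the queue starts out with replicas on three nodes). The ha-mode and ha-params keys belong to classic mirrored queue policies and do not apply to quorum queues.

import pika

# Connect to any node in the cluster; 'localhost' is just this example's host.
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

channel.queue_declare(
    queue='my-ha-queue',
    durable=True,  # quorum queues must be declared durable
    arguments={
        'x-queue-type': 'quorum',
        'x-quorum-initial-group-size': 3,  # start with replicas on 3 nodes
        # No delivery-mode argument here: it is a per-message property,
        # and quorum queues always write messages to disk.
    }
)
print("Queue 'my-ha-queue' declared as a quorum queue.")
connection.close()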

If the node owning my-ha-queue (the one hosting its leader replica) goes down, the remaining replicas automatically elect a new leader, which becomes the new owner. The queue stays available as long as a majority of its replicas are still online.
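Raft can only elect a new leader while a majority of the queue’s replicas are online, which puts a hard limit on how many node failures a quorum queue survives. A small sketch of that arithmetic (the function names are illustrative, not part of RabbitMQ or pika):

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority remains online."""
    return replicas - majority(replicas)


for n in (2, 3, 5):
    print(f"{n} replicas: majority={majority(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Three replicas tolerate one failure and five tolerate two, while two replicas tolerate none. This is why quorum queues are usually run with an odd number of replicas, three or more.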

The one thing most people don’t realize is that even with a clustered RabbitMQ, if you use classic queues and don’t switch to a replicated queue type such as quorum queues, your messages are still only on the single node that owns the queue. Clustering only makes the management of those queues highly available, not the message data itself, unless you explicitly configure it.

The next problem you’ll likely encounter is understanding how clients should connect to a clustered RabbitMQ, especially during node failures.
