Add (ha) rabbit cluster (but not use it) ⚠️ #1179

YuryHrytsuk · 2025-08-14T13:03:01Z

What do these changes do?

Add standlone rabbitmq cluster stack.

Next step:

Update simcore service to use new rabbitmq cluster stack (e.g. internal master: switch traffic to cluster rabbit #1197)

FYI: @pcrespov @GitHK

Related issue/s

HA self-hosted RabbitMQ #1176

Related PR/s

master configuration https://git.speag.com/oSparc/osparc-ops-deployment-configuration/-/merge_requests/1542

Devops Actions ⚠️

create new docker swarm overlay network for rabbit

Prerequisites

Blocked by Update docker engine to major version 26 #1198
- Context: it requires Add support for --detach flag in stack deploy docker/cli#4258
rabbit docker network should be created before deploying rabbit

Checklist

I tested and it works

New stack

The Stack has been included in CI Workflow

New service

If exposed via traefik

~~Service's Public URL is included in maintenance mode~~ --> unrelated
~~Service's Public URL is included in testing mode~~ --> unrelated
Service's has Traefik (Service Loadbalancer) Healthcheck enabled --> haproxy healthcheck is monitoring rabbit nodes
~~Credentials page is updated~~ --> to be updated in another PR when we switch traffic to this rabbit cluster
~~Url added to e2e test services~~ (e2e test checking that URL can be accessed) --> to be done when we swtich traffic

YuryHrytsuk · 2025-08-14T13:06:35Z

TODO

Clients should properly use HA rabbit

configure default replication factor for quorum queues? --> via rabbitmq.conf
how to connect to a multi-node cluster --> it is hidden by haproxy (loadbalancer) --> no changes
for backenders: make sure clients retry connection on failure

Cluster Formation

Source https://www.rabbitmq.com/docs/clustering

Ways of Forming a Cluster

Declaratively by listing cluster nodes in config file <--- we use
~~Declaratively using DNS-based discovery~~
~~Declaratively using AWS (EC2) instance discovery~~
~~Declaratively using Kubernetes discovery~~
~~Declaratively using Consul-based discovery~~
~~Declaratively using etcd-based discovery~~

Node Names (Identifiers)

must be unique --> achieved via docker service name and env variable

Cluster Formation Requirements

every cluster member must be able to resolve hostnames of every other cluster member, its own hostname, as well as machines on which command line tools such as rabbitmqctl might be used --> docker swarm networking

Ports That Must Be Opened for Clustering and Replication --> all works by default in docker swarm (all ports allowed)

4369: epmd, a helper discovery daemon used by RabbitMQ nodes and CLI tools
6000 through 6500: used by RabbitMQ Stream replication
25672: used for inter-node and CLI tools communication and is allocated from a dynamic range (limited to a single port by default, computed as AMQP port + 20000)
35672-35682: used by CLI tools for communication with nodes and is allocated from a dynamic

Nodes in a Cluster

Nodes are Equal Peers

For two nodes to be able to communicate they must have the same shared secret called the Erlang cookie.

Erlang cookie generation should be done at cluster deployment stage ⚠️ --> achieved via common secret

Node Counts and Quorum:

Two node clusters are highly recommended against --> added a test to forbit 2 cluster node configuraiton

Clustering and Clients

Messaging Protocols

In case of a node failure, clients should be able to reconnect to a different node, recover their topology and continue operation --> Task for backenders
Most client libraries accept a list of endpoints --> we use loadbalancer and 1 endpoint

Stream Clients

RabbitMQ Stream protocol clients behave differently from messaging protocols clients --> unrelated for us

Queue and Stream Leader Replica Placement

Queue leaders should be reasonably evenly distributed across cluster nodes (see this doc) --> we have rabbit connections evenly distributed + added comment to do changes if this (haproxy load balancing mechanism) is changed
- https://www.rabbitmq.com/docs/quorum-queues#member-management

Cleaning volumes

Avoid tasks taking unlimited space --> do no retry jobs + always remove stack before starting new tasks
Avoid unexpected volume removal
- Deleting volumes failed but tasks keep running --> do not retry jobs + use timeouts
- Deleting volumes unrelated to rabbit (safeguards) --> added

HA Proxy highly available

running 2+ replicas and statistics --> we do not expose / use statistics at the beginning

+ haproxy extra configuration

matusdrobuliak66

Good job 👍 thanks

mrnicegyu11 · 2025-09-16T07:59:32Z

services/rabbit/.operations.Makefile

+
+validate-NODE_COUNT: guard-NODE_COUNT
+	@if ! echo "$(NODE_COUNT)" | grep --quiet --extended-regexp '^[1-9]$$'; then \
+		echo NODE_COUNT must be a positive single digit integer; \


minor: NODE_COUNT must be a positive single digit integer > 0

mrnicegyu11 · 2025-09-16T08:00:31Z

services/rabbit/.operations.Makefile

+	fi
+
+validate-node-ix0%: .env
+	@if ! echo "$*" | grep --quiet --extended-regexp '^[0-9]+$$'; then \


minor: since you will validate that the integer is >= 1 in a later row, you can also already check that in the regex as such: ^[1-9]+$$

mrnicegyu11 · 2025-09-16T08:00:58Z

services/rabbit/.operations.Makefile

+start-cluster: start-all-nodes start-loadbalancer
+
+update-cluster stop-cluster:
+	@$(error This operation may break cluster. Check README for details.)


I like this dummy target with an error

mrnicegyu11 · 2025-09-16T08:03:58Z

services/rabbit/Makefile

+	envsubst < $< > $@; \
+	echo NODE_INDEX=$* >> $@
+
+.PRECIOUS: docker-compose.node0%.yml


PRECIOUS is a new thing to me, reading from https://www.gnu.org/software/make/manual/html_node/Special-Targets.html I think these could actually be "regular" .PHONY targets, or not? 🤔

mrnicegyu11 · 2025-09-16T08:04:29Z

services/rabbit/Makefile

+	echo NODE_INDEX=$* >> $@
+
+.PRECIOUS: docker-compose.node0%.yml
+docker-compose.node0%.yml: docker-compose.node0x.yml.j2 \


cool stuff with the %, a bit hard to read if one doesnt know makefiles but we are makefile experts :D

mrnicegyu11 · 2025-09-16T08:07:32Z

services/rabbit/docker-compose.node0x.yml.j2

+      start_interval: 10s
+
+volumes:
+  rabbit0{{ NODE_INDEX }}_data:


cool stuff with the looping/templating for multiple nodes.

We used to have a kind-of similar thing for the on-premise minio (was runnning on dalco-prod to provide on-prem S3), you can compare and crosscheck if you want. Maybe there are somethings to find, dont remeber actually https://github.com/ITISFoundation/osparc-ops-environments/blob/8f22a93acf33ec70b55d889e7dae26a4756accdb/services/minio/docker-compose.yaml.j2

mrnicegyu11 · 2025-09-16T08:08:08Z

services/rabbit/docker-compose.node0x.yml.j2

+    deploy:
+      placement:
+        constraints:
+          - node.labels.rabbit0{{ NODE_INDEX }} == true


This will require a docker labels change in osparc-ops-deployment-configuration and associated PRs I guess :)

mrnicegyu11 · 2025-09-16T08:08:56Z

services/rabbit/docker-compose.node0x.yml.j2

+        gid: "999"
+    volumes:
+      - rabbit0{{ NODE_INDEX }}_data:/var/lib/rabbitmq
+    # TODO: sync with existing rabbit attached networks


minor: not sure what this todo actually means, dont fully get it

mrnicegyu11 · 2025-09-16T08:10:15Z

services/rabbit/configs/rabbitmq.conf.j2

@@ -0,0 +1,19 @@
+{% set NODE_IXS = range(1, (RABBIT_CLUSTER_NODE_COUNT | int) + 1) -%}


minor: can we sync this with how rabbit is configured in osparc-simcore, so that the backend dev's setup mimicks the prod one closely?

mrnicegyu11 · 2025-09-16T08:10:38Z

services/rabbit/configs/haproxy.cfg.j2

+# haproxy by default resolves server hostname only once
+# this breaks if container restarts. By using resolvers
+# we tell haproxy to re-resolve the hostname (so container
+# restarts are handled properly)


makes sense, good find

mrnicegyu11 · 2025-09-16T08:11:29Z

services/rabbit/configs/haproxy.cfg.j2

+    balance roundrobin
+
+    option forwardfor
+    http-request set-header X-Forwarded-Port %[dst_port]


out of curioisity: is there a reason this must be set that you remember? due to ha proxy?

mrnicegyu11 · 2025-09-16T08:11:46Z

services/rabbit/configs/haproxy.cfg.j2

+{% for ix in NODE_IXS %}
+    server rabbit0{{ ix }} rabbit-node0{{ ix }}_rabbit0{{ ix }}:{{ RABBIT_MANAGEMENT_PORT }} check resolvers dockerdns init-addr libc,none inter 5s rise 2 fall 3
+{%- endfor %}
+# keep new line in the end to avoid "Missing LF on last line" error


mrnicegyu11 · 2025-09-16T08:13:10Z

services/rabbit/README.md

+
+Source: https://www.rabbitmq.com/docs/next/configure#config-changes-effects
+
+## Enable node Maintenance mode


very good readme, can you write one sentence or link to docs that explain what maintenance mode does?

mrnicegyu11 · 2025-09-16T08:18:15Z

services/rabbit/docker-compose.loadbalancer.yml.j2

+          cpus: "0.1"
+          memory: "128M"
+    healthcheck: # https://stackoverflow.com/a/76513320/12124525
+      test: bash -c 'echo "" > /dev/tcp/127.0.0.1/32087 || exit 1'


mrnicegyu11

thanks a lot for the huge effort, this is (by design) working around many limitations of docker swarm, but nevertheless I see that you accounted for many pitfalls and issues. It looks promissing and robust. Let me know if you need help during the rollout, and I am curious to see if issues pop up or if this "just works" :--)

YuryHrytsuk added 2 commits August 14, 2025 09:28

add-ha-rabbit

e38bae4

Merge remote-tracking branch 'upstream/main' into add-ha-rabbit

901ee0c

YuryHrytsuk self-assigned this Aug 14, 2025

YuryHrytsuk added 11 commits August 20, 2025 09:58

Add ha rabbit

1be4f1c

Document erlang cookie rotation

9ad628f

Merge remote-tracking branch 'upstream/main' into add-ha-rabbit

8f007fc

Add ha proxy

cf8bbfa

Further configuration

8ca30d7

Document autoscaling (not supported)

439d56a

Merge remote-tracking branch 'upstream/main' into add-ha-rabbit

2f86ff9

More configurable parameters

ebd87c9

minor improvements

18e172b

Add resource limits/reservations

1f52e7c

Add haproxy resources

c36db8a

YuryHrytsuk changed the title ~~Add ha rabbit~~ Add ha rabbit (but not use it) Aug 28, 2025

YuryHrytsuk changed the title ~~Add ha rabbit (but not use it)~~ Add (ha) rabbit cluster Aug 28, 2025

YuryHrytsuk added 6 commits August 28, 2025 11:50

Document side effect of haproxy round robin

2ba480e

Add healthcheck for haproxy

4714fa4

Update readme

4706404

Removing volumes

0934599

Merge remote-tracking branch 'upstream/main' into add-ha-rabbit

51cd721

Robust volume clean up

e42a00e

+ haproxy extra configuration

YuryHrytsuk changed the title ~~Add (ha) rabbit cluster~~ Add (ha) rabbit cluster (but not use it) Sep 3, 2025

YuryHrytsuk added 6 commits September 4, 2025 09:12

Simplification

4d7d3e3

Add confirmation dialogue

44a9ebe

Unification

f63e5b6

Minor clean up

8d28184

update gitignore

1a95eae

fixes after clean up

3409428

YuryHrytsuk added 10 commits September 11, 2025 15:31

improvements

cf6a1c8

improvements

c83b4f1

fixes and improvements

8f420cc

remove leftovers

9455ae5

Improve doc

c2bef0c

fixes

2a72bfa

Improve README

239ef37

Merge remote-tracking branch 'upstream/main' into add-ha-rabbit

1735293

remove lines

fb83f59

Clean up

a8cd85f

YuryHrytsuk requested review from sanderegg and matusdrobuliak66 September 16, 2025 07:06

YuryHrytsuk marked this pull request as ready for review September 16, 2025 07:06

YuryHrytsuk requested a review from mrnicegyu11 as a code owner September 16, 2025 07:06

matusdrobuliak66 approved these changes Sep 16, 2025

View reviewed changes

mrnicegyu11 reviewed Sep 16, 2025

View reviewed changes

mrnicegyu11 approved these changes Sep 16, 2025

View reviewed changes

		@@ -0,0 +1,19 @@
		{% set NODE_IXS = range(1, (RABBIT_CLUSTER_NODE_COUNT \| int) + 1) -%}


		Source: https://www.rabbitmq.com/docs/next/configure#config-changes-effects

		## Enable node Maintenance mode

Add (ha) rabbit cluster (but not use it) ⚠️ #1179

Are you sure you want to change the base?

Add (ha) rabbit cluster (but not use it) ⚠️ #1179

Conversation

YuryHrytsuk commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What do these changes do?

Related issue/s

Related PR/s

Devops Actions ⚠️

Prerequisites

Checklist

Uh oh!

YuryHrytsuk commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TODO

Cluster Formation

Clustering and Clients

Cleaning volumes

HA Proxy highly available

Uh oh!

matusdrobuliak66 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrnicegyu11 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

YuryHrytsuk commented Aug 14, 2025 •

edited

Loading

YuryHrytsuk commented Aug 14, 2025 •

edited

Loading