Add (ha) rabbit cluster (but not use it) ⚠️ #1179
base: main
Conversation
TODO
Clients should properly use HA rabbit
Cluster Formation
Source: https://www.rabbitmq.com/docs/clustering
Ways of Forming a Cluster
Node Names (Identifiers)
Cluster Formation Requirements
Ports That Must Be Opened for Clustering and Replication --> all works by default in docker swarm (all ports allowed)
Nodes in a Cluster
For two nodes to be able to communicate they must have the same shared secret called the Erlang cookie (a command sketch follows this list).
Node Counts and Quorum:
Clustering and Clients
Messaging Protocols
Stream Clients
Queue and Stream Leader Replica Placement
Cleaning volumes
HAProxy highly available
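As a sketch of what the Erlang cookie item above boils down to: once two nodes hold the same cookie (the file /var/lib/rabbitmq/.erlang.cookie), one node can be joined to the other. Node names here are placeholders; how this PR actually forms the cluster may differ:

rabbitmqctl -n rabbit@rabbit02 stop_app
rabbitmqctl -n rabbit@rabbit02 join_cluster rabbit@rabbit01
rabbitmqctl -n rabbit@rabbit02 start_app
rabbitmqctl -n rabbit@rabbit02 cluster_status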
Good job 👍 thanks
validate-NODE_COUNT: guard-NODE_COUNT
	@if ! echo "$(NODE_COUNT)" | grep --quiet --extended-regexp '^[1-9]$$'; then \
		echo NODE_COUNT must be a positive single digit integer; \
minor: NODE_COUNT must be a positive single digit integer > 0
	fi
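The guard-NODE_COUNT prerequisite is not shown in this hunk; a common implementation of such guard targets (an assumption, not necessarily what this Makefile does) is:

# fail early if the named variable is empty or undefined
guard-%:
	@if [ -z '${${*}}' ]; then \
		echo "Environment variable $* not set"; \
		exit 1; \
	fi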
validate-node-ix0%: .env
	@if ! echo "$*" | grep --quiet --extended-regexp '^[0-9]+$$'; then \
minor: since you validate that the integer is >= 1 on a later line, you could already check that in the regex: ^[1-9]+$$
start-cluster: start-all-nodes start-loadbalancer
update-cluster stop-cluster:
	@$(error This operation may break cluster. Check README for details.)
I like this dummy target with an error
	envsubst < $< > $@; \
	echo NODE_INDEX=$* >> $@
.PRECIOUS: docker-compose.node0%.yml
.PRECIOUS is a new thing to me; reading from https://www.gnu.org/software/make/manual/html_node/Special-Targets.html I think these could actually be "regular" .PHONY targets, or not? 🤔
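As background: .PHONY marks targets that do not correspond to files and are rebuilt every time, while .PRECIOUS tells make not to delete files produced by pattern rules (as intermediates, or when make is interrupted). Since docker-compose.node0%.yml is a real generated file, .PRECIOUS fits better. A minimal sketch of the difference (names taken from this PR, recipes simplified):

# always re-run: the target is not a file
.PHONY: start-cluster
start-cluster:
	@echo "starting cluster"

# keep the generated compose file even when make would otherwise remove it
# (e.g. as an intermediate of a pattern rule, or after Ctrl-C)
.PRECIOUS: docker-compose.node0%.yml
docker-compose.node0%.yml: docker-compose.node0x.yml.j2
	envsubst < $< > $@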
	echo NODE_INDEX=$* >> $@
.PRECIOUS: docker-compose.node0%.yml
docker-compose.node0%.yml: docker-compose.node0x.yml.j2 \
cool stuff with the %, a bit hard to read if one doesn't know makefiles, but we are makefile experts :D
start_interval: 10s
volumes:
  rabbit0{{ NODE_INDEX }}_data:
cool stuff with the looping/templating for multiple nodes.
We used to have a kind-of similar thing for the on-premise minio (it was running on dalco-prod to provide on-prem S3), you can compare and crosscheck if you want. Maybe there is something to find, I don't remember actually: https://github.com/ITISFoundation/osparc-ops-environments/blob/8f22a93acf33ec70b55d889e7dae26a4756accdb/services/minio/docker-compose.yaml.j2
deploy:
  placement:
    constraints:
      - node.labels.rabbit0{{ NODE_INDEX }} == true
This will require a docker labels change in osparc-ops-deployment-configuration and associated PRs I guess :)
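For reference, the corresponding label would typically be set on the swarm node along these lines (the hostname is a placeholder; the real values belong in osparc-ops-deployment-configuration):

# pin rabbit node 1 to a specific swarm node
docker node update --label-add rabbit01=true <swarm-node-hostname>
# verify the label is present
docker node inspect --format '{{ .Spec.Labels }}' <swarm-node-hostname>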
gid: "999"
volumes:
  - rabbit0{{ NODE_INDEX }}_data:/var/lib/rabbitmq
  # TODO: sync with existing rabbit attached networks
minor: not sure what this TODO actually means, I don't fully get it
@@ -0,0 +1,19 @@
{% set NODE_IXS = range(1, (RABBIT_CLUSTER_NODE_COUNT | int) + 1) -%}
minor: can we sync this with how rabbit is configured in osparc-simcore, so that the backend dev's setup mimics the prod one closely?
# haproxy by default resolves server hostname only once
# this breaks if container restarts. By using resolvers
# we tell haproxy to re-resolve the hostname (so container
# restarts are handled properly) |
makes sense, good find
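For illustration, such a resolvers section pointed at Docker's embedded DNS usually looks roughly like this (the resolver name matches the server lines further down; the retry/timeout values are assumptions, not necessarily what this PR uses):

resolvers dockerdns
    # Docker's embedded DNS server inside overlay/user-defined networks
    nameserver dns1 127.0.0.11:53
    resolve_retries 3
    timeout resolve 1s
    timeout retry   1s
    hold valid      10s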
balance roundrobin
option forwardfor
http-request set-header X-Forwarded-Port %[dst_port]
out of curiosity: is there a reason this must be set that you remember? due to HAProxy?
{% for ix in NODE_IXS %}
server rabbit0{{ ix }} rabbit-node0{{ ix }}_rabbit0{{ ix }}:{{ RABBIT_MANAGEMENT_PORT }} check resolvers dockerdns init-addr libc,none inter 5s rise 2 fall 3
{%- endfor %}
# keep new line in the end to avoid "Missing LF on last line" error |
lol
Source: https://www.rabbitmq.com/docs/next/configure#config-changes-effects
## Enable node Maintenance mode |
very good readme, can you write one sentence or link to docs that explain what maintenance mode does?
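For context, RabbitMQ's maintenance mode takes a node out of service before an upgrade or restart: queue leaders are transferred away and the node stops accepting new client connections (see https://www.rabbitmq.com/docs/upgrade). It is toggled roughly like this:

# put the local node into maintenance mode before restarting/upgrading it
rabbitmq-upgrade drain
# bring it back into service afterwards
rabbitmq-upgrade revive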
cpus: "0.1"
memory: "128M"
healthcheck: # https://stackoverflow.com/a/76513320/12124525
  test: bash -c 'echo "" > /dev/tcp/127.0.0.1/32087 || exit 1'
good
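For anyone unfamiliar with the trick: bash interprets /dev/tcp/<host>/<port> as a pseudo-device, so redirecting into it attempts a TCP connection and fails if nothing is listening. The same check can be run by hand (port taken from the compose file above):

# exit status 0 if something accepts connections on 127.0.0.1:32087
bash -c 'echo "" > /dev/tcp/127.0.0.1/32087' && echo up || echo down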
thanks a lot for the huge effort, this is (by design) working around many limitations of docker swarm, but nevertheless I see that you accounted for many pitfalls and issues. It looks promising and robust. Let me know if you need help during the rollout, and I am curious to see if issues pop up or if this "just works" :--)
What do these changes do?
Add standalone RabbitMQ cluster stack.
Next step:
FYI: @pcrespov @GitHK
Related issue/s
Related PR/s
Devops Actions ⚠️
Prerequisites
Checklist
New stack
New service
Service is monitored (via prometheus and grafana) --> to be done in next PR when we switch from rabbit (in simcore stack) to cluster rabbit introduced here
Service is not bound to one specific node (e.g. via files or volumes) --> it is bound because of volumes; no way around it in our docker swarm setup
If exposed via traefik
Service's Public URL is included in maintenance mode --> unrelated
Service's Public URL is included in testing mode --> unrelated
Credentials page is updated --> to be updated in another PR when we switch traffic to this rabbit cluster
Url added to e2e test services (e2e test checking that URL can be accessed) --> to be done when we switch traffic