Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

k8s peer discovery v2 #13050

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

k8s peer discovery v2 #13050

wants to merge 3 commits into from

Conversation

mkuratczyk
Copy link
Contributor

@mkuratczyk mkuratczyk commented Jan 10, 2025

Proposed Changes

Completely different implementation of a peer discovery mechanism for Kubernetes. Rather than querying the Kubernetes API, assume the deployment uses a StatefulSet, as it should and just check the local node name. Then:

  • if the node name's suffix is -0 - form a new cluster
  • in any other case, try to join the -0 node; keep trying forever - only the -0 node can form a new cluster

While it works differently internally, it's completely backwards compatible - existing configuration options are accepted but ignored. Cluster Operator can deploy cluster with these changes and everything works with no changes on the Operator side.

Benefits of this approach:

Drawbacks:

  • existing plugin works perfectly fine for a vast majority of users so there's a risk of introducing significant changes

TODO:

  • discuss this approach
  • consider adding some configurability just in case (allow overriding the node name / seed node name?)
  • test in different Kubernetes environments

@mkuratczyk mkuratczyk force-pushed the k8s-peer-discovery-simplified branch from 7aadc44 to 54eecc1 Compare January 10, 2025 11:23
@mkuratczyk mkuratczyk force-pushed the k8s-peer-discovery-simplified branch 2 times, most recently from 872519e to d153aaf Compare January 20, 2025 09:49
deps/rabbit/src/rabbit_peer_discovery.erl Outdated Show resolved Hide resolved
deps/rabbitmq_peer_discovery_k8s/README.md Outdated Show resolved Hide resolved
deps/rabbitmq_peer_discovery_k8s/README.md Outdated Show resolved Hide resolved
deps/rabbitmq_peer_discovery_k8s/README.md Outdated Show resolved Hide resolved
@mkuratczyk mkuratczyk force-pushed the k8s-peer-discovery-simplified branch from d153aaf to 5a8865b Compare January 22, 2025 07:55
This came up when testing whether all nodes would wait for node 0
could not start. To test this, I was killing node 0 every few seconds,
so it would attempt to start and then get killed. This led to other
nodes sometimes discovering it and attempting to sync, but then crashing
with function_clause as it went away.
Rather than querying the Kubernetes API, just check the local node name
and try to connect to the pod with ID `0` (`-0` suffix). Only the pod
with ID 0 can form a new cluster - all other pods will wait forever.
This should prevent any race conditions and incorrectly formed clusters.
@mkuratczyk mkuratczyk force-pushed the k8s-peer-discovery-simplified branch from 5a8865b to b8d3186 Compare January 22, 2025 09:23
@mkuratczyk mkuratczyk marked this pull request as ready for review January 22, 2025 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants