Unable to add new node to cluster #476
+1
I was just able to work around the issue by generating the join token from another node in the cluster. Even though there aren't any connectivity issues between the original node and the new one I tried joining (they are on a physical 40G switch), this somehow worked.
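For anyone following along, the workaround in concrete terms is just running the `add` step on a different member; hostnames below are placeholders (a sketch, not a verified transcript):

```bash
# On an existing cluster member OTHER than the one originally used
# to generate the token -- this prints a fresh join token:
sudo microceph cluster add new-node

# On the new host, redeem that token:
sudo microceph cluster join <token-from-above>
```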
Thanks for the suggestion, Jake! I just gave that a try, but in my case it doesn't matter which node I generate the token from; they all result in the same failure. All hosts are on the same network here too, and there are no connectivity issues between any of them (no firewalls anywhere, for now).
With some encouragement from my colleague, I decided to test the bounds of this. I spun up and bootstrapped a fresh 3-node microceph cluster, which went swimmingly. I added some loopback OSDs, though I'm unsure if that was necessary. Anyway, adding the 4th node fails in the same way as on my existing cluster, so this seems to be fairly easily reproducible.
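(For reference, loopback OSDs can be added with MicroCeph's `loop,<size>,<count>` disk spec; treat the size and count here as an illustrative sketch:)

```bash
# On each node: add one 4G file-backed loopback OSD
sudo microceph disk add loop,4G,1
```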
Hey @adam-vest, I'm having trouble reproducing this. I've used the following to create a 4-node cluster (using LXD):
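(The original snippet didn't survive extraction; a minimal sketch of such a setup, with container names, image, and disk sizes as assumptions, might look like:)

```bash
#!/bin/bash
set -e

# Launch four noble containers and install MicroCeph from squid/stable
for n in node1 node2 node3 node4; do
    lxc launch ubuntu:24.04 "$n"
    lxc exec "$n" -- snap install microceph --channel squid/stable
done

# Bootstrap the cluster on the first node
lxc exec node1 -- microceph cluster bootstrap

# For each remaining node: generate a token on node1, redeem it on the joiner
for n in node2 node3 node4; do
    token=$(lxc exec node1 -- microceph cluster add "$n")
    lxc exec "$n" -- microceph cluster join "$token"
done

# Two loopback OSDs per node -> 8 OSDs total
for n in node1 node2 node3 node4; do
    lxc exec "$n" -- microceph disk add loop,4G,2
done
```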
This resulted in a healthy cluster of 4 nodes and 8 OSDs, running squid/stable (ceph-version: 19.2.0-0ubuntu0.24.04.1; microceph-git: 9aeaeb2) on noble. Could you post some details on how this was set up?
@Ziris85 which MicroCeph versions are currently running in your cluster? Generally we strongly recommend getting all nodes onto the same MicroCeph version, as schema differences between nodes might stop microcephd from working correctly.
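(A quick way to compare is to check the installed snap on every member; the version and channel should match across nodes:)

```bash
# Run on each cluster member and compare the Rev/Channel columns
snap list microceph
```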
My observation is:
Hope that helps with debugging. Perhaps it's related to the database not being synced across the existing nodes?
Thank you for reporting your feedback to us! The internal ticket has been created: https://warthogs.atlassian.net/browse/CEPH-1171.
Issue report
What version of MicroCeph are you using?
What are the steps to reproduce this issue?
```
microceph cluster add ceph-member
microceph cluster join <token>
```
What happens (observed behaviour)?
What were you expecting to happen?
The member joins the cluster successfully.
Relevant logs, error output, etc.
These logs are on the new host:
Additional comments.
This issue may be related in some fashion to #444 and #473. I've had a microceph cluster for a while and have upgraded it a few times (quincy to reef, and just today to squid), and after my upgrade to reef I lost the ability to add new nodes to my cluster. At the time (maybe a year ago) it wasn't a huge deal and I just left it alone, figuring the issue could either be figured out easily enough or it was a bug that'd get fixed in due time. Fast forward to today: I'm still unable to join new nodes to my cluster, and it's now blocking me from doing maintenance on my cluster. After doing some research and seeing the PR in #444, I hoped that upgrading my cluster to squid would pull in a fix and let me get going with this, though sadly it hasn't helped. It did change the behavior, however: whereas before, in reef, when the join failed with the `context deadline exceeded` message the new host wouldn't show up in the cluster list, now the host shows up anyway and just starts looping about the database not being ready, ad infinitum.

In the logs from the dqlite leader while joining, I see a flurry of activity, but all it's showing is:
It'll eventually switch from saying that the database is still starting to saying that it's offline:
There'll be another flurry of `Dqlite connected outbound` messages to the new host's IP until the join command on the new host gives up with the `context deadline exceeded` message, and at that point in the existing cluster's logs we see the connection start failing:

The new host's logs then go into the loop of the database not being ready and the cluster not being able to talk to it. Upgrading to squid was at least helpful in getting some slightly more useful logs; on reef there was virtually nothing.
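(In case it helps anyone reproduce: logs like the above can be gathered by tailing the MicroCeph snap service on each host, along these lines; the `microceph.daemon` service name is an assumption based on the standard snap layout:)

```bash
# Follow the daemon logs on both the dqlite leader and the joining host
sudo snap logs -f microceph.daemon
```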
I REALLY need to be able to add new hosts to this cluster since I need to take some of the existing ones out for a rebuild. Not only is this preventing me from doing that, but it makes me nervous that I'd be unable to rejoin the hosts to the cluster when they're ready to go.
Let me know if you'd like more details on this! Thanks in advance!