
Unable to add new node to cluster #476

Open
Ziris85 opened this issue Nov 25, 2024 · 8 comments
Labels: question (Further information is requested)

Comments

Ziris85 commented Nov 25, 2024

Issue report

What version of MicroCeph are you using?

:~# snap list microceph
Name       Version                Rev   Tracking      Publisher   Notes
microceph  19.2.0+snap9aeaeb2970  1228  squid/stable  canonical✓  held

What are the steps to reproduce this issue?

  1. On existing cluster member: microceph cluster add ceph-member
  2. On new member: microceph cluster join <token>

What happens (observed behaviour)?

Error: Failed to join cluster: Ready dqlite: context deadline exceeded

What were you expecting to happen?

Member joins cluster successfully

Relevant logs, error output, etc.

These logs are on the new host:

2024-11-24T23:50:45Z microceph.daemon[13869]: time="2024-11-24T23:50:45Z" level=error msg="PostRefresh failed: version equality check failed: failed to get Ceph versions: Failed to run: ceph versions: exit status 1 (Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)'))"
2024-11-24T23:50:55Z microceph.daemon[13869]: time="2024-11-24T23:50:55Z" level=debug msg="start: database not ready, waiting..."
2024-11-24T23:51:05Z microceph.daemon[13869]: time="2024-11-24T23:51:05Z" level=debug msg="start: database not ready, waiting..."
2024-11-24T23:51:15Z microceph.daemon[13869]: time="2024-11-24T23:51:15Z" level=debug msg="start: database not ready, waiting..."
...etc...

Additional comments.

This issue may be related in some fashion to #444 and #473. I've had a MicroCeph cluster for a while and have upgraded it a few times (quincy to reef, and just today to squid). After my upgrade to reef I lost the ability to add new nodes to the cluster. At the time (maybe a year ago) it wasn't a huge deal, so I left it alone, figuring the issue could either be worked out easily enough or was a bug that would get fixed in due time. Fast forward to today: I'm still unable to join new nodes, and it's now blocking maintenance on the cluster. After doing some research and seeing the PR in #444, I hoped that upgrading to squid would pull in a fix and let me get going, but sadly it hasn't helped. It did change the behaviour, however: on reef, when the join failed with the context deadline exceeded message, the new host wouldn't show up in the cluster list; now the host shows up anyway and just loops forever about the database not being ready.
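
(Side note: the phantom entry can presumably be cleared before retrying a join; a minimal sketch, assuming the cluster remove subcommand is available in this channel and using a placeholder node name:)

# On an existing, healthy member: drop the half-joined node so a fresh
# token/join can be attempted ("ceph-member" is a placeholder name)
microceph cluster remove ceph-member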

In the logs from the dqlite leader while joining, I see a flurry of activity, but all it's showing is:

2024-11-24T18:43:47-05:00 microceph.daemon[3355246]: time="2024-11-24T18:43:47-05:00" level=debug msg="Trusting HTTP request to \"/core/internal/database\" from \"192.168.1.158:36802\" with fingerprint \"7c7dea613ef099c017412eff3c86dba0452aed7f5d616b73c2c548ce8173ffe0\""
2024-11-24T18:43:47-05:00 microceph.daemon[3355246]: time="2024-11-24T18:43:47-05:00" level=debug msg="Dqlite connected outbound" local="192.168.1.3:41572" remote="192.168.1.158:7443"
2024-11-24T18:43:47-05:00 microceph.daemon[3355246]: time="2024-11-24T18:43:47-05:00" level=debug msg="Trusting HTTP request to \"/core/internal/database\" from \"192.168.1.158:36822\" with fingerprint \"7c7dea613ef099c017412eff3c86dba0452aed7f5d616b73c2c548ce8173ffe0\""

2024-11-24T18:43:54-05:00 microceph.daemon[3355246]: time="2024-11-24T18:43:54-05:00" level=debug msg="Dqlite connected outbound" local="192.168.1.3:55812" remote="192.168.1.158:7443"
2024-11-24T18:43:54-05:00 microceph.daemon[3355246]: time="2024-11-24T18:43:54-05:00" level=debug msg="Got raw response struct from microcluster daemon" endpoint="https://192.168.1.158:7443/core/1.0/ready" method=GET
2024-11-24T18:43:54-05:00 microceph.daemon[3355246]: time="2024-11-24T18:43:54-05:00" level=debug msg="Got raw response struct from microcluster daemon" endpoint="https://192.168.1.158:7443/core/internal/heartbeat" method=POST
2024-11-24T18:43:54-05:00 microceph.daemon[3355246]: time="2024-11-24T18:43:54-05:00" level=error msg="Received error sending heartbeat to cluster member" error="Database is still starting" target="192.168.1.158:7443"
2024-11-24T18:44:17-05:00 microceph.daemon[3355246]: time="2024-11-24T18:44:17-05:00" level=debug msg="Dqlite connected outbound" local="192.168.1.3:42002" remote="192.168.1.158:7443"
2024-11-24T18:44:17-05:00 microceph.daemon[3355246]: time="2024-11-24T18:44:17-05:00" level=debug msg="Dqlite connected outbound" local="192.168.1.3:42012" remote="192.168.1.158:7443"
2024-11-24T18:44:18-05:00 microceph.daemon[3355246]: time="2024-11-24T18:44:18-05:00" level=debug msg="Dqlite connected outbound" local="192.168.1.3:42014" remote="192.168.1.158:7443"
2024-11-24T18:44:18-05:00 microceph.daemon[3355246]: time="2024-11-24T18:44:18-05:00" level=debug msg="Dqlite connected outbound" local="192.168.1.3:42016" remote="192.168.1.158:7443"
2024-11-24T18:44:19-05:00 microceph.daemon[3355246]: time="2024-11-24T18:44:19-05:00" level=debug msg="Dqlite connected outbound" local="192.168.1.3:42026" remote="192.168.1.158:7443"
2024-11-24T18:44:19-05:00 microceph.daemon[3355246]: time="2024-11-24T18:44:19-05:00" level=debug msg="Dqlite connected outbound" local="192.168.1.3:42030" remote="192.168.1.158:7443"

It'll eventually switch from saying that the database is still starting to saying that it's offline:

2024-11-24T18:44:25-05:00 microceph.daemon[3355246]: time="2024-11-24T18:44:25-05:00" level=debug msg="Got raw response struct from microcluster daemon" endpoint="https://192.168.1.158:7443/core/internal/heartbeat" method=POST
2024-11-24T18:44:25-05:00 microceph.daemon[3355246]: time="2024-11-24T18:44:25-05:00" level=error msg="Received error sending heartbeat to cluster member" error="Database is offline" target="192.168.1.158:7443"

There'll be another flurry of "Dqlite connected outbound" messages to the new host's IP until the join command on the new host gives up with the context deadline exceeded message, at which point the existing cluster's logs show the connection starting to fail:

2024-11-24T18:44:45-05:00 microceph.daemon[3355246]: time="2024-11-24T18:44:45-05:00" level=warning msg="Failed to get status of cluster member with address \"https://192.168.1.158:7443\": Get \"https://192.168.1.158:7443/core/1.0/ready\": Unable to connect to \"192.168.1.158:7443\": dial tcp 192.168.1.158:7443: connect: connection refused"

The new host's logs then go into the loop of the database not being ready, and the cluster can no longer talk to it. Upgrading to squid was at least helpful in getting slightly more useful logs; on reef there was virtually nothing.
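
(For reference, the daemon output above comes from the microceph daemon snap service; it can presumably be tailed on any node with the standard snapd/journald commands, e.g.:)

# Tail recent microceph daemon logs on a node
snap logs -n 100 microceph.daemon

# Or follow them via journald (snap services are exposed as snap.<snap>.<app>)
journalctl -fu snap.microceph.daemon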

I REALLY need to be able to add new hosts to this cluster since I need to take some of the existing ones out for a rebuild. Not only is this preventing me from doing that, but it makes me nervous that I'd be unable to rejoin the hosts to the cluster when they're ready to go.

Let me know if you'd like more details on this! Thanks in advance!

slapcat commented Dec 4, 2024

+1

slapcat commented Dec 4, 2024

I was just able to work around the issue by generating the join token from another node in the cluster. Even though there aren't any connectivity issues between the original node and the new one I tried joining (they're on a physical 40G switch), this somehow worked.
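
Roughly, with placeholder node names (the only change from the normal flow is which member generates the token):

# On a different existing member than the one that was failing
# (e.g. node-1 instead of node-0; names are placeholders):
microceph cluster add new-node

# Then on the new host, join with the token printed above:
microceph cluster join <token>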

@adam-vest

Thanks for the suggestion, Jake! I just gave that a try, but in my case it doesn't matter which node I generate the token from; they all result in the same failure. All hosts are on the same network here too, and there are no connectivity issues between any of them (no firewalls anywhere, for now).

@adam-vest

With some encouragement from my colleague, I decided to test the bounds of this. I spun up and bootstrapped a fresh 3-node MicroCeph cluster, which went swimmingly. I added some loopback OSDs, though I'm unsure whether that was necessary. Anyway, adding the 4th node fails in the same way as on my existing cluster, so this seems to be fairly easy to reproduce.

sabaini (Collaborator) commented Jan 15, 2025

Hey @adam-vest, I'm having trouble reproducing this. I've used the following to create a 4-node cluster (using LXD):

channel=squid/stable
base=24.04

for n in node-0 node-1 node-2 node-3 ; do
  lxc launch ubuntu:$base $n --vm 
done

sleep 30


for n in node-0 node-1 node-2 node-3 ; do
    lxc exec $n -- sh -c "sudo snap install microceph --channel $channel"    
done

lxc exec node-0 -- sh -c "microceph cluster bootstrap"
sleep 10

for n in node-1 node-2 node-3 ; do
    tok=$(lxc exec node-0 -- sh -c "microceph cluster add $n" )
    lxc exec $n -- sh -c "microceph cluster join $tok"
    sleep 10
done

for n in node-0 node-1 node-2 node-3 ; do
  lxc exec $n -- sh -c "microceph disk add loop,2G,2"    
done

This resulted in a healthy cluster of 4 nodes and 8 OSDs, running squid/stable (ceph-version: 19.2.0-0ubuntu0.24.04.1; microceph-git: 9aeaeb2) on noble.

Could you post some details on how this was set up?
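
For example, output from each node along these lines would help (all stock microceph commands):

microceph status         # cluster summary: nodes, services and disks
microceph cluster list   # members and addresses as seen by microcluster
microceph disk list      # OSD disks configured on this node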

sabaini added the question (Further information is requested) label on Jan 15, 2025

sabaini (Collaborator) commented Jan 15, 2025

@Ziris85 which MicroCeph versions are currently running in your cluster? Generally we strongly recommend getting all nodes onto the same MicroCeph version, as schema differences between nodes might stop microcephd.
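
For example, on each node (ceph here being the client bundled with the snap, also reachable as microceph.ceph):

snap list microceph   # snap version, revision and channel on this node
ceph versions         # per-daemon Ceph versions reported by the cluster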


fzhan commented Feb 12, 2025

My observations:

  1. After the node tries to join the cluster, running microceph cluster list on the existing cluster brings up a table that includes the newly joined node.
  2. However, the node's IP address shows as something like 10.0.1.1, which is not reachable because the other nodes are on 192.*.*.*.
  3. The node drops out of the list after several tries.
  4. I've also noticed the token never really changes: after a failure, the regenerated token stays the same.

Hope that helps with debugging.

Perhaps it's related to the database not being synced across the existing nodes?
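
In case it helps narrow that down, the address each member registered with can be compared against what is actually configured on the host, e.g.:

microceph cluster list   # address each member is registered with
ip -br addr              # addresses actually present on this host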

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/CEPH-1171.

This message was autogenerated
