
[RW Separation] Search replica recovery flow breaks when search shard allocated to new node after node drop #17334

Open
vinaykpud opened this issue Feb 12, 2025 · 3 comments
Labels: enhancement, Search:Performance

Comments

@vinaykpud
Contributor

vinaykpud commented Feb 12, 2025

Is your feature request related to a problem? Please describe

Context:
Created a 5-node cluster and an index with 1 primary and 1 search replica.

ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role node.roles      cluster_manager name
172.18.0.3           26          90   5    2.19    1.84     1.69 d         data            -               opensearch-node5
172.18.0.4           23          90   5    2.19    1.84     1.69 -         coordinating    -               opensearch-node1
172.18.0.2           27          90   6    2.19    1.84     1.69 d         data            -               opensearch-node3
172.18.0.5           23          90   5    2.19    1.84     1.69 d         data            -               opensearch-node4
172.18.0.6           30          90   5    2.19    1.84     1.69 m         cluster_manager *               opensearch-node2

The shard assignment was as follows:

index    shard prirep state   docs store ip         node
products 0     p      STARTED    0  230b 172.18.0.3 opensearch-node5
products 0     s      STARTED    0  230b 172.18.0.2 opensearch-node3

To simulate a node drop, since I am running the cluster locally with Docker, I stopped node3 (a plain docker stop of its container). The shard assignment then looked like this:

index    shard prirep state      docs store ip         node
products 0     p      STARTED       0  230b 172.18.0.3 opensearch-node5
products 0     s      UNASSIGNED

After about 1 minute (the default index.unassigned.node_left.delayed_timeout), the AllocationService tries to allocate the search shard to node4, and it fails with the exception below:

2025-02-09 13:07:53 "stacktrace": ["org.opensearch.indices.recovery.RecoveryFailedException: [products3][0]: Recovery failed on {opensearch-node8}{eHuGysErRFuGUCFO2KxGuw}{Wby_7fTEToWk5bnavKBlbA}{172.18.0.9}{172.18.0.9:9300}{dimr}{zone=zone3, shard_indexing_pressure_enabled=true}",
2025-02-09 13:07:53 "at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$32(IndexShard.java:3902) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$10(StoreRecovery.java:618) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:347) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:123) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2919) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:994) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]",
2025-02-09 13:07:53 "at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]",
2025-02-09 13:07:53 "at java.base/java.lang.Thread.run(Thread.java:1575) [?:?]",
2025-02-09 13:07:53 "Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed to fetch index version after copying it over",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:717) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "... 8 more",
2025-02-09 13:07:53 "Caused by: org.opensearch.index.shard.IndexShardRecoveryException: shard allocated for local recovery (post api), should exist, but doesn't, current files: []",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:702) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "... 8 more",
2025-02-09 13:07:53 "Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(HybridDirectory@/usr/share/opensearch/data/nodes/0/indices/PFeBY9eRTaKcRhDKEM2WAQ/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@52539624)): files: []",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:808) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:764) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:542) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:526) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.opensearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:135) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.store.Store.readSegmentsInfo(Store.java:255) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:237) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:692) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "... 8 more"] }

Describe the solution you'd like

This happens because in ShardRouting,

when moveToUnassigned is called, the recoverySource is set to ExistingStoreRecoverySource for the search replica.
Since this scenario involves recovering the shard on another node, there are no files in the local store to recover from, and recovery fails with the exception above. The proposed solution is to always use EmptyStoreRecoverySource for search replicas.
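A minimal sketch of that idea (illustrative only, not the actual moveToUnassigned implementation; it assumes a ShardRouting#isSearchOnly() accessor is available to identify search replicas):

```java
// Hypothetical sketch of the proposed behaviour, NOT the actual OpenSearch code path.
// Assumes ShardRouting#isSearchOnly() identifies search replicas.
import org.opensearch.cluster.routing.RecoverySource;
import org.opensearch.cluster.routing.ShardRouting;

final class SearchReplicaRecoverySourceHelper {

    /**
     * Chooses the recovery source for a shard being moved to UNASSIGNED.
     * A search replica never copies segments from the primary, so if it lands on a
     * node with no local files, ExistingStoreRecoverySource fails with
     * "no segments* file found in store". Starting from an empty store avoids that.
     */
    static RecoverySource recoverySourceOnUnassign(ShardRouting shard, RecoverySource current) {
        if (shard.isSearchOnly()) {
            return RecoverySource.EmptyStoreRecoverySource.INSTANCE;
        }
        // Non-search shards keep whatever the routing logic already chose.
        return current;
    }
}
```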

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response

@mch2
Member

mch2 commented Feb 13, 2025

@vinaykpud This will impact cases where a SR node is restarted. Can you check our recovery logic to see if that's the case? I.e., we still want to diff local segments and ensure we only fetch what's required to recover.

@vinaykpud
Contributor Author

Added an integ test reproducing this: d89c1cf

@vinaykpud
Copy link
Contributor Author

vinaykpud commented Feb 14, 2025

@mch2 Yes. If the SR node restarts, we should consider loading the available local files instead of starting with an empty directory. In this case, the node restart causes the search replica to become unassigned. When the node comes back up, the allocator attempts to reassign the search replica to the same node. Since the shard was previously assigned to this node, it should already have the necessary files. Therefore, if any local files exist, we should load them.
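A rough illustration of that check, reusing Store#readLastCommittedSegmentsInfo() (the method already visible in the stack trace above); where exactly this decision would live in the real recovery path is an assumption:

```java
// Illustrative only: decide whether a search replica can reuse its local store
// (e.g. after its node restarted) or must start from an empty one (fresh node).
// The exact hook point in the recovery path is an assumption.
import java.io.IOException;

import org.apache.lucene.index.IndexNotFoundException;
import org.opensearch.cluster.routing.RecoverySource;
import org.opensearch.index.store.Store;

final class SearchReplicaLocalStoreCheck {

    static RecoverySource chooseRecoverySource(Store store) throws IOException {
        try {
            // Throws IndexNotFoundException if there is no segments_N file on disk.
            store.readLastCommittedSegmentsInfo();
            // A local commit exists: recover from the existing store so we only
            // need to fetch the segments that are missing locally.
            return RecoverySource.ExistingStoreRecoverySource.INSTANCE;
        } catch (IndexNotFoundException e) {
            // Nothing on disk (fresh node): start from an empty store instead of failing.
            return RecoverySource.EmptyStoreRecoverySource.INSTANCE;
        }
    }
}
```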
