
Make range sync peer loadbalancing PeerDAS-friendly #6922

Open · wants to merge 13 commits into base: unstable
Conversation

@dapplion dapplion commented Feb 6, 2025

Issue Addressed

Range sync and backfill sync still assume that each batch request is served by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers.

Issues with current unstable:

  • Peer prioritization counts batch requests per peer. This accounting is now broken: data columns by range requests are not counted.
  • Peer selection for data columns by range ignores the set of peers on a syncing chain and instead draws from the global pool of peers.
  • The implementation is very strict when we have no peers to request from. With PeerDAS this case is very common, and we want to be flexible and handle it more gracefully than hard-failing everything.

Proposed Changes

  • Upstream peer prioritization to the network context, which knows exactly how many active requests a peer has (including columns by range).
  • Upstream peer selection to the network context: block_components_by_range_request now gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error RpcRequestSendError::NoPeer.
  • Range sync and backfill sync handle RpcRequestSendError::NoPeer explicitly (see the sketch after this list):
    • Range sync: leaves the batch in AwaitingDownload state and does nothing. TODO: we should have some mechanism to fail the chain if it's stale for too long - EDIT: Not done in this PR
    • Backfill sync: pauses the sync until another peer joins - EDIT: Same logic as unstable
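
A minimal sketch of how that NoPeer handling might branch, with hypothetical stand-in types (the real ones live in Lighthouse's sync code):

```rust
enum NoPeerError { NoCustodyPeers }

enum RpcRequestSendError {
    NoPeer(NoPeerError),
    InternalError(String),
}

enum BatchState { AwaitingDownload, Downloading }

// Range sync: a NoPeer error leaves the batch in AwaitingDownload and does
// nothing else; an InternalError removes (fails) the whole chain.
fn on_send_result(
    result: Result<(), RpcRequestSendError>,
    state: &mut BatchState,
) -> Result<(), String> {
    match result {
        Ok(()) => {
            *state = BatchState::Downloading;
            Ok(())
        }
        Err(RpcRequestSendError::NoPeer(_)) => {
            *state = BatchState::AwaitingDownload;
            Ok(())
        }
        Err(RpcRequestSendError::InternalError(e)) => Err(format!("remove chain: {e}")),
    }
}
```

Backfill sync differs only in that a NoPeer error pauses the whole sync until another peer joins.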

TODOs

  • Add tests :)

Note: this touches the mainnet path!

@dapplion dapplion requested a review from jxs as a code owner February 6, 2025 05:16
@dapplion dapplion added the ready-for-review, syncing, and das (Data Availability Sampling) labels Feb 6, 2025
@dapplion dapplion requested a review from jimmygchen February 6, 2025 05:19

dapplion commented Feb 6, 2025

I would like to wait to merge this PR until we have more test coverage

```rust
}

/// Sends and registers the request of a batch awaiting download.
fn retry_batch_download(
```
Now there's no different prioritization between the first send and a retry: for both we prefer non-failed peers with the fewest active requests.
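
As a rough sketch of that policy, with `u64` standing in for libp2p's `PeerId` and hypothetical bookkeeping maps for the state the network context tracks:

```rust
use std::collections::{HashMap, HashSet};

type PeerId = u64; // stand-in for libp2p's PeerId

/// Prefer peers that have not failed this batch; break ties by fewest
/// active requests, as tracked by the network context.
fn prioritize_peer(
    candidates: &HashSet<PeerId>,
    failed: &HashSet<PeerId>,
    active_requests: &HashMap<PeerId, usize>,
) -> Option<PeerId> {
    candidates
        .iter()
        .min_by_key(|&&peer| {
            (
                failed.contains(&peer),                           // false sorts before true
                active_requests.get(&peer).copied().unwrap_or(0), // fewest in-flight requests
            )
        })
        .copied()
}
```

Tuples compare lexicographically, so a non-failed peer always wins before request counts are even compared.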

```diff
 idle_peers.shuffle(&mut rng);

-while let Some(peer) = idle_peers.pop() {
+loop {
```
If the implementation of include_next_batch is correct we should never loop infinitely here. But it's sus; I worry we could unintentionally introduce a bug here in the future.
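
To make the termination argument concrete, a toy model of the loop shape (a plain `pending` list stands in for `include_next_batch`): every iteration must either consume a batch or break, so if `include_next_batch` ever kept yielding the same batch, this would spin forever.

```rust
// Toy model of the loop: termination relies entirely on the "next batch"
// source shrinking on every successful send.
fn request_batches(mut pending: Vec<u64>) {
    loop {
        // stand-in for include_next_batch(): take the next batch awaiting download
        let Some(batch_id) = pending.pop() else {
            break; // no more batches to request: the only exit
        };
        // stand-in for sending the batch; on success it leaves AwaitingDownload,
        // so it can never be yielded again
        println!("sending batch {batch_id}");
    }
}
```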

```rust
    }
}

pub fn remove_peer(&mut self, peer_id: &PeerId) -> ProcessingResult {
    self.peers.remove(peer_id);
```
No need to track internal requests and fail them explicitly. If a disconnect interrupts an active request, since #6497 we will get an inject_error for that batch.
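
A sketch of the mechanism being relied on, with hypothetical names and stand-in types (the real wiring is in the network context, per #6497):

```rust
use std::collections::HashMap;

type PeerId = u64; // stand-in for libp2p's PeerId
type BatchId = u64;

struct NetworkContext {
    // active by-range requests keyed by the peer serving them
    active_requests: HashMap<PeerId, Vec<BatchId>>,
}

impl NetworkContext {
    // On disconnect, the network context fails the peer's in-flight requests,
    // so sync receives an inject_error for each affected batch; the syncing
    // chain itself only has to drop the peer from its set.
    fn on_peer_disconnected(&mut self, peer_id: &PeerId) {
        for batch_id in self.active_requests.remove(peer_id).unwrap_or_default() {
            self.inject_error(*peer_id, batch_id);
        }
    }

    fn inject_error(&mut self, peer_id: PeerId, batch_id: BatchId) {
        println!("batch {batch_id} failed: peer {peer_id} disconnected");
    }
}
```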

```rust
NoPeer(NoPeerError),
/// These errors should never happen, including unreachable custody errors or network send
/// errors.
InternalError(String),
```
I grouped errors into two variants, since both backfill and forward range sync need to handle the NoPeer case; everything else is treated as an internal error.

```rust
let columns_by_range_peers_to_request =
    if matches!(batch_type, ByRangeRequestType::BlocksAndColumns) {
        let column_indexes = self.network_globals().sampling_columns.clone();
        Some(self.select_columns_by_range_peers_to_request(&column_indexes, peers)?)
```
Select column peers before sending the block request, so if we don't have enough custody peers we error here.
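
A sketch of that selection step, with stand-in types and a hypothetical `custodies` predicate in place of the real custody lookup (the real code presumably also balances load among eligible peers; this only shows the fail-fast shape):

```rust
use std::collections::{HashMap, HashSet};

type PeerId = u64; // stand-in for libp2p's PeerId
type ColumnIndex = u64;

/// Group the sampling columns by a custody peer drawn from the syncing
/// chain's peer set; error before any block request is sent if some column
/// has no custodian among those peers.
fn select_columns_by_range_peers(
    column_indexes: &HashSet<ColumnIndex>,
    peers: &HashSet<PeerId>,
    custodies: impl Fn(PeerId, ColumnIndex) -> bool,
) -> Result<HashMap<PeerId, Vec<ColumnIndex>>, String> {
    let mut by_peer: HashMap<PeerId, Vec<ColumnIndex>> = HashMap::new();
    for &column in column_indexes {
        let peer = peers
            .iter()
            .copied()
            .find(|&peer| custodies(peer, column))
            .ok_or(format!("no custody peer for column {column}"))?;
        by_peer.entry(peer).or_default().push(column);
    }
    Ok(by_peer)
}
```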

```rust
| CustodyRequestError::UnexpectedRequestId { .. }
| CustodyRequestError::SendFailed { .. }) => {
    RpcRequestSendError::InternalError(format!("{e:?}"))
}
```
Coerce CustodyRequestError into InternalError so we don't have to handle the CustodyRequestError enum downstream

dapplion commented Feb 6, 2025

@jimmygchen I have reduced the scope of this PR. I intended to deprecate the good_peers_on_sampling_subnets check in this PR, but it's a very sensitive change. I left that logic untouched; what we do now is:

  • Check if good_peers_on_sampling_subnets
    • If no, don't create batch
    • If yes, create batch and send it
      • If we don't have enough custody peers, error and drop the chain

In the future we can deprecate the good_peers_on_sampling_subnets check by allowing batches to remain in AwaitingDownload state. It's essentially duplicate code, as we check for peers twice. It should also make sync less likely to drop chains, as we did in lookup sync by allowing batches to be peer-less for some time.
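
In sketch form, the current flow looks like this (closures stand in for the real sync methods):

```rust
// Sketch of the flow above: the peer check happens twice, once up front and
// once inside the send path, which is the duplication to remove later.
fn try_create_and_send_batch(
    good_peers_on_sampling_subnets: impl Fn() -> bool,
    create_batch: impl Fn() -> u64,
    send_batch: impl Fn(u64) -> Result<(), String>,
) -> Result<(), String> {
    if !good_peers_on_sampling_subnets() {
        return Ok(()); // if no: don't create the batch, retry later
    }
    // if yes: create the batch and send it
    let batch_id = create_batch();
    // if there aren't enough custody peers after all: error and drop the chain
    send_batch(batch_id).map_err(|e| format!("dropping chain: {e}"))
}
```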

I added a TODO(das) to tackle this in another PR:

```rust
// TODO(das): Handle the NoPeer case explicitly and don't drop the batch. For
// sync to work properly it must be okay to have "stalled" batches in
// AwaitingDownload state. Currently it will error with invalid state if
// that happens. Sync manager must periodically prune stalled batches like
// we do for lookup sync. Then we can deprecate the redundant
// `good_peers_on_sampling_subnets` checks.
```
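
The pruning that TODO describes might look roughly like this (entirely hypothetical; not part of this PR):

```rust
use std::time::{Duration, Instant};

const STALE_BATCH_TIMEOUT: Duration = Duration::from_secs(60); // hypothetical value

struct StalledBatch {
    batch_id: u64,
    awaiting_download_since: Instant,
}

// Periodically prune batches stuck in AwaitingDownload, as lookup sync does,
// instead of erroring the moment no peer is available.
fn prune_stalled_batches(batches: &mut Vec<StalledBatch>) {
    batches.retain(|batch| {
        let stalled = batch.awaiting_download_since.elapsed() > STALE_BATCH_TIMEOUT;
        if stalled {
            println!("pruning stalled batch {}", batch.batch_id);
        }
        !stalled
    });
}
```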

mergify bot pushed a commit that referenced this pull request Feb 11, 2025
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter, `good_peers_on_sampling_subnets`. Another PR, #6922, deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the spawned metrics task, which refreshes every 5 seconds (see the sketch after the list below).

I also added a few more useful metrics. The total set and intended usage is:

- `sync_peers_per_column_subnet`: Track health of overall subnets in your node
- `sync_peers_per_custody_column_subnet`: Track health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
- ~~`sync_column_subnets_with_zero_peers`: Equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric, since if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
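
A sketch of the refresh loop described above, with hypothetical names (assumes a tokio runtime and some source of per-subnet peer counts):

```rust
use std::collections::HashMap;
use std::time::Duration;

type SubnetId = u64;

// Recompute peers-per-column-subnet every 5 seconds in a spawned task,
// rather than inside a getter on the hot path.
async fn metrics_refresh_loop(peer_counts: impl Fn() -> HashMap<SubnetId, usize>) {
    let mut interval = tokio::time::interval(Duration::from_secs(5));
    loop {
        interval.tick().await;
        for (subnet, count) in peer_counts() {
            // stand-in for the real gauge update, e.g. something like
            // metrics::set_gauge_vec(&SYNC_PEERS_PER_COLUMN_SUBNET, ...)
            println!("sync_peers_per_column_subnet{{subnet=\"{subnet}\"}} {count}");
        }
    }
}
```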