Make range sync peer loadbalancing PeerDAS-friendly #6922
base: unstable
Conversation
I would like to wait to merge this PR first to have more test coverage.
Force-pushed from 1debbbf to 4792275.
}

/// Sends and registers the request of a batch awaiting download.
fn retry_batch_download(
Now there's no different prioritization between the first send and a retry: for both we prefer non-failed peers with the fewest active requests.
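For illustration, a minimal sketch of that single prioritization rule (the types and field names below are placeholders, not the actual Lighthouse ones):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in for libp2p's PeerId.
type PeerId = u64;

/// Pick a peer for a batch: prefer peers that have not failed this batch
/// before, and among those, the one with the fewest active requests.
fn pick_peer(
    peers: &HashSet<PeerId>,
    failed_peers: &HashSet<PeerId>,
    active_requests: &HashMap<PeerId, usize>,
) -> Option<PeerId> {
    peers
        .iter()
        .min_by_key(|&&peer| {
            let has_failed = failed_peers.contains(&peer);
            let load = active_requests.get(&peer).copied().unwrap_or(0);
            // Sort key: non-failed peers first (false < true), then least load.
            (has_failed, load)
        })
        .copied()
}
```

Because the same rule applies to first sends and retries, there is no separate retry code path to keep in sync.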
idle_peers.shuffle(&mut rng);

while let Some(peer) = idle_peers.pop() {
loop {
If the implementation of `include_next_batch` is correct we should never infinite-loop here. But it's sus; I worry we could unintentionally introduce a bug here in the future.
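One defensive option (purely illustrative, not the PR's code; the helper names are made up) is to bound the loop so a regression in `include_next_batch` surfaces as an error rather than a spin:

```rust
const MAX_BATCHES_PER_CALL: usize = 1024;

/// Keep requesting batches until `next_batch` returns None, but never loop
/// more than MAX_BATCHES_PER_CALL times.
fn send_pending_batches(
    mut next_batch: impl FnMut() -> Option<u64>,
    mut send_batch: impl FnMut(u64),
) -> Result<(), String> {
    for _ in 0..MAX_BATCHES_PER_CALL {
        match next_batch() {
            Some(batch_id) => send_batch(batch_id),
            // Normal exit: nothing left to request.
            None => return Ok(()),
        }
    }
    // If we get here, next_batch never returned None: likely a bug.
    Err("next_batch did not terminate".to_string())
}
```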
    }
}

pub fn remove_peer(&mut self, peer_id: &PeerId) -> ProcessingResult {
    self.peers.remove(peer_id);
No need to track internal requests and fail them explicitly. If a disconnect interrupts an active request, since #6497 we will get an `inject_error` for that batch.
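A sketch of the mechanism being relied on (hypothetical types; the real flow goes through the network service and the request tracking introduced in #6497):

```rust
use std::collections::HashMap;

type PeerId = u64;
type BatchId = u64;

/// Network-side bookkeeping of which batches each peer is currently serving.
struct ActiveRequests {
    by_peer: HashMap<PeerId, Vec<BatchId>>,
}

impl ActiveRequests {
    /// On disconnect, every request the peer was serving is reported back to
    /// sync as an error, so sync's `remove_peer` only needs to drop the peer
    /// from its peer set.
    fn on_disconnect(&mut self, peer: PeerId, mut inject_error: impl FnMut(BatchId)) {
        for batch_id in self.by_peer.remove(&peer).unwrap_or_default() {
            inject_error(batch_id);
        }
    }
}
```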
NoPeer(NoPeerError),
/// These errors should never happen, including unreachable custody errors or network send
/// errors.
InternalError(String),
I grouped the errors into two variants, since both backfill and forward range sync need to handle the `NoPeer` case; everything else is treated as an internal error.
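A sketch of how a caller (backfill or forward range sync) can then handle the result, assuming the two-variant shape (simplified: `NoPeerError` is replaced by a `String` here):

```rust
/// Simplified version of the grouped error type.
enum RpcRequestSendError {
    /// No peer available: the batch should stay in AwaitingDownload and be
    /// retried once suitable peers show up.
    NoPeer(String),
    /// Unexpected failures (custody errors, network send errors, ...).
    InternalError(String),
}

fn handle_send_result(result: Result<(), RpcRequestSendError>) {
    match result {
        Ok(()) => {}
        Err(RpcRequestSendError::NoPeer(reason)) => {
            // Expected under PeerDAS when custody peers are scarce.
            println!("no peer available for batch: {reason}");
        }
        Err(RpcRequestSendError::InternalError(e)) => {
            // Should never happen; log loudly.
            eprintln!("internal sync error: {e}");
        }
    }
}
```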
let columns_by_range_peers_to_request =
    if matches!(batch_type, ByRangeRequestType::BlocksAndColumns) {
        let column_indexes = self.network_globals().sampling_columns.clone();
        Some(self.select_columns_by_range_peers_to_request(&column_indexes, peers)?)
Select column peers before sending the block request, so if we don't have enough custody peers we error here, before anything is sent.
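The fail-early ordering as a standalone sketch (function names are illustrative, not the actual Lighthouse API):

```rust
/// Resolve custody-column peers first, so that a missing custody peer aborts
/// the whole batch before any request is sent.
fn send_batch(
    needs_columns: bool,
    select_column_peers: impl Fn() -> Result<Vec<u64>, String>,
    send_block_request: impl Fn() -> Result<(), String>,
    send_column_requests: impl Fn(&[u64]) -> Result<(), String>,
) -> Result<(), String> {
    // 1. Pick custody-column peers up front; an error here means nothing was sent.
    let column_peers = if needs_columns {
        Some(select_column_peers()?)
    } else {
        None
    };
    // 2. Only now send the block request.
    send_block_request()?;
    // 3. Then the column requests to the selected peers.
    if let Some(peers) = column_peers {
        send_column_requests(&peers)?;
    }
    Ok(())
}
```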
| CustodyRequestError::UnexpectedRequestId { .. }
| CustodyRequestError::SendFailed { .. }) => {
    RpcRequestSendError::InternalError(format!("{e:?}"))
}
Coerce `CustodyRequestError` into `InternalError` so we don't have to handle the `CustodyRequestError` enum downstream.
@jimmygchen I have reduced the scope of this PR. I intended to deprecate the check; in the future we can deprecate it. I added a check at lighthouse/beacon_node/network/src/sync/range_sync/chain.rs, lines 913 to 918 (1debbbf).
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in the getter `good_peers_on_sampling_subnets`. Another PR #6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the spawned metrics task, which is refreshed every 5 seconds. I also added a few more useful metrics. The total set and intended usage is:

- `sync_peers_per_column_subnet`: Track the health of all column subnets from your node's view.
- `sync_peers_per_custody_column_subnet`: Track the health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and a bar plot.
- ~~`sync_column_subnets_with_zero_peers`: Equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric, since if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
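A rough sketch of what the refresh loop could look like using the `prometheus` and `tokio` crates (the gauge name matches the metric above, but the peer-count source and the function itself are stand-ins, not Lighthouse's actual metrics code):

```rust
use prometheus::{IntGaugeVec, Opts};
use std::collections::HashMap;
use std::time::Duration;

/// Recompute the per-column-subnet peer counts every 5 seconds and publish
/// them as a labelled gauge.
async fn metrics_refresh_loop(peer_counts: impl Fn() -> HashMap<u64, i64>) {
    let gauge = IntGaugeVec::new(
        Opts::new("sync_peers_per_column_subnet", "Peers per column subnet"),
        &["column_subnet"],
    )
    .expect("valid metric definition");

    let mut interval = tokio::time::interval(Duration::from_secs(5));
    loop {
        interval.tick().await;
        for (subnet, count) in peer_counts() {
            let label = subnet.to_string();
            gauge.with_label_values(&[label.as_str()]).set(count);
        }
    }
}
```

The struck-out zero-peer counts can then be derived in Grafana from this gauge rather than exported as separate metrics.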
Issue Addressed
Range sync and backfill sync still assume that each batch request is served by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers.
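To make the shape of the change concrete, here is a hypothetical before/after of what a batch request is tied to (types are placeholders, not Lighthouse's actual structs):

```rust
use std::collections::HashMap;

type PeerId = u64;
type ColumnIndex = u64;

/// Pre-PeerDAS assumption: one batch request is served by exactly one peer.
struct LegacyBatchRequest {
    peer: PeerId,
}

/// With PeerDAS: the blocks come from one peer, but the custody columns of
/// the same batch may be requested from several different peers.
struct PeerDasBatchRequest {
    block_peer: PeerId,
    column_peers: HashMap<ColumnIndex, PeerId>,
}
```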
Issues with current unstable:
Proposed Changes
- `block_components_by_range_request` gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer`.
- Handle `RpcRequestSendError::NoPeer` explicitly: the batch stays in the `AwaitingDownload` state and does nothing. TODO: we should have some mechanism to fail the chain if it's stale for too long. EDIT: Not done in this PR.

TODOs
Note: this touches the mainnet path!