
Make range sync peer loadbalancing PeerDAS-friendly #6922

Open · wants to merge 13 commits into base: unstable
Conversation

@dapplion dapplion commented Feb 6, 2025

Issue Addressed

Range sync and backfill sync still assume that each batch request is served by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers.

Issues with current unstable:

  • Peer prioritization counts batch requests per peer. This accounting is now broken: data columns by range requests are not counted.
  • Peer selection for data columns by range ignores the set of peers on a syncing chain and instead draws from the global pool of peers.
  • The implementation is very strict when we have no peers to request from. With PeerDAS this case is very common, and we want to be flexible and handle it more gracefully than hard-failing everything.

Proposed Changes

  • Upstream peer prioritization to the network context, which knows exactly how many active requests a peer has (including columns by range).
  • Upstream peer selection to the network context: block_components_by_range_request now gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error RpcRequestSendError::NoPeer.
  • Range sync and backfill sync handle RpcRequestSendError::NoPeer explicitly (see the sketch after this list):
    • Range sync: leaves the batch in AwaitingDownload state and does nothing. TODO: we should have some mechanism to fail the chain if it's stale for too long - EDIT: Not done in this PR
    • Backfill sync: pauses the sync until another peer joins - EDIT: Same logic as unstable
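
A minimal sketch of how that NoPeer handling might branch, with hypothetical stand-in types (the real ones live in Lighthouse's sync code):

```rust
enum NoPeerError { NoCustodyPeers }

enum RpcRequestSendError {
    NoPeer(NoPeerError),
    InternalError(String),
}

enum BatchState { AwaitingDownload, Downloading }

// Range sync: a NoPeer error leaves the batch in AwaitingDownload and does
// nothing else; an InternalError removes (fails) the whole chain.
fn on_send_result(
    result: Result<(), RpcRequestSendError>,
    state: &mut BatchState,
) -> Result<(), String> {
    match result {
        Ok(()) => {
            *state = BatchState::Downloading;
            Ok(())
        }
        Err(RpcRequestSendError::NoPeer(_)) => {
            *state = BatchState::AwaitingDownload;
            Ok(())
        }
        Err(RpcRequestSendError::InternalError(e)) => Err(format!("remove chain: {e}")),
    }
}
```

Backfill sync differs only in that a NoPeer error pauses the whole sync until another peer joins.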

TODOs

  • Add tests :)

Note: this touches the mainnet path!

@dapplion dapplion requested a review from jxs as a code owner February 6, 2025 05:16
@dapplion dapplion added the ready-for-review, syncing, and das (Data Availability Sampling) labels Feb 6, 2025
@dapplion dapplion requested a review from jimmygchen February 6, 2025 05:19

dapplion commented Feb 6, 2025

I would like to wait to merge this PR until we have more test coverage

```rust
}

/// Sends and registers the request of a batch awaiting download.
fn retry_batch_download(
```
Now there's no different prioritization between the first send and a retry: for both we prefer non-failed peers with the fewest active requests.
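
As a rough sketch of that policy, with `u64` standing in for libp2p's `PeerId` and hypothetical bookkeeping maps for the state the network context tracks:

```rust
use std::collections::{HashMap, HashSet};

type PeerId = u64; // stand-in for libp2p's PeerId

/// Prefer peers that have not failed this batch; break ties by fewest
/// active requests, as tracked by the network context.
fn prioritize_peer(
    candidates: &HashSet<PeerId>,
    failed: &HashSet<PeerId>,
    active_requests: &HashMap<PeerId, usize>,
) -> Option<PeerId> {
    candidates
        .iter()
        .min_by_key(|&&peer| {
            (
                failed.contains(&peer),                           // false sorts before true
                active_requests.get(&peer).copied().unwrap_or(0), // fewest in-flight requests
            )
        })
        .copied()
}
```

Tuples compare lexicographically, so a non-failed peer always wins before request counts are even compared.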

```diff
 idle_peers.shuffle(&mut rng);

-while let Some(peer) = idle_peers.pop() {
+loop {
```
If the implementation of include_next_batch is correct we should never loop infinitely here. But it's sus; I worry we could unintentionally introduce a bug here in the future.
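
To make the termination argument concrete, a toy model of the loop shape (a plain `pending` list stands in for `include_next_batch`): every iteration must either consume a batch or break, so if `include_next_batch` ever kept yielding the same batch, this would spin forever.

```rust
// Toy model of the loop: termination relies entirely on the "next batch"
// source shrinking on every successful send.
fn request_batches(mut pending: Vec<u64>) {
    loop {
        // stand-in for include_next_batch(): take the next batch awaiting download
        let Some(batch_id) = pending.pop() else {
            break; // no more batches to request: the only exit
        };
        // stand-in for sending the batch; on success it leaves AwaitingDownload,
        // so it can never be yielded again
        println!("sending batch {batch_id}");
    }
}
```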

```rust
    }
}

pub fn remove_peer(&mut self, peer_id: &PeerId) -> ProcessingResult {
    self.peers.remove(peer_id);
```
No need to track internal requests and fail them explicitly. If a disconnect interrupts an active request, since #6497 we will get an inject_error for that batch.
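
A sketch of the mechanism being relied on, with hypothetical names and stand-in types (the real wiring is in the network context, per #6497):

```rust
use std::collections::HashMap;

type PeerId = u64; // stand-in for libp2p's PeerId
type BatchId = u64;

struct NetworkContext {
    // active by-range requests keyed by the peer serving them
    active_requests: HashMap<PeerId, Vec<BatchId>>,
}

impl NetworkContext {
    // On disconnect, the network context fails the peer's in-flight requests,
    // so sync receives an inject_error for each affected batch; the syncing
    // chain itself only has to drop the peer from its set.
    fn on_peer_disconnected(&mut self, peer_id: &PeerId) {
        for batch_id in self.active_requests.remove(peer_id).unwrap_or_default() {
            self.inject_error(*peer_id, batch_id);
        }
    }

    fn inject_error(&mut self, peer_id: PeerId, batch_id: BatchId) {
        println!("batch {batch_id} failed: peer {peer_id} disconnected");
    }
}
```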

```rust
NoPeer(NoPeerError),
/// These errors should never happen, including unreachable custody errors or network send
/// errors.
InternalError(String),
```
I grouped errors into two variants, since both backfill and forward range sync need to handle the NoPeer case; everything else is treated as an internal error.

```rust
let columns_by_range_peers_to_request =
    if matches!(batch_type, ByRangeRequestType::BlocksAndColumns) {
        let column_indexes = self.network_globals().sampling_columns.clone();
        Some(self.select_columns_by_range_peers_to_request(&column_indexes, peers)?)
```
Select column peers before sending the block request, so if we don't have enough custody peers we error here.
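
A sketch of that selection step, with stand-in types and a hypothetical `custodies` predicate in place of the real custody lookup (the real code presumably also balances load among eligible peers; this only shows the fail-fast shape):

```rust
use std::collections::{HashMap, HashSet};

type PeerId = u64; // stand-in for libp2p's PeerId
type ColumnIndex = u64;

/// Group the sampling columns by a custody peer drawn from the syncing
/// chain's peer set; error before any block request is sent if some column
/// has no custodian among those peers.
fn select_columns_by_range_peers(
    column_indexes: &HashSet<ColumnIndex>,
    peers: &HashSet<PeerId>,
    custodies: impl Fn(PeerId, ColumnIndex) -> bool,
) -> Result<HashMap<PeerId, Vec<ColumnIndex>>, String> {
    let mut by_peer: HashMap<PeerId, Vec<ColumnIndex>> = HashMap::new();
    for &column in column_indexes {
        let peer = peers
            .iter()
            .copied()
            .find(|&peer| custodies(peer, column))
            .ok_or(format!("no custody peer for column {column}"))?;
        by_peer.entry(peer).or_default().push(column);
    }
    Ok(by_peer)
}
```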

```rust
| CustodyRequestError::UnexpectedRequestId { .. }
| CustodyRequestError::SendFailed { .. }) => {
    RpcRequestSendError::InternalError(format!("{e:?}"))
}
```
Coerce CustodyRequestError into InternalError so we don't have to handle the CustodyRequestError enum downstream

dapplion commented Feb 6, 2025

@jimmygchen I have reduced the scope of this PR. I intended to deprecate the good_peers_on_sampling_subnets check in this PR, but it's a very sensitive change. I left that logic untouched; what we do now is:

  • Check if good_peers_on_sampling_subnets
    • If no, don't create batch
    • If yes, create batch and send it
      • If we don't have enough custody peers, error and drop the chain

In the future we can deprecate the good_peers_on_sampling_subnets check by allowing batches to remain in AwaitingDownload state. It's essentially duplicate code, as we check for peers twice. It should also make sync less likely to drop chains, as we did in lookup sync by allowing batches to be peer-less for some time.
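
In sketch form, the current flow looks like this (closures stand in for the real sync methods):

```rust
// Sketch of the flow above: the peer check happens twice, once up front and
// once inside the send path, which is the duplication to remove later.
fn try_create_and_send_batch(
    good_peers_on_sampling_subnets: impl Fn() -> bool,
    create_batch: impl Fn() -> u64,
    send_batch: impl Fn(u64) -> Result<(), String>,
) -> Result<(), String> {
    if !good_peers_on_sampling_subnets() {
        return Ok(()); // if no: don't create the batch, retry later
    }
    // if yes: create the batch and send it
    let batch_id = create_batch();
    // if there aren't enough custody peers after all: error and drop the chain
    send_batch(batch_id).map_err(|e| format!("dropping chain: {e}"))
}
```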

I added a TODO(das) to tackle this in another PR:

```rust
// TODO(das): Handle the NoPeer case explicitly and don't drop the batch. For
// sync to work properly it must be okay to have "stalled" batches in
// AwaitingDownload state. Currently it will error with invalid state if
// that happens. Sync manager must periodically prune stalled batches like
// we do for lookup sync. Then we can deprecate the redundant
// `good_peers_on_sampling_subnets` checks.
```
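
The pruning that TODO describes might look roughly like this (entirely hypothetical; not part of this PR):

```rust
use std::time::{Duration, Instant};

const STALE_BATCH_TIMEOUT: Duration = Duration::from_secs(60); // hypothetical value

struct StalledBatch {
    batch_id: u64,
    awaiting_download_since: Instant,
}

// Periodically prune batches stuck in AwaitingDownload, as lookup sync does,
// instead of erroring the moment no peer is available.
fn prune_stalled_batches(batches: &mut Vec<StalledBatch>) {
    batches.retain(|batch| {
        let stalled = batch.awaiting_download_since.elapsed() > STALE_BATCH_TIMEOUT;
        if stalled {
            println!("pruning stalled batch {}", batch.batch_id);
        }
        !stalled
    });
}
```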

mergify bot pushed a commit that referenced this pull request Feb 11, 2025
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter, `good_peers_on_sampling_subnets`. Another PR, #6922, deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the spawned metrics task, which refreshes every 5 seconds (see the sketch after the list below).

I also added a few more useful metrics. The total set and intended usage is:

- `sync_peers_per_column_subnet`: Track health of overall subnets in your node
- `sync_peers_per_custody_column_subnet`: Track health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
- ~~`sync_column_subnets_with_zero_peers`: Equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric, since if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
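
A sketch of the refresh loop described above, with hypothetical names (assumes a tokio runtime and some source of per-subnet peer counts):

```rust
use std::collections::HashMap;
use std::time::Duration;

type SubnetId = u64;

// Recompute peers-per-column-subnet every 5 seconds in a spawned task,
// rather than inside a getter on the hot path.
async fn metrics_refresh_loop(peer_counts: impl Fn() -> HashMap<SubnetId, usize>) {
    let mut interval = tokio::time::interval(Duration::from_secs(5));
    loop {
        interval.tick().await;
        for (subnet, count) in peer_counts() {
            // stand-in for the real gauge update, e.g. something like
            // metrics::set_gauge_vec(&SYNC_PEERS_PER_COLUMN_SUBNET, ...)
            println!("sync_peers_per_column_subnet{{subnet=\"{subnet}\"}} {count}");
        }
    }
}
```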