feat: adds CLI flags to delay publishing for edge case testing on PeerDAS devnets #6947
base: unstable
// Add delay before publishing the block to the network.
if let Some(block_publishing_delay) = block_publishing_delay {
    std::thread::sleep(block_publishing_delay);
}
Okay, so after messing with this for a while:
"std::thread::sleep: Used in the synchronous closure publish_block_p2p because async closures are unstable and .await cannot be used there."
Yeah, I couldn't see an easy way of converting both to an async context. It's a shame because we are in an async context already :/
Even though this is for testing only, it may significantly impact testing performance, and it may be even worse given that we usually run tests in resource-constrained environments (e.g. Kurtosis, 4+ nodes on a machine).
I'm thinking maybe we just do the block publishing delay for the BroadcastValidation::Gossip case, since we mainly use this in devnet testing?
if BroadcastValidation::Gossip == validation_level && should_publish_block {
    // add delay here
    publish_block_p2p(
        block.clone(),
        sender_clone.clone(),
        log.clone(),
        seen_timestamp,
    )
    .map_err(|_| warp_utils::reject::custom_server_error("unable to publish".into()))?;
}
and if we really want to cover the other two broadcast variants, async closure is stabilising in two weeks... 🤩
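One way to picture the Gossip-only suggestion is a small helper that decides whether the configured delay applies at all. This is only a sketch: `effective_block_delay` is a hypothetical name, and the enum below is a local stand-in for the validation levels mentioned in this thread, not the real type.

```rust
use std::time::Duration;

/// Local stand-in for the broadcast validation levels named in this thread.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BroadcastValidation {
    Gossip,
    Consensus,
    ConsensusAndEquivocation,
}

/// Hypothetical helper: returns the delay to apply before publishing the
/// block, or `None` when the delay should be skipped.
fn effective_block_delay(
    validation_level: BroadcastValidation,
    configured: Option<Duration>,
) -> Option<Duration> {
    match validation_level {
        // Only the gossip path honours the test-only delay.
        BroadcastValidation::Gossip => configured,
        // Per the discussion, covering these variants is what would need a
        // blocking sleep, so the delay is skipped for them in this sketch.
        BroadcastValidation::Consensus | BroadcastValidation::ConsensusAndEquivocation => None,
    }
}

fn main() {
    let configured = Some(Duration::from_millis(1500));
    for level in [
        BroadcastValidation::Gossip,
        BroadcastValidation::Consensus,
        BroadcastValidation::ConsensusAndEquivocation,
    ] {
        println!("{:?} -> {:?}", level, effective_block_delay(level, configured));
    }
}
```

With something like this, the caller in the Gossip branch could await an async sleep on the returned value, and the other branches would publish immediately.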
^ As @ethDreamer pointed out, the issue with this is that it also delays data column publishing, and I don't see any easy way around it without making this function even more complex. That said, I think the specific scenario where we want to test block delay without column delay is pretty low value, so we can potentially just leave this as a known issue.
// Add delay before publishing the data columns to the network.
if let Some(data_column_publishing_delay) = data_column_publishing_delay {
    tokio::time::sleep(data_column_publishing_delay).await;
}
The timing here is particularly difficult to reason about due to the way this code is structured. It essentially boils down to:
if validation_level == Gossip {
    sleep(block_publishing_delay)
    publish_block()
    sleep(data_column_publishing_delay)
    publish_data_columns()
} else {
    sleep(data_column_publishing_delay)
    publish_data_columns()
    sleep(block_publishing_delay)
    publish_block()
}
This is probably not what you really wanted. A quick look through the code indicates we (at least in every example I saw) call this with validation_level == Gossip.
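To make the ordering concrete, here is a small sketch that computes when each publish would happen relative to the start of the call, following the pseudocode in this comment. The function name `publish_offsets` and the local enum are hypothetical, and the time spent in the publish calls themselves is ignored.

```rust
use std::time::Duration;

/// Local stand-in for the broadcast validation levels named in this thread.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BroadcastValidation {
    Gossip,
    Consensus,
    ConsensusAndEquivocation,
}

/// Hypothetical model of the ordering: returns
/// `(block_publish_offset, column_publish_offset)` measured from the start
/// of the call, counting only the configured sleeps.
fn publish_offsets(
    validation_level: BroadcastValidation,
    block_delay: Duration,
    column_delay: Duration,
) -> (Duration, Duration) {
    if validation_level == BroadcastValidation::Gossip {
        // Block first, then columns: the column delay stacks on the block delay.
        (block_delay, block_delay + column_delay)
    } else {
        // Columns first, then block: the block delay stacks on the column delay.
        (column_delay + block_delay, column_delay)
    }
}

fn main() {
    let block_delay = Duration::from_secs(1);
    let column_delay = Duration::from_secs(2);
    for level in [
        BroadcastValidation::Gossip,
        BroadcastValidation::Consensus,
        BroadcastValidation::ConsensusAndEquivocation,
    ] {
        let (block_at, columns_at) = publish_offsets(level, block_delay, column_delay);
        println!("{:?}: block at {:?}, columns at {:?}", level, block_at, columns_at);
    }
}
```

In both branches the second publish is delayed by the sum of the two settings, so neither flag acts as an independent offset.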
Oops, I commented in the wrong place: #6947 (comment)
    DO NOT USE IN PRODUCTION.")
    .hide(true)
    .display_order(0)
)
The call to publish_block is triggered by the validator client calling /eth/v2/beacon/blocks or a similar endpoint. Currently this endpoint has a default timeout value of:
slot_duration / HTTP_PROPOSAL_TIMEOUT_QUOTIENT = 6 seconds
So there's an implicit requirement that the sum of these two delays not exceed that timeout if you don't want to see error messages on the validator client.
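The budget can be checked with a few lines of arithmetic. In this sketch the 12-second slot and the quotient of 2 are assumptions chosen to reproduce the 6-second figure quoted above, and `fits_within_timeout` is a hypothetical helper, not something in the PR.

```rust
use std::time::Duration;

// Assumption: a 12-second slot.
const SLOT_DURATION: Duration = Duration::from_secs(12);
// Assumption: a quotient of 2, inferred from the 6-second timeout quoted above.
const HTTP_PROPOSAL_TIMEOUT_QUOTIENT: u32 = 2;

/// Timeout the validator client is assumed to apply to the publish request.
fn proposal_timeout() -> Duration {
    SLOT_DURATION / HTTP_PROPOSAL_TIMEOUT_QUOTIENT
}

/// Hypothetical check: do the two configured delays leave any headroom
/// under the timeout? Unset delays count as zero.
fn fits_within_timeout(block_delay: Option<Duration>, column_delay: Option<Duration>) -> bool {
    let total = block_delay.unwrap_or_default() + column_delay.unwrap_or_default();
    // Strictly less than: the publishing work itself also takes time.
    total < proposal_timeout()
}

fn main() {
    let block_delay = Some(Duration::from_secs(2));
    let column_delay = Some(Duration::from_secs(3));
    println!("timeout: {:?}", proposal_timeout());
    println!("fits: {}", fits_within_timeout(block_delay, column_delay));
}
```

A check like this could also run at startup to warn when the two flags together are likely to trigger timeouts on the validator client.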
Good point. Discussed with @ethDreamer offline and we're mostly interested in testing delays between the 0-4s mark, so this should be ok for the purpose of testing.
@@ -887,6 +887,14 @@ pub fn get_config<E: EthSpec>(
        .max_gossip_aggregate_batch_size =
        clap_utils::parse_required(cli_args, "beacon-processor-aggregate-batch-size")?;

    if let Some(delay) = clap_utils::parse_optional(cli_args, "delay-block-publishing")? {
        client_config.chain.block_publishing_delay = Some(Duration::from_secs(delay));
I think we could use from_secs_f64 here to allow us to play with fractional seconds?
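As a rough illustration of the fractional-seconds idea, the sketch below parses a raw flag value into a `Duration`. `parse_delay_secs` is a hypothetical helper, not part of the PR. It swaps in `Duration::try_from_secs_f64` for `from_secs_f64`, because `from_secs_f64` panics on negative, NaN, or overflowing input, while the `try_` variant returns an error.

```rust
use std::time::Duration;

/// Hypothetical helper: parse a CLI value such as "0.5" into a delay.
fn parse_delay_secs(raw: &str) -> Result<Duration, String> {
    // First parse the text as a floating-point number of seconds.
    let secs: f64 = raw
        .trim()
        .parse()
        .map_err(|e| format!("invalid delay {raw:?}: {e}"))?;
    // Then convert, rejecting negative, NaN, and overflowing values
    // instead of panicking.
    Duration::try_from_secs_f64(secs).map_err(|e| format!("invalid delay {raw:?}: {e}"))
}

fn main() {
    for raw in ["0.5", "2", "-1", "abc"] {
        println!("{raw:?} -> {:?}", parse_delay_secs(raw));
    }
}
```

The flag would then need to be parsed as `f64` rather than as an integer for values like `0.5` to be accepted.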
Nice work @kamuik16!
Please see the comments above. I'm OK with the trade-off of not covering the following to keep things simple (it would be worth documenting in the CLI description though!):
- delaying block publishing without delaying columns: this scenario is less important, since we expect other nodes in the network to publish columns anyway due to distributed blob publishing.
- covering BroadcastValidation::Consensus and BroadcastValidation::ConsensusAndEquivocation: if we just implement it for Gossip (which is the main one used in testing), then we shouldn't need the thread::sleep.
What do you think?
Issue Addressed
Closes #6919
Additional Info
Added two optional config fields (block_publishing_delay and data_column_publishing_delay) to ChainConfig, with defaults set to None.
CLI Flags:
- Introduced hidden CLI arguments (--delay-block-publishing and --delay-data-column-publishing) to set delays (in seconds) for testing purposes.
- For block publishing: Modified the publish_block_p2p closure to accept a delay parameter. If set, the closure calls std::thread::sleep(delay) before publishing.
- For data columns: Before calling publish_column_sidecars, the code awaits a tokio::time::sleep(delay).
Reason for Different Sleep Methods:
- std::thread::sleep: Used in the synchronous closure publish_block_p2p because async closures are unstable and .await cannot be used there.
- tokio::time::sleep: Used in the async part of the function to avoid blocking the executor.
I also might be completely wrong here, do correct me, or feel free to discard the PR if this is not the solution 😄.
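The shape of the change described above can be sketched with the standard library alone. Everything here is a simplified stand-in: `PublishingDelays` mirrors only the two new fields rather than the real `ChainConfig`, and `publish_with_delay` is a hypothetical analogue of the synchronous closure, with the actual network publish replaced by a caller-supplied function.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Simplified stand-in for the two fields the PR adds to `ChainConfig`.
/// Deriving `Default` gives `None` for both, matching the described defaults.
#[derive(Debug, Default, Clone, PartialEq)]
struct PublishingDelays {
    block_publishing_delay: Option<Duration>,
    data_column_publishing_delay: Option<Duration>,
}

/// Hypothetical synchronous publish step: sleeps for the configured delay,
/// if any, and then runs the supplied publish function.
fn publish_with_delay<F>(delay: Option<Duration>, publish: F) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    if let Some(delay) = delay {
        // Blocks the current thread, which is the cost raised in review.
        thread::sleep(delay);
    }
    publish()
}

fn main() {
    let delays = PublishingDelays {
        block_publishing_delay: Some(Duration::from_millis(50)),
        ..Default::default()
    };
    let start = Instant::now();
    publish_with_delay(delays.block_publishing_delay, || {
        println!("block published");
        Ok(())
    })
    .expect("publish should succeed");
    println!("elapsed: {:?}", start.elapsed());
}
```

Because the sleep runs on the calling thread, anything else scheduled on that thread waits too, which is why the async path uses a non-blocking sleep instead.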