feat: adds CLI flags to delay publishing for edge case testing on PeerDAS devnets #6947

6 changes: 6 additions & 0 deletions beacon_node/beacon_chain/src/chain_config.rs
@@ -94,6 +94,10 @@ pub struct ChainConfig {
/// The delay in milliseconds applied by the node between sending each blob or data column batch.
/// This doesn't apply if the node is the block proposer.
pub blob_publication_batch_interval: Duration,
/// Artificial delay for block publishing. For PeerDAS testing only.
pub block_publishing_delay: Option<Duration>,
/// Artificial delay for data column publishing. For PeerDAS testing only.
pub data_column_publishing_delay: Option<Duration>,
}

impl Default for ChainConfig {
@@ -129,6 +133,8 @@ impl Default for ChainConfig {
enable_sampling: false,
blob_publication_batches: 4,
blob_publication_batch_interval: Duration::from_millis(300),
block_publishing_delay: None,
data_column_publishing_delay: None,
}
}
}
16 changes: 15 additions & 1 deletion beacon_node/http_api/src/publish_blocks.rs
@@ -86,6 +86,8 @@ pub async fn publish_block<T: BeaconChainTypes, B: IntoGossipVerifiedBlock<T>>(
network_globals: Arc<NetworkGlobals<T::EthSpec>>,
) -> Result<Response, Rejection> {
let seen_timestamp = timestamp_now();
let block_publishing_delay = chain.config.block_publishing_delay;
let data_column_publishing_delay = chain.config.data_column_publishing_delay;

let (unverified_block, unverified_blobs, is_locally_built_block) = match provenanced_block {
ProvenancedBlock::Local(block, blobs, _) => (block, blobs, true),
@@ -103,8 +105,13 @@ pub async fn publish_block<T: BeaconChainTypes, B: IntoGossipVerifiedBlock<T>>(
let publish_block_p2p = move |block: Arc<SignedBeaconBlock<T::EthSpec>>,
sender,
log,
- seen_timestamp|
+ seen_timestamp,
+ block_publishing_delay|
-> Result<(), BlockError> {
// Add delay before publishing the block to the network.
if let Some(block_publishing_delay) = block_publishing_delay {
std::thread::sleep(block_publishing_delay);
}
Member

Okay so after messing with this for a while:

  • std::thread::sleep: Used in the synchronous closure publish_block_p2p because async closures are unstable and .await cannot be used there.

Yeah I couldn't see an easy way of converting both to an async context. It's a shame because we are in an async context already :/

Member

@jimmygchen, Feb 13, 2025

Even though this is for testing only, a blocking std::thread::sleep inside the async handler may significantly impact testing performance, and it may be even worse given that we usually run tests in resource-constrained environments (e.g. Kurtosis, 4+ nodes on a machine).

I'm thinking maybe we just do the block publishing delay for the BroadcastValidation::Gossip case, since we mainly use this in devnet testing?

    if BroadcastValidation::Gossip == validation_level && should_publish_block {
        // add delay here
        publish_block_p2p(
            block.clone(),
            sender_clone.clone(),
            log.clone(),
            seen_timestamp,
        )
        .map_err(|_| warp_utils::reject::custom_server_error("unable to publish".into()))?;
    }

and if we really want to cover the other two broadcast variants, async closure is stabilising in two weeks... 🤩

Member

^ As @ethDreamer pointed out, the issue with this is that it also delays data column publishing, and I don't see any easy way around it without making this function even more complex. That said, I think the specific scenario where we want to test a block delay without a column delay is pretty low value, so we can potentially just leave this as a known issue.

let publish_timestamp = timestamp_now();
let publish_delay = publish_timestamp
.checked_sub(seen_timestamp)
@@ -152,6 +159,7 @@ pub async fn publish_block<T: BeaconChainTypes, B: IntoGossipVerifiedBlock<T>>(
sender_clone.clone(),
log.clone(),
seen_timestamp,
block_publishing_delay,
)
.map_err(|_| warp_utils::reject::custom_server_error("unable to publish".into()))?;
}
@@ -167,6 +175,7 @@ pub async fn publish_block<T: BeaconChainTypes, B: IntoGossipVerifiedBlock<T>>(
sender_clone.clone(),
log.clone(),
seen_timestamp,
block_publishing_delay,
)?,
BroadcastValidation::ConsensusAndEquivocation => {
check_slashable(&chain, block_root, &block_to_publish, &log)?;
@@ -175,6 +184,7 @@ pub async fn publish_block<T: BeaconChainTypes, B: IntoGossipVerifiedBlock<T>>(
sender_clone.clone(),
log.clone(),
seen_timestamp,
block_publishing_delay,
)?;
}
};
@@ -207,6 +217,10 @@ pub async fn publish_block<T: BeaconChainTypes, B: IntoGossipVerifiedBlock<T>>(
}

if gossip_verified_columns.iter().map(Option::is_some).count() > 0 {
// Add delay before publishing the data columns to the network.
if let Some(data_column_publishing_delay) = data_column_publishing_delay {
tokio::time::sleep(data_column_publishing_delay).await;
}
Member

@ethDreamer, Feb 11, 2025

The timing here is particularly difficult to reason about due to the way this code is structured. It essentially boils down to:

if validation_level == Gossip {
    sleep(block_publishing_delay)
    publish_block()
    sleep(data_column_publishing_delay)
    publish_data_columns()
} else {
    sleep(data_column_publishing_delay)
    publish_data_columns()
    sleep(block_publishing_delay)
    publish_block()
}

This is probably not what you wanted. A quick look through the code indicates that we call this with validation_level == Gossip, at least in every example I saw.
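
To make the ordering above easier to check, here is a minimal std-only sketch that models it as data. It is not code from this PR: `ValidationLevel`, `Step`, and `publish_schedule` are hypothetical names, and the non-Gossip ordering simply follows the pseudocode above rather than a fresh reading of `publish_block`.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the broadcast validation levels discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ValidationLevel {
    Gossip,
    Consensus,
    ConsensusAndEquivocation,
}

/// One observable step in the publish sequence.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Step {
    Sleep(Duration),
    PublishBlock,
    PublishDataColumns,
}

/// Returns the order of sleeps and publishes implied by the pseudocode above.
/// A delay of `None` contributes no sleep step.
pub fn publish_schedule(
    level: ValidationLevel,
    block_delay: Option<Duration>,
    column_delay: Option<Duration>,
) -> Vec<Step> {
    let block_steps = block_delay
        .map(Step::Sleep)
        .into_iter()
        .chain([Step::PublishBlock]);
    let column_steps = column_delay
        .map(Step::Sleep)
        .into_iter()
        .chain([Step::PublishDataColumns]);

    match level {
        // Gossip: the block goes out first, then the data columns.
        ValidationLevel::Gossip => block_steps.chain(column_steps).collect(),
        // Other levels: the data columns go out first, then the block.
        _ => column_steps.chain(block_steps).collect(),
    }
}
```

In both branches the second publish waits for the sum of the two delays, which is why a block delay cannot be tested in isolation from a column delay in the Gossip case.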

Member

@jimmygchen, Feb 13, 2025

Oops, I commented in the wrong place; see #6947 (comment).

publish_column_sidecars(network_tx, &gossip_verified_columns, &chain).map_err(|_| {
warp_utils::reject::custom_server_error("unable to publish data column sidecars".into())
})?;
22 changes: 22 additions & 0 deletions beacon_node/src/cli.rs
@@ -1590,5 +1590,27 @@ pub fn cli_app() -> Command {
.action(ArgAction::Set)
.display_order(0)
)
.arg(
Arg::new("delay-block-publishing")
.long("delay-block-publishing")
.value_name("SECONDS")
.action(ArgAction::Set)
.help_heading(FLAG_HEADER)
.help("TESTING ONLY: Artificially delay block publishing by the specified number of seconds. \
DO NOT USE IN PRODUCTION.")
.hide(true)
.display_order(0)
)
.arg(
Arg::new("delay-data-column-publishing")
.long("delay-data-column-publishing")
.value_name("SECONDS")
.action(ArgAction::Set)
.help_heading(FLAG_HEADER)
.help("TESTING ONLY: Artificially delay data column publishing by the specified number of seconds. \
DO NOT USE IN PRODUCTION.")
.hide(true)
.display_order(0)
)
Member

The call to publish_block is triggered by the validator client calling /eth/v2/beacon/blocks or similar endpoint. Currently this endpoint has a default timeout value of:

slot_duration / HTTP_PROPOSAL_TIMEOUT_QUOTIENT = 6 seconds

So there's an implicit requirement that the sum of these two delays not exceed that timeout if you don't want to see error messages on the validator client.
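
The budget described above can be expressed as a small check. This is only a sketch: `delays_fit_timeout` is a hypothetical helper, not something in this PR, and the 12-second slot with a quotient of 2 used in the usage note is an assumption chosen because it reproduces the 6 seconds quoted above.

```rust
use std::time::Duration;

/// Hypothetical helper: returns true when the combined publishing delays
/// stay within `slot_duration / timeout_quotient`.
pub fn delays_fit_timeout(
    slot_duration: Duration,
    timeout_quotient: u32,
    block_delay: Option<Duration>,
    column_delay: Option<Duration>,
) -> bool {
    // Dividing a Duration by zero panics, so reject that case up front.
    if timeout_quotient == 0 {
        return false;
    }
    let timeout = slot_duration / timeout_quotient;

    // A missing delay counts as zero; checked_add guards against overflow.
    let total = block_delay
        .unwrap_or_default()
        .checked_add(column_delay.unwrap_or_default());

    matches!(total, Some(t) if t <= timeout)
}
```

Under those assumed values, `delays_fit_timeout(Duration::from_secs(12), 2, Some(Duration::from_secs(4)), None)` holds, while a 4-second block delay combined with a 3-second column delay does not.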

Member

Good point. Discussed with @ethDreamer offline and we're mostly interested in testing delays between the 0-4s mark, so this should be ok for the purpose of testing.

.group(ArgGroup::new("enable_http").args(["http", "gui", "staking"]).multiple(true))
}
8 changes: 8 additions & 0 deletions beacon_node/src/config.rs
@@ -887,6 +887,14 @@ pub fn get_config<E: EthSpec>(
.max_gossip_aggregate_batch_size =
clap_utils::parse_required(cli_args, "beacon-processor-aggregate-batch-size")?;

if let Some(delay) = clap_utils::parse_optional(cli_args, "delay-block-publishing")? {
client_config.chain.block_publishing_delay = Some(Duration::from_secs(delay));
Member

I think we could use from_secs_f64 here to allow fractional-second delays?
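
A rough sketch of what fractional parsing could look like, using only the standard library. `parse_delay_secs` is a hypothetical helper and does not show how `clap_utils::parse_optional` would be wired up. It uses `Duration::try_from_secs_f64` rather than `from_secs_f64`, because the latter panics on negative, non-finite, or overflowing input.

```rust
use std::time::Duration;

/// Hypothetical helper: parses a CLI value such as "0.5" into a Duration,
/// returning an error string instead of panicking on bad input.
pub fn parse_delay_secs(raw: &str) -> Result<Duration, String> {
    // Step 1: text to f64. Note that "NaN" and "inf" parse successfully here.
    let secs: f64 = raw
        .trim()
        .parse()
        .map_err(|e| format!("invalid delay {raw:?}: {e}"))?;

    // Step 2: f64 to Duration. Rejects negative, NaN, infinite, and
    // out-of-range values.
    Duration::try_from_secs_f64(secs).map_err(|e| format!("invalid delay {raw:?}: {e}"))
}
```

With this shape, a value of `0.5` yields a 500 ms delay, while `-1` or `NaN` is reported as a configuration error.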

}

if let Some(delay) = clap_utils::parse_optional(cli_args, "delay-data-column-publishing")? {
client_config.chain.data_column_publishing_delay = Some(Duration::from_secs(delay));
}

Ok(client_config)
}
