Shutdown fix test #1885

Closed
wants to merge 32 commits into from

ss-es (Contributor) commented Aug 21, 2024

do not merge

jbearer and others added 30 commits August 15, 2024 15:52
Set up some Rust automation for tests that spin up a sequencer
network and restart various combinations of nodes, checking that we
recover liveness. Instantiate the framework with several combinations
of nodes as outlined in https://www.notion.so/espressosys/Persistence-catchup-and-restartability-cf4ddb79df2e41a993e60e3beaa28992.

As expected, the tests where we restart >f nodes do not pass yet,
and are ignored. The others pass locally.

There are many things left to test here, including:
* Testing with a valid libp2p setup
* Testing with _only_ libp2p and no CDN
* Checking integrity of the DA/query service during and after restart
But this is a pretty good starting point.

I considered doing this with something more dynamic like Bash or
Python scripting, leaning on our existing docker-compose or process-compose
infrastructure to spin up a network. I avoided this for a few reasons:
* process-compose is annoying to script and in particular has limited
  capabilities for shutting down and starting up processes
* both docker-compose and process-compose make it hard to dynamically
  choose the network topology
* once the basic test infrastructure is out of the way, Rust is far
  easier to work with for writing new checks and assertions. For
  example, checking for progress is way easier when we can plug
  directly into the HotShot event stream, vs subscribing to some
  stream via HTTP and parsing responses with jq
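
The progress check mentioned in the last bullet could look roughly like the sketch below. It is illustrative only: `Event`, `wait_for_decides`, and the use of a standard-library channel are assumptions made for the example and are not the HotShot event stream API. The idea is to count decide events until a deadline and report an error, rather than panic, if liveness is not observed.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a consensus event; not the real HotShot type.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Decide { view: u64 },
    Other,
}

/// Wait until `count` decide events have arrived or `timeout` elapses.
/// Returns the view of the last decide seen (0 if `count` is 0).
pub fn wait_for_decides(
    events: &Receiver<Event>,
    count: usize,
    timeout: Duration,
) -> Result<u64, String> {
    let deadline = Instant::now() + timeout;
    let mut seen = 0;
    let mut last_view = 0;
    while seen < count {
        // Time left before the overall deadline; zero once it has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        match events.recv_timeout(remaining) {
            Ok(Event::Decide { view }) => {
                seen += 1;
                last_view = view;
            }
            // Events that do not indicate progress are skipped.
            Ok(Event::Other) => {}
            Err(RecvTimeoutError::Timeout) => {
                return Err(format!("timed out after {seen} of {count} decides"));
            }
            Err(RecvTimeoutError::Disconnected) => {
                return Err("event stream closed".to_string());
            }
        }
    }
    Ok(last_view)
}
```

A restart test would call something like `wait_for_decides(&rx, 2, Duration::from_secs(60))` after bringing nodes back up and fail the test on `Err`.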

This is needed for the restart tests, where initialization can sometimes
fail after a restart due to the libp2p port not being deallocated by the
OS quickly enough. This necessitates a retry loop, which means all error
cases need to return an error rather than panicking.
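
A minimal sketch of such a retry loop follows. The helper name `retry_init`, its parameters, and the fixed delay are assumptions for illustration, not code from this PR. It shows why initialization has to return `Result`: a panic inside `init` would abort the loop instead of letting it try again.

```rust
use std::fmt::Debug;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: call `init` until it succeeds or `max_attempts`
/// attempts have been made. The closure receives the 1-based attempt number.
pub fn retry_init<T, E: Debug>(
    max_attempts: u32,
    delay: Duration,
    mut init: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match init(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(err) => {
                eprintln!("init attempt {attempt} failed: {err:?}; retrying");
                // Give the OS time to release resources such as a bound port.
                sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

The delay is fixed here for brevity; a real test harness might prefer a backoff.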

Previously, the database used by the query API was populated by an
event-handling task completely separate from the one that updates consensus
storage. This could lead to a situation where consensus storage has already
been updated with a newly decided leaf but API storage has not. If the node
then restarts, consensus thinks it is on a later leaf, but the query API has
never seen this leaf and never will, and thus cannot make it available: a DA
failure.

With this change, the query database is now populated from the consensus
storage, so that consensus storage is authoritative, and the query database
is guaranteed to always eventually reflect the status of consensus storage.
The movement of data from consensus storage to query storage is tied in with
consensus garbage collection, so that we do not delete any data until we are
sure it has been recorded in the DA database, if appropriate.

This also obsoletes the in-memory payload storage in HotShot, since we are
now able to load full payloads from storage on each decide, if available.
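
The ordering guarantee described above (copy to the query database first, delete from consensus storage only afterwards) can be sketched as follows. All types and names here are hypothetical in-memory stand-ins, not the sequencer's persistence API, and the sketch ignores the "if appropriate" case of nodes that do not run a DA database.

```rust
use std::collections::BTreeMap;

/// Hypothetical decided leaf, keyed by view number.
#[derive(Clone, Debug, PartialEq)]
pub struct Leaf {
    pub view: u64,
    pub payload: Vec<u8>,
}

/// Stand-in for the query (DA) database; `offline` simulates a failed write.
#[derive(Default)]
pub struct QueryStore {
    pub leaves: BTreeMap<u64, Leaf>,
    pub offline: bool,
}

impl QueryStore {
    pub fn insert(&mut self, leaf: Leaf) -> Result<(), String> {
        if self.offline {
            return Err("query database unavailable".to_string());
        }
        self.leaves.insert(leaf.view, leaf);
        Ok(())
    }
}

/// Stand-in for consensus storage, which is the authoritative copy.
#[derive(Default)]
pub struct ConsensusStore {
    pub leaves: BTreeMap<u64, Leaf>,
}

impl ConsensusStore {
    /// Move every leaf up to `decided_view` into `query`, then delete it here.
    /// Nothing is deleted unless every copy succeeded.
    pub fn collect_garbage(
        &mut self,
        decided_view: u64,
        query: &mut QueryStore,
    ) -> Result<usize, String> {
        let decided: Vec<Leaf> = self
            .leaves
            .range(..=decided_view)
            .map(|(_, leaf)| leaf.clone())
            .collect();
        for leaf in &decided {
            query.insert(leaf.clone())?;
        }
        for leaf in &decided {
            self.leaves.remove(&leaf.view);
        }
        // Zero is a valid result: no newly decided leaves is not an error.
        Ok(decided.len())
    }
}
```

If a write to the query store fails, the leaves stay in consensus storage, so a later garbage-collection pass (including one after a restart) can retry the copy.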

* Don't panic in SQL persistence `collect_garbage` when no new leaves
  are decided
* Don't fail fs persistence `load_quorum_proposals` when the proposals
  directory does not exist
* Better logging for libp2p startup
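
The second fix in the list above follows a common pattern: treat a missing directory as "nothing stored yet" instead of as an error. The sketch below illustrates that pattern with the standard library only. `list_proposal_files` is a hypothetical helper, not the actual `load_quorum_proposals` implementation, and it only lists paths rather than decoding proposals.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: list the files in a proposals directory, returning
/// an empty list when the directory has not been created yet.
pub fn list_proposal_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let entries = match fs::read_dir(dir) {
        Ok(entries) => entries,
        // A fresh node has no proposals directory; that is not a failure.
        Err(err) if err.kind() == io::ErrorKind::NotFound => return Ok(Vec::new()),
        // Any other I/O error (permissions, etc.) is still reported.
        Err(err) => return Err(err),
    };
    let mut files = Vec::new();
    for entry in entries {
        files.push(entry?.path());
    }
    // Sort for a deterministic order, since `read_dir` order is unspecified.
    files.sort();
    Ok(files)
}
```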
…processed

Store last processed leaf view in Postgres rather than trying to
dead reckon.

These tests require non-DA nodes to store merklized state. Disabling
until we have that functionality.
@jparr721 jparr721 closed this Feb 13, 2025
@jparr721 jparr721 deleted the ss/fix-shutdown branch February 13, 2025 20:52
4 participants