[Feature Request] Integrate Jepsen Testing for Enhanced Fault Tolerance #17340

Open
laminelam opened this issue Feb 12, 2025 · 4 comments
Labels
Cluster Manager, enhancement (Enhancement or improvement to existing feature or request), Other, untriaged

Comments

@laminelam

Is your feature request related to a problem? Please describe

OpenSearch lacks formal testing under network partitions and other fault conditions, which is what Jepsen specializes in. This testing is crucial for ensuring the database maintains consistency and handles failures as expected.

Describe the solution you'd like

Integrate Jepsen testing into the OpenSearch testing framework to rigorously assess and improve the system's fault tolerance and consistency. This would include setting up scenarios to test network partitions, data consistency, and recovery mechanisms.

Related component

Other

Describe alternatives you've considered

No response

Additional context

No response

@laminelam added the enhancement (Enhancement or improvement to existing feature or request) and untriaged labels Feb 12, 2025
@github-actions bot added the Other label Feb 12, 2025
@msfroh
Collaborator

msfroh commented Feb 12, 2025

> This testing is crucial for ensuring the database maintains consistency and handles failures as expected.

There is an open question of whether OpenSearch is a database.

I would argue that OpenSearch should embrace eventual consistency rather than trying to maintain ACID-style semantics, especially since Lucene itself is eventually consistent: writes are not visible until you open a new reader.
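
To make the reader-visibility point concrete, here is a toy Python model. It is only a sketch and not the Lucene API: `ToyIndex`, `add`, and `open_reader` are hypothetical names, and the model assumes a reader is simply a point-in-time snapshot of the documents written so far.

```python
class ToyIndex:
    """Toy model of point-in-time readers (hypothetical, not the Lucene API)."""

    def __init__(self):
        self._docs = []  # every document written so far

    def add(self, doc):
        # The write is accepted, but readers opened earlier will not see it.
        self._docs.append(doc)

    def open_reader(self):
        # A reader is modeled as an immutable snapshot taken at open time.
        return tuple(self._docs)


index = ToyIndex()
old_reader = index.open_reader()
index.add("doc-1")

# The reader opened before the write still sees the old state.
saw_in_old = "doc-1" in old_reader

# Only a newly opened reader observes the write.
new_reader = index.open_reader()
saw_in_new = "doc-1" in new_reader
```

Under this model a test that writes and then immediately searches through an older reader would report a missing document even though nothing was lost, which is why any consistency check needs to account for when the reader was opened.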

@Pallavi-AWS
Member

Jepsen also covers durability and fault tolerance is my understanding - can we evaluate using Jepsen for chaos testing?

@msfroh
Collaborator

msfroh commented Feb 12, 2025

> Jepsen also covers durability and fault tolerance is my understanding - can we evaluate using Jepsen for chaos testing?

Definitely! I should clarify that I'm not against using Jepsen to test OpenSearch. It's a pretty awesome framework.

@andrross
Member

I agree with @msfroh about the stance of not expecting OpenSearch to have ACID-style semantics.

I've added the cluster manager label here because we do expect the cluster management/leader election to be consistent and fault tolerant, and a framework like Jepsen could be valuable. There's probably value beyond cluster management as well, such as asserting the behavior of the primary-replica model of shards and verifying that acknowledged writes are not lost (when replicas are configured).
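
As a rough illustration of the "acknowledged writes are not lost" assertion, the check could look something like the Python sketch below. It is not Jepsen code (Jepsen tests are written in Clojure) and not an OpenSearch API: the `WriteOp` record, the `check_acknowledged_writes` function, and the history format are all assumptions made for this example. The idea is to record every attempted write during the fault-injection phase, read everything back after the cluster heals, and compare.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteOp:
    """One attempted write recorded by a test client (hypothetical format)."""
    doc_id: str
    acknowledged: bool  # True if the client received a success response


def check_acknowledged_writes(history, final_read):
    """Compare attempted writes against the documents found after recovery."""
    attempted = {op.doc_id for op in history}
    acked = {op.doc_id for op in history if op.acknowledged}
    present = set(final_read)

    lost = acked - present                      # acknowledged but missing
    unexpected = present - attempted            # never written by the test
    recovered = (present & attempted) - acked   # unacknowledged but present

    return {
        "valid": not lost and not unexpected,
        "lost": sorted(lost),
        "unexpected": sorted(unexpected),
        "recovered": sorted(recovered),
    }


# Example: "b" was acknowledged but is missing after recovery.
history = [WriteOp("a", True), WriteOp("b", True), WriteOp("c", False)]
result = check_acknowledged_writes(history, final_read=["a", "c"])
```

In this sketch an unacknowledged write that turns up later ("recovered") is not treated as a failure, since a timed-out request may still have been applied; only acknowledged writes that go missing make the check invalid.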
