
[Feature Request] Improve OpenSearch Cluster Manager Metadata Handling with RocksDB #17337

Open · Vikasht34 opened this issue Feb 12, 2025 · 1 comment
Labels: Cluster Manager, enhancement, Roadmap:Modular Architecture

Comments

@Vikasht34

Is your feature request related to a problem? Please describe

Problem

Right now, OpenSearch stores all metadata (indexes, aliases, templates, etc.) in ClusterState, which has some problems:

  1. Too Tightly Coupled

    • Every metadata update (create/update/delete) blocks ClusterState updates.
    • This slows things down, especially in large clusters with frequent metadata changes.
  2. High Memory Usage

    • The entire metadata set is kept in memory and replicated across all nodes.
    • This increases RAM usage and makes scaling harder.
  3. Slow Recovery After Crash

    • If the Cluster Manager crashes, it has to rebuild all metadata from followers, which takes time.
    • No persistent storage for metadata means recovery is slow.
  4. Not Cloud-Native

    • OpenSearch needs a stateless Cluster Manager for better cloud scaling.
    • Right now, it depends too much on in-memory metadata storage.

Describe the solution you'd like

Proposed Solution: Use RocksDB for Metadata Storage

Instead of keeping metadata only in memory, I propose storing it in RocksDB (a fast, embedded key-value database).
This would decouple metadata storage from ClusterState, making OpenSearch faster and more scalable.

How This Helps

  1. Faster Index Creation & Updates

    • Metadata changes go to RocksDB first instead of blocking ClusterState.
    • RocksDB handles high-speed writes with less memory usage.
  2. Better Memory Efficiency

    • Metadata isn’t stored fully in RAM anymore.
    • RocksDB loads only what’s needed, reducing memory load.
  3. Fast Crash Recovery

    • RocksDB uses a Write-Ahead Log (WAL), so if a Cluster Manager crashes, metadata is not lost.
    • On restart, it replays the WAL and restores unflushed changes (see the sketch after this list).
  4. Supports Stateless Cluster Manager

    • OpenSearch can run without storing all metadata in memory.
    • A new Cluster Manager node can pick up metadata from RocksDB instantly.
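To make the write/recovery path concrete, here is a minimal RocksJava sketch. It is not OpenSearch code; the database path, key, and JSON payload are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

public class MetadataWalSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        // Write a metadata entry with the WAL fsynced, without touching ClusterState.
        try (Options opts = new Options().setCreateIfMissing(true);
             WriteOptions durable = new WriteOptions().setSync(true);
             RocksDB db = RocksDB.open(opts, "/tmp/cluster-metadata")) {
            db.put(durable,
                   "index:my-index".getBytes(StandardCharsets.UTF_8),
                   "{\"shards\":5,\"replicas\":2}".getBytes(StandardCharsets.UTF_8));
        }

        // After a crash, simply reopening the DB replays the WAL and restores any
        // unflushed writes; nothing has to be rebuilt from follower nodes.
        try (Options opts = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(opts, "/tmp/cluster-metadata")) {
            byte[] value = db.get("index:my-index".getBytes(StandardCharsets.UTF_8));
            System.out.println(new String(value, StandardCharsets.UTF_8));
        }
    }
}
```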


Implementation Plan

1️⃣ Store Metadata in RocksDB

  • Each metadata item (index settings, aliases, templates) is stored as a key-value pair in RocksDB.
  • Example format:
    Key: "index:my-index"
    Value: { "shards": 5, "replicas": 2, "aliases": ["logs"] }
    
    Key: "alias:logs"
    Value: { "target": "my-index" }
    
    Key: "template:default"
    Value: { "mappings": { "type": "text" }, "settings": { "refresh_interval": "1s" } }

Related component

Cluster Manager

Describe alternatives you've considered

Evaluating Metadata Storage Solutions for OpenSearch Cluster Manager

Comparison of Metadata Storage Options

| Option | Pros | Cons | Best Use Case |
| --- | --- | --- | --- |
| RocksDB | Fast reads and writes with microsecond latency. Embedded within OpenSearch. Works in on-premise and cloud environments. Provides durability through Write-Ahead Logging (WAL). | Uses local disk, which is not shared across nodes. Requires compaction tuning for optimal performance. | High-speed metadata updates. On-premise or hybrid OpenSearch clusters. |
| DynamoDB (AWS) | Fully managed and highly scalable. No maintenance required. Cloud-native with multi-AZ support. | Higher read and write latency due to network overhead. Limited to AWS, making multi-cloud support difficult. Expensive at a large scale. | AWS-based OpenSearch deployments. Serverless OpenSearch. |
| S3 (AWS) | Durable object storage. Good for backups and long-term metadata storage. | Very slow for real-time operations. Not a key-value store and lacks transactions. Not suitable for live metadata operations. | Storing metadata snapshots instead of live queries. |
| FoundationDB | ACID-compliant with strong consistency. Scales horizontally as a distributed key-value store. | Complex to deploy and manage. Not widely used in OpenSearch. | Distributed metadata storage across multiple OpenSearch nodes. |
| Etcd | Strong consistency with built-in leader election. Reliable and widely used in Kubernetes. | Slower writes at large scale. Requires extra cluster management. | Small-scale cluster metadata coordination. |
| Apache Zookeeper | Strong consistency with leader election. Well-established for coordination tasks. | Slower writes and not optimized for large-scale metadata. Operational complexity for managing nodes. | Metadata coordination for leader election and configuration storage. |
| PostgreSQL | ACID-compliant with transaction support. Scalable through read replicas. | Slower than LSM-based storage like RocksDB. Higher operational cost and complexity. | When metadata requires SQL queries and complex transactions. |
| Raft-based Storage (Custom Solution) | Fully distributed across OpenSearch nodes. Strong consistency and failover support. | Complex to implement and maintain. Requires custom metadata synchronization logic. | Fully decentralized metadata storage without a separate external system. |

Why RocksDB is the Best Fit for OpenSearch

Fast Reads and Writes

RocksDB provides microsecond-level read and write latency because it runs locally within OpenSearch. Other options like DynamoDB and S3 introduce network overhead, increasing response times.

Strong Consistency and Durability

RocksDB ensures strong consistency through Write-Ahead Logging (WAL), allowing metadata changes to be recovered after a crash. DynamoDB is eventually consistent unless configured for strong consistency, which adds cost and complexity. S3 does not provide transactional consistency.

Works in Any Deployment Environment

RocksDB is embedded in OpenSearch and does not rely on external cloud services. It works in on-premise, hybrid, and multi-cloud environments. In contrast, DynamoDB and S3 are AWS-only solutions, making them unsuitable for OpenSearch clusters deployed across multiple cloud providers or on-premise.

Lower Operational Overhead

RocksDB runs inside OpenSearch without requiring external database management. There are no API rate limits, IAM role configurations, or additional networking costs. In comparison, DynamoDB requires fine-tuning capacity settings and monitoring costs, while S3 is not built for real-time metadata storage.

High-Throughput Metadata Operations

OpenSearch needs to perform fast, atomic metadata updates such as index creation and alias management. RocksDB supports millions of writes per second with batch transactions. DynamoDB scales well but incurs additional latency due to network requests. S3 is not suitable for live metadata updates.
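
To illustrate the kind of atomic, batched update this refers to, here is a hedged RocksJava sketch that creates an index entry and its alias in a single WriteBatch, so either both appear or neither does. The keys and values are hypothetical placeholders.

```java
import java.nio.charset.StandardCharsets;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

public class AtomicMetadataUpdateSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options opts = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(opts, "/tmp/cluster-metadata");
             WriteBatch batch = new WriteBatch();
             WriteOptions durable = new WriteOptions().setSync(true)) {

            // Index creation and alias assignment are applied together.
            batch.put("index:logs-2025".getBytes(StandardCharsets.UTF_8),
                      "{\"shards\":3,\"replicas\":1}".getBytes(StandardCharsets.UTF_8));
            batch.put("alias:logs".getBytes(StandardCharsets.UTF_8),
                      "{\"target\":\"logs-2025\"}".getBytes(StandardCharsets.UTF_8));

            // One WAL-synced write commits the whole batch atomically.
            db.write(durable, batch);
        }
    }
}
```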


Would love to gather feedback on this choice before moving forward with a proof of concept.

Additional context

No response

@rajiv-kv
Contributor

@Vikasht34 - Thanks for filing the issue.

ClusterState is persisted to disk as Lucene documents for recovery and is not always reconstructed from follower nodes. Do you also want to consider this in your analysis?

Fast Reads and Writes

Do we need to substantiate this with perf numbers compared to in-memory access? I understand it is better than network access, but it might be slower than fetching from the heap and could impact some of the critical flows.

Decoupling

There is some level of decoupling already in how the in-memory ClusterState is persisted to the store. Please take a look at the PersistedState interface and how it is leveraged for persisting to S3.
