
[Feature Request] Improve OpenSearch Cluster Manager Metadata Handling with RocksDB #17337

Open · Vikasht34 opened this issue Feb 12, 2025 · 1 comment
Labels: Cluster Manager, enhancement, Roadmap:Modular Architecture

Comments

@Vikasht34

Is your feature request related to a problem? Please describe

Problem

Right now, OpenSearch stores all metadata (indexes, aliases, templates, etc.) in ClusterState, which has some problems:

  1. Too Tightly Coupled

    • Every metadata update (create/update/delete) blocks ClusterState updates.
    • This slows things down, especially in large clusters with frequent metadata changes.
  2. High Memory Usage

    • The entire metadata set is kept in memory and replicated across all nodes.
    • This increases RAM usage and makes scaling harder.
  3. Slow Recovery After Crash

    • If the Cluster Manager crashes, it has to rebuild all metadata from followers, which takes time.
    • No persistent storage for metadata means recovery is slow.
  4. Not Cloud-Native

    • OpenSearch needs a stateless Cluster Manager for better cloud scaling.
    • Right now, it depends too much on in-memory metadata storage.

Describe the solution you'd like

Proposed Solution: Use RocksDB for Metadata Storage

Instead of keeping metadata only in memory, I propose storing it in RocksDB (a fast, embedded key-value database).
This would decouple metadata storage from ClusterState, making OpenSearch faster and more scalable.

How This Helps

  1. Faster Index Creation & Updates

    • Metadata changes go to RocksDB first instead of blocking ClusterState.
    • RocksDB handles high-speed writes with less memory usage.
  2. Better Memory Efficiency

    • Metadata isn’t stored fully in RAM anymore.
    • RocksDB loads only what’s needed, reducing memory load.
  3. Fast Crash Recovery

    • RocksDB uses a Write-Ahead Log (WAL), so if a Cluster Manager crashes, metadata is not lost.
    • On restart, it replays the WAL and restores unflushed changes (see the sketch after this list).
  4. Supports Stateless Cluster Manager

    • OpenSearch can run without storing all metadata in memory.
    • A new Cluster Manager node can pick up metadata from RocksDB instantly.
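To make the write/recovery path concrete, here is a minimal RocksJava sketch. It is not OpenSearch code; the database path, key, and JSON payload are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

public class MetadataWalSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        // Write a metadata entry with the WAL fsynced, without touching ClusterState.
        try (Options opts = new Options().setCreateIfMissing(true);
             WriteOptions durable = new WriteOptions().setSync(true);
             RocksDB db = RocksDB.open(opts, "/tmp/cluster-metadata")) {
            db.put(durable,
                   "index:my-index".getBytes(StandardCharsets.UTF_8),
                   "{\"shards\":5,\"replicas\":2}".getBytes(StandardCharsets.UTF_8));
        }

        // After a crash, simply reopening the DB replays the WAL and restores any
        // unflushed writes; nothing has to be rebuilt from follower nodes.
        try (Options opts = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(opts, "/tmp/cluster-metadata")) {
            byte[] value = db.get("index:my-index".getBytes(StandardCharsets.UTF_8));
            System.out.println(new String(value, StandardCharsets.UTF_8));
        }
    }
}
```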


Implementation Plan

1️⃣ Store Metadata in RocksDB

  • Each metadata item (index settings, aliases, templates) is stored as a key-value pair in RocksDB.
  • Example format:
    Key: "index:my-index"
    Value: { "shards": 5, "replicas": 2, "aliases": ["logs"] }
    
    Key: "alias:logs"
    Value: { "target": "my-index" }
    
    Key: "template:default"
    Value: { "mappings": { "type": "text" }, "settings": { "refresh_interval": "1s" } }

Related component

Cluster Manager

Describe alternatives you've considered

Evaluating Metadata Storage Solutions for OpenSearch Cluster Manager

Comparison of Metadata Storage Options

| Option | Pros | Cons | Best Use Case |
| --- | --- | --- | --- |
| RocksDB | Fast reads and writes with microsecond latency. Embedded within OpenSearch. Works in on-premise and cloud environments. Provides durability through Write-Ahead Logging (WAL). | Uses local disk, which is not shared across nodes. Requires compaction tuning for optimal performance. | High-speed metadata updates. On-premise or hybrid OpenSearch clusters. |
| DynamoDB (AWS) | Fully managed and highly scalable. No maintenance required. Cloud-native with multi-AZ support. | Higher read and write latency due to network overhead. Limited to AWS, making multi-cloud support difficult. Expensive at a large scale. | AWS-based OpenSearch deployments. Serverless OpenSearch. |
| S3 (AWS) | Durable object storage. Good for backups and long-term metadata storage. | Very slow for real-time operations. Not a key-value store and lacks transactions. Not suitable for live metadata operations. | Storing metadata snapshots instead of live queries. |
| FoundationDB | ACID-compliant with strong consistency. Scales horizontally as a distributed key-value store. | Complex to deploy and manage. Not widely used in OpenSearch. | Distributed metadata storage across multiple OpenSearch nodes. |
| Etcd | Strong consistency with built-in leader election. Reliable and widely used in Kubernetes. | Slower writes at large scale. Requires extra cluster management. | Small-scale cluster metadata coordination. |
| Apache Zookeeper | Strong consistency with leader election. Well-established for coordination tasks. | Slower writes and not optimized for large-scale metadata. Operational complexity for managing nodes. | Metadata coordination for leader election and configuration storage. |
| PostgreSQL | ACID-compliant with transaction support. Scalable through read replicas. | Slower than LSM-based storage like RocksDB. Higher operational cost and complexity. | When metadata requires SQL queries and complex transactions. |
| Raft-based Storage (Custom Solution) | Fully distributed across OpenSearch nodes. Strong consistency and failover support. | Complex to implement and maintain. Requires custom metadata synchronization logic. | Fully decentralized metadata storage without a separate external system. |

Why RocksDB is the Best Fit for OpenSearch

Fast Reads and Writes

RocksDB provides microsecond-level read and write latency because it runs locally within OpenSearch. Other options like DynamoDB and S3 introduce network overhead, increasing response times.

Strong Consistency and Durability

RocksDB ensures strong consistency through Write-Ahead Logging (WAL), allowing metadata changes to be recovered after a crash. DynamoDB is eventually consistent unless configured for strong consistency, which adds cost and complexity. S3 does not provide transactional consistency.

Works in Any Deployment Environment

RocksDB is embedded in OpenSearch and does not rely on external cloud services. It works in on-premise, hybrid, and multi-cloud environments. In contrast, DynamoDB and S3 are AWS-only solutions, making them unsuitable for OpenSearch clusters deployed across multiple cloud providers or on-premise.

Lower Operational Overhead

RocksDB runs inside OpenSearch without requiring external database management. There are no API rate limits, IAM role configurations, or additional networking costs. In comparison, DynamoDB requires fine-tuning capacity settings and monitoring costs, while S3 is not built for real-time metadata storage.

High-Throughput Metadata Operations

OpenSearch needs to perform fast, atomic metadata updates such as index creation and alias management. RocksDB supports millions of writes per second with batch transactions. DynamoDB scales well but incurs additional latency due to network requests. S3 is not suitable for live metadata updates.
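
To illustrate the kind of atomic, batched update this refers to, here is a hedged RocksJava sketch that creates an index entry and its alias in a single WriteBatch, so either both appear or neither does. The keys and values are hypothetical placeholders.

```java
import java.nio.charset.StandardCharsets;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

public class AtomicMetadataUpdateSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options opts = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(opts, "/tmp/cluster-metadata");
             WriteBatch batch = new WriteBatch();
             WriteOptions durable = new WriteOptions().setSync(true)) {

            // Index creation and alias assignment are applied together.
            batch.put("index:logs-2025".getBytes(StandardCharsets.UTF_8),
                      "{\"shards\":3,\"replicas\":1}".getBytes(StandardCharsets.UTF_8));
            batch.put("alias:logs".getBytes(StandardCharsets.UTF_8),
                      "{\"target\":\"logs-2025\"}".getBytes(StandardCharsets.UTF_8));

            // One WAL-synced write commits the whole batch atomically.
            db.write(durable, batch);
        }
    }
}
```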


Would love to gather feedback on this choice before moving forward with a proof of concept.

Additional context

No response

@rajiv-kv
Contributor

@Vikasht34 - Thanks for filing the issue.

ClusterState is persisted to disk as Lucene documents for recovery and is not always reconstructed from follower nodes. Do you also want to consider this in your analysis?

Fast Reads and Writes

Do we need to substantiate this with perf numbers compared to in-memory access? I understand it is better than network access, but it might be slower than fetching from the heap and could impact some of the critical flows.

Decoupling

There is some level of decoupling already in how the in-memory ClusterState is persisted to the store. Please take a look at the PersistedState interface and how it is leveraged for persisting to S3.
