[Feature Request] Move Lucene Vector field and HNSW KNN Search as a first class feature in core #17338

Open
sam-herman opened this issue Feb 12, 2025 · 5 comments
Labels: enhancement, Performance, Search

Comments

@sam-herman
Contributor

sam-herman commented Feb 12, 2025

Is your feature request related to a problem? Please describe

I can't move this issue between projects, so I'm copy-pasting this great suggestion by @nknize and adding some context on the issues I'm seeing while attempting to integrate jVector (opensearch-project/k-NN#2386).

Is your feature request related to a problem?
Core OpenSearch does not support vector types as a first-class field. The correlation engine has a CorrelationVectorFieldMapper that uses Lucene's KnnFloatVectorField, but this lives in the events-correlation-engine plugin. We could move that field mapper to the core library, but we don't want to fragment across different vector field implementations. So why not make the Lucene HNSW-backed vector field and KNN search a first-class feature in a core library?

What solution would you like?
A discussion around making the vector field type a first-class citizen in core. We've discussed this before in person, but I don't know if we have an issue for it. I don't see a reason not to have Lucene vector fields and HNSW-backed KNN search as a core feature, with the OpenSearch k-NN plugin as an optional accelerator offering alternative native options like FAISS or nmslib.

What alternatives have you considered?
Leave as is if there is a compelling reason to keep this base Lucene capability integration in a separate downstream plugin.

Do you have any additional context?
We were trying to extend the k-NN plugin to support the jVector engine and ran into several issues with the existing approach that convinced us core would be a better fit for vector types and vector search going forward.
The issues are as follows:

  1. Significant complexity and maintainability issues - These stem primarily from the decision to delegate index/search functionality wholesale to native libraries, which causes several problems:
    a. The plugin's build and interfaces are quite complex and often break. This is primarily because some native libraries are pulled in as source code dependencies in ways that were not well thought out. Some versions are also not backwards compatible.
    b. Native memory - native memory makes it difficult to track and analyze performance issues. JVM analysis tools have a hard time detecting such issues, and not every user wants to deal with them.
  2. Maintainers' choice of engines - The KNN plugin maintainers have a clear preference for some engines (e.g. NmsLib, Faiss), while others from other organizations prefer JVM-based engines (e.g. Lucene, jVector). The KNN plugin has become the gatekeeper of which engines can be included in OpenSearch, which is not aligned with making the project easily extensible. At the moment, every new engine extension outside the plugin would have to copy the mapping/query logic, which will result in divergence.

The above proposal should make adding new extensions to OpenSearch easier and less contentious, and satisfy different community needs such as:

  1. Native vs. non-native engines
  2. More or less agile development, depending on the specific requirements of individual engines (e.g. whether many local prerequisites must be installed)
  3. More engine diversity without redundant logic
@sam-herman sam-herman added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 12, 2025
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Feb 12, 2025
@sandeshkr419
Contributor

@navneet1v @kotwanikunal What do you think about this?

@sandeshkr419 sandeshkr419 added Performance This is for any performance related enhancements or bugs and removed untriaged labels Feb 12, 2025
@navneet1v
Contributor

adding @vamshin

@jmazanec15
Member

A discussion around making the vector field type a first-class citizen in core. We've discussed this before in person, but I don't know if we have an issue for it. I don't see a reason not to have Lucene vector fields and HNSW-backed KNN search as a core feature, with the OpenSearch k-NN plugin as an optional accelerator offering alternative native options like FAISS or nmslib.

I agree with this generally. IMO:

  1. "knn" query and "knn_vector" field should be moved to core
  2. Engine extension points should be added to core
  3. k-NN plugin should transform into "vector-search-accelerator", implementing custom engine extension points (i.e. faiss)
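
To make the three points above concrete, here is a minimal sketch of what a core-owned engine extension point could look like. Everything in it is an assumption for illustration: `VectorEngine`, `VectorEngineRegistry`, `LuceneEngine`, and `FaissEngine` are hypothetical names, not existing OpenSearch or k-NN plugin APIs, and the supported method names are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical extension point that core would expose (not an existing OpenSearch API).
interface VectorEngine {
    String name();                          // value a mapping would use to select the engine
    boolean supportsMethod(String method);  // e.g. "hnsw"
}

// Default JVM engine that core itself would ship in this sketch.
final class LuceneEngine implements VectorEngine {
    public String name() { return "lucene"; }
    public boolean supportsMethod(String method) { return "hnsw".equals(method); }
}

// Engine that a separate accelerator plugin might contribute (illustrative only).
final class FaissEngine implements VectorEngine {
    public String name() { return "faiss"; }
    public boolean supportsMethod(String method) {
        return "hnsw".equals(method) || "ivf".equals(method);
    }
}

// Core-owned registry: the field mapper and query would resolve engines here,
// so plugins add engines without copying mapping/query logic.
final class VectorEngineRegistry {
    private final Map<String, VectorEngine> engines = new LinkedHashMap<>();

    VectorEngineRegistry() {
        register(new LuceneEngine()); // core always provides the default engine
    }

    void register(VectorEngine engine) {
        if (engines.putIfAbsent(engine.name(), engine) != null) {
            throw new IllegalArgumentException("engine already registered: " + engine.name());
        }
    }

    VectorEngine resolve(String requested) {
        String key = requested == null ? "lucene" : requested; // fall back to the default
        VectorEngine engine = engines.get(key);
        if (engine == null) {
            throw new IllegalArgumentException("unknown engine: " + key);
        }
        return engine;
    }
}
```

In this shape, an accelerator plugin would only call `register(new FaissEngine())` at startup, while the field and query definitions stay in one place.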

However, I think that there should not be multiple "vector" field types in the distribution - i.e. vector vs knn_vector. From a user perspective, given all of the docs, blogs, clients, and existing users, this would present a confusing and incohesive experience. As an analogy, having a fp32 field and a float field would similarly be confusing. Therefore, the field type and query type should be consistent (and backwards compatible) with the existing field type, knn_vector, and query type, knn. This is where it gets tricky. Are you proposing that there should be a new, different field type?

Maintainers' choice of engines - The KNN plugin maintainers have a clear preference for some engines (e.g. NmsLib, Faiss), while others from other organizations prefer JVM-based engines (e.g. Lucene, jVector). The KNN plugin has become the gatekeeper of which engines can be included in OpenSearch, which is not aligned with making the project easily extensible. At the moment, every new engine extension outside the plugin would have to copy the mapping/query logic, which will result in divergence.

I don't think this is completely accurate. We deprecated nmslib in 2.19, and we support Lucene. It's more a hesitation around the maintenance burden another engine would pose, which can be heavy. But I do think it'd be good to open up for custom implementations.

@andrross
Member

I agree with this generally. IMO:

  1. "knn" query and "knn_vector" field should be moved to core
  2. Engine extension points should be added to core
  3. k-NN plugin should transform into "vector-search-accelerator", implementing custom engine extension points (i.e. faiss)

This seems like a reasonable architecture to move towards to me as well, but I admit I don't know specifically what it would take to accomplish it. @sam-herman Have you scoped out the effort or any high level plan on what it would take to get this done?

Also, hardly the biggest issue here, but if we could get rid of the need to specify "knn": true in the index settings when adding a vector field that would be great!
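
For readers unfamiliar with the setting, the sketch below builds the index-creation body a k-NN index needs today, with `"knn": true` under the index settings alongside a `knn_vector` mapping. The helper class `KnnIndexBody`, the field name, and the dimension are placeholders chosen for illustration. The variant without the setting is an assumption about what the proposal would allow, not current behavior.

```java
// Builds the JSON body for an index-creation request (PUT /<index>).
// KnnIndexBody is a hypothetical helper, not part of any OpenSearch client.
final class KnnIndexBody {
    static String create(String field, int dimension, boolean includeKnnSetting) {
        // Today the k-NN plugin expects this index-level setting next to the mapping.
        String settings = includeKnnSetting
            ? "\"settings\": {\"index\": {\"knn\": true}}, "
            : "";
        return "{" + settings
            + "\"mappings\": {\"properties\": {\"" + field + "\": {"
            + "\"type\": \"knn_vector\", \"dimension\": " + dimension + ", "
            + "\"method\": {\"name\": \"hnsw\", \"engine\": \"lucene\"}"
            + "}}}}";
    }
}
```

`KnnIndexBody.create("my_vector", 3, true)` yields the body as it is written today; passing `false` yields the shorter form that dropping the setting would permit.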

@sam-herman
Contributor Author

I agree with this generally. IMO:

  1. "knn" query and "knn_vector" field should be moved to core
  2. Engine extension points should be added to core
  3. k-NN plugin should transform into "vector-search-accelerator", implementing custom engine extension points (i.e. faiss)

This seems like a reasonable architecture to move towards to me as well, but I admit I don't know specifically what it would take to accomplish it. @sam-herman Have you scoped out the effort or any high level plan on what it would take to get this done?

Also, hardly the biggest issue here, but if we could get rid of the need to specify "knn": true in the index settings when adding a vector field that would be great!

@andrross agree that getting rid of "knn":true would be great! I should have a detailed design suggestion coming sometime in the next couple of weeks, as I am prototyping right now what it could look like.
For the most part, though, a new vector plugin is really just a Lucene format extension of PerFieldKnnVectorsFormat, so it shouldn't be too bad. I'll try to post something for review soon.
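
As a rough illustration of the per-field format idea, here is a sketch in plain Java with no Lucene dependency. In Lucene itself the corresponding hook is `PerFieldKnnVectorsFormat.getKnnVectorsFormatForField(String field)`. The types below (`VectorsFormat`, `NamedFormat`, `PerFieldVectorsFormat`, `EngineAwareFormat`) are hypothetical stand-ins that only mirror its shape, and the format names are placeholders.

```java
import java.util.Map;

// Stand-in for Lucene's KnnVectorsFormat (hypothetical, kept minimal for the sketch).
interface VectorsFormat {
    String name();
}

final class NamedFormat implements VectorsFormat {
    private final String name;
    NamedFormat(String name) { this.name = name; }
    public String name() { return name; }
}

// Mirrors the shape of a per-field format: one abstract hook that picks a format per field.
abstract class PerFieldVectorsFormat {
    abstract VectorsFormat getFormatForField(String field);
}

// What an engine plugin might supply: route its own fields to its own format
// and leave every other field on the default format.
final class EngineAwareFormat extends PerFieldVectorsFormat {
    private final Map<String, String> engineByField; // assumed to come from the field mappings
    private final VectorsFormat defaultFormat = new NamedFormat("lucene-hnsw");
    private final VectorsFormat jvectorFormat = new NamedFormat("jvector");

    EngineAwareFormat(Map<String, String> engineByField) {
        this.engineByField = engineByField;
    }

    @Override
    VectorsFormat getFormatForField(String field) {
        return "jvector".equals(engineByField.get(field)) ? jvectorFormat : defaultFormat;
    }
}
```

The point of the design is that the engine-specific part shrinks to one lookup per field, while mapping and query logic can stay shared.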
