another possible issue to address #2130
Comments
Update 1: Using the BGE model on the same set of documents, it completed successfully. However, my system memory maxed out at 128 GB, which, again, never happened before, so I can't draw a firm conclusion; note that I also upgraded langchain-related libraries (e.g. HuggingFaceEmbeddings and tiledb.py) at the same time. Please keep this open if you don't mind and I'll try to give updates. Thanks. Just to keep a record for myself for troubleshooting:
Personal note:
My analysis of Langchain's tiledb.py (work in progress)

The vector_type cannot be directly specified when using from_texts. The parameter isn't exposed in the from_texts method signature, and looking at the code flow:
from_texts first converts the embeddings:

```python
input_vectors = np.array(embeddings).astype(np.float32)
```

and then uses that array's dtype when calling create:

```python
vector_type=input_vectors.dtype
```

Looking at both code pieces together, there's an important interaction to note. In your script, you control the dtype of the embedding model using torch_dtype, which can indeed be bfloat16 or float16. This means that even if your embedding model is producing vectors in bfloat16 or float16, they are converted to float32 before being stored in the TileDB database. So while you may save memory and potentially get better performance during the embedding computation by using lower-precision dtypes in your models, the final vectors stored in the database will always be float32.

You make some excellent points that expose a potential inconsistency in the code. The create method does indeed expose vector_type as a parameter, suggesting the underlying TileDB library can handle different vector types:

```python
input_vectors = np.array(embeddings).astype(np.float32)
cls.create(
    ...
    vector_type=input_vectors.dtype,
    ...
)
```

This suggests one of three possibilities: this could be a limitation or oversight in the langchain implementation; they might be unnecessarily converting to float32 when the underlying TileDB library could actually handle other types.
For some strange reason, despite our conversation and your statements that the langchain source code, as currently written, uses float32, I received this error when trying to create a vector database with my script:

```
Traceback (most recent call last):
[partially omitted for brevity]...
  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\Lib\site-packages\langchain_community\vectorstores\tiledb.py", line 726, in from_texts
    return cls.__from(
           ^^^^^^^^^^^
  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\Lib\site-packages\langchain_community\vectorstores\tiledb.py", line 559, in __from
    input_vectors = np.array(embeddings).astype(np.float32)
                    ^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 24.5 GiB for an array with shape (3214248, 1024) and data type float64
```

It mentions "float64" for some unexplained reason, but perhaps you can explain it based on all of the source code and my script that we've been analyzing?

From Claude 3.5 Sonnet: This is an interesting situation! The error occurs because of how numpy handles array creation when not explicitly given a dtype. embeddings is a list of lists containing Python floats
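To make that explanation concrete, here is a small sketch with toy values (assumed for illustration, not the real embeddings) showing why the allocation is reported as float64 even though the stored vectors end up float32:

```python
import numpy as np

# Toy stand-in for `embeddings`: a list of lists of Python floats,
# which is what reaches langchain's tiledb.py.
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# np.array() infers float64 for Python floats, so the full-size
# intermediate array is allocated at double precision first...
intermediate = np.array(embeddings)
print(intermediate.dtype)  # float64

# ...and only then downcast, which is why the MemoryError above reports
# float64 even though the final vectors are float32.
input_vectors = intermediate.astype(np.float32)
print(input_vectors.dtype)  # float32

# One way to avoid the double-precision intermediate (an alternative,
# not what the current langchain code does):
input_vectors = np.asarray(embeddings, dtype=np.float32)
```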
If someone could do a sanity check for me please, I'd appreciate it. I'm not an expert in these matters and it's just a hobby, but my intuition tells me that the issue has something to do with the #2076 change, which resolved #2128, which mentioned that there was some kind of issue with uint64? Sorry, this isn't my expertise, but hopefully it helps. I'm willing to spend more time on this but would like someone to double-check that I'm not completely off base here so I don't waste time. Claude's conclusion is within the pulldown section of my prior post named "My analysis of Langchain's tiledb.py (work in progress)". The pull request #2076 I mentioned was the first one after
Hello, if possible, it would be helpful to have a controlled test using the same model and input script, changing only
and share the output in both cases. That may give us enough to debug. Also, if you have a minimal example of the steps to reproduce this issue, we'll try to take a look (depending on the complexity of the setup, and RAM requirements).
I added the line you suggested to langchain's tiledb.py script and it gave me an error, in relevant part...
The AI says it's because "embeddings is a Python list, which does not have attributes like shape, nbytes, or dtype".
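In case it helps, here is a rough sketch (a helper of my own invention, not the line that was suggested) of a diagnostic that works on a plain Python list of lists, since such a list has no shape, nbytes, or dtype attributes:

```python
import numpy as np

# Hypothetical diagnostic: report what dtype numpy would infer for the
# embeddings and roughly how large the intermediate array would be,
# without allocating the full array.
def report_embedding_stats(embeddings):
    n = len(embeddings)
    dim = len(embeddings[0]) if n else 0
    # Inspect a single row only; the full Python list has no .dtype of its own.
    inferred = np.asarray(embeddings[0]).dtype if n else None
    est_gib = (n * dim * inferred.itemsize) / 2**30 if n else 0.0
    print(f"rows={n} dim={dim} inferred_dtype={inferred} "
          f"estimated intermediate size ~ {est_gib:.1f} GiB")

# Example: 3 vectors of dimension 4, Python floats -> float64 inferred.
report_embedding_stats([[0.1] * 4 for _ in range(3)])
```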
Tested with
Is the core problem regarding outputting embeddings in bfloat16 due to numpy's limitation? I realize that this thread started with the strange "float64" error, but if you could confirm that this is a limitation of numpy itself, that'd negate further troubleshooting I was doing regarding why the embeddings have a dtype of float32... Here is a summary of my research that I'm asking you to confirm: while many modern transformer models can compute in BFLOAT16 internally, this precision is lost due to NumPy's lack of native BFLOAT16 support.
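For what it's worth, a quick sketch of the NumPy limitation being described, using a toy tensor rather than the actual embedding model:

```python
import torch

# Toy stand-in for a model output tensor computed in bfloat16.
emb = torch.randn(2, 4, dtype=torch.bfloat16)

try:
    emb.numpy()  # NumPy has no native bfloat16 dtype
except TypeError as e:
    print(e)  # e.g. "Got unsupported ScalarType BFloat16"

# The usual workaround is to upcast before crossing into NumPy, which is why
# the vectors end up float32 (or float64) by the time tiledb.py sees them.
arr = emb.to(torch.float32).numpy()
print(arr.dtype)  # float32
```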
That should be fixed with 0.33.2.
The other thing to test in the single-file case would be
Overall, what we need is to determine whether there is a difference in behavior when varying that setting. We'll run some tests with tiledb-vector-search using an updated tiledb-py and the same langchain version, and see if there are any problems we can isolate.
Thanks, I'll also try testing the exact same 100,000 files as before; it just takes a lot of time. I have no explanation as to why the single file wouldn't present the "float64" problem (even when I was using a ...). Can you please respond to my other hypothesis regarding dtypes, numpy, etc. so I don't waste time? This is more of a courtesy than strictly troubleshooting the "float64" issue.
Here's a summary:
Does that succinctly summarize the current state of affairs? Is it possible to at least modify tiledb.py to formally support ...?
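If the ask is essentially to let callers choose the stored precision, here is a hypothetical numpy-level sketch (made-up function and parameter names, not a patch to langchain or tiledb-vector-search):

```python
import numpy as np

# Hypothetical sketch only: a caller-supplied dtype instead of the
# hard-coded float32 conversion.
def build_vectors(embeddings, vector_type=np.float32):
    # Single-step conversion also avoids the float64 intermediate.
    return np.asarray(embeddings, dtype=vector_type)

vecs = build_vectors([[0.1, 0.2], [0.3, 0.4]], vector_type=np.float16)
print(vecs.dtype)  # float16

# Note: bfloat16 would still be out of reach here, since NumPy itself has
# no native bfloat16 dtype (as discussed above).
```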
Possibly related to #2127 but I'm not sure because I'm not an expert in these things.
Here is the error:
Same as the last "issue," I was using the scripts located in my repo, specifically database_interactions.py, located here: https://github.com/BBC-Esq/VectorDB-Plugin-for-LM-Studio/blob/main/src/database_interactions.py

This time I was using a model that I recently added to my program: https://huggingface.co/dunzhang/stella_en_1.5B_v5

To rule out it being solely attributable to this new embedding model, I'm vectorizing the same exact files for my db using the same exact dtype (bfloat16) I used with the stella model, and will update and/or delete this issue as appropriate after that test is done. I'll use this model: https://huggingface.co/BAAI/bge-large-en-v1.5
If I get the same error with the BGE model (which unequivocally worked before) it seems like it might actually be an issue with tiledb...
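For reference, a hedged sketch of the kind of bfloat16 setup being described (illustrative only, not the actual database_interactions.py; the CLS-pooling step is an assumption):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load an embedding model in bfloat16.
name = "BAAI/bge-large-en-v1.5"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, torch_dtype=torch.bfloat16)

batch = tok(["example sentence"], return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    out = model(**batch).last_hidden_state[:, 0]  # CLS pooling (assumption)
print(out.dtype)  # torch.bfloat16, until it is converted for numpy/TileDB
```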