another possible issue to address #2130

Open · BBC-Esq opened this issue Jan 4, 2025 · 9 comments

BBC-Esq commented Jan 4, 2025

Possibly related to #2127 but I'm not sure because I'm not an expert in these things.

Here is the error:

Traceback (most recent call last):
  File "C:\Users\Airflow\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 314, in _bootstrap
    self.run()
  File "C:\Users\Airflow\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\database_interactions.py", line 29, in create_vector_db_in_process
    create_vector_db.run()
  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\database_interactions.py", line 420, in run
    self.create_database(texts, embeddings)
  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\database_interactions.py", line 268, in create_database
    db = TileDB.from_texts(
         ^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\Lib\site-packages\langchain_community\vectorstores\tiledb.py", line 726, in from_texts
    return cls.__from(
           ^^^^^^^^^^^
  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\Lib\site-packages\langchain_community\vectorstores\tiledb.py", line 559, in from
    input_vectors = np.array(embeddings).astype(np.float32)
                    ^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 24.5 GiB for an array with shape (3214248, 1024) and data type float64

As with the last "issue," I was using the scripts located in my repo, specifically database_interactions.py, located here: https://github.com/BBC-Esq/VectorDB-Plugin-for-LM-Studio/blob/main/src/database_interactions.py

This time I was using a model that I recently added to my program:

https://huggingface.co/dunzhang/stella_en_1.5B_v5

To rule out the error being solely attributable to this new embedding model, I'm vectorizing the same exact files for my db using the same exact dtype (bfloat16) that I used with the stella model, and will update and/or delete this issue as appropriate after that test is done. I'll use this model:

https://huggingface.co/BAAI/bge-large-en-v1.5

If I get the same error with the BGE model (which unequivocally worked before) it seems like it might actually be an issue with tiledb...

BBC-Esq (Author) commented Jan 4, 2025

Update 1:

Using the BGE model on the same set of documents, it completed successfully. However, my system memory was maxed out at 128 GB, which, again, never happened before. Thus, I cannot conclude that tiledb is NOT the issue. It's possible that I only received the error with the "stella" model because it uses relatively more memory than the "bge" model; both models suffered from excessive memory use compared to before upgrading the tiledb library, and perhaps it's just that "little extra" from Stella that caused the error.

I also upgraded the langchain-related libraries (e.g. HuggingFaceEmbeddings and tiledb.py), tiledb-vector-search, etc. at the same time, so I'll try to narrow it down.

Please keep this open if you don't mind and I'll try to give updates. Thanks.

Just to keep a record for myself for troubleshooting:

  • Working dependencies with no excessive memory use:
    • langchain-community==0.2.17
    • langchain-core==0.2.43
    • langchain-huggingface==0.0.3
    • langchain-text-splitters==0.2.4
    • langsmith==0.1.125
    • langdetect==1.0.9
    • langchain==0.2.17
    • tiledb==0.32.5
    • tiledb-cloud==0.12.31 (requires cloudpickle <3)
    • tiledb-vector-search==0.10.3
  • Excessive memory usage as described in immediately preceding post:
    • langchain==0.3.13
    • langchain-community==0.3.13
    • langchain-core==0.3.28
    • langchain-huggingface==0.1.2
    • langchain-text-splitters==0.3.4
    • langdetect==1.0.9
    • langsmith==0.2.7
    • tiledb==0.33.0
    • tiledb-cloud==0.13.0 (works with cloudpickle 3+)
    • tiledb-vector-search==0.11.0

Personal note:

My analysis of Langchain's tiledb.py (work in progress)

The vector_type cannot be directly specified when using from_texts. The parameter isn't exposed in the from_texts method signature, and looking at the code flow:

  • from_texts gets the embeddings
  • Passes them to __from
  • __from converts the embeddings to a numpy array and forces it to float32:
input_vectors = np.array(embeddings).astype(np.float32)

Then uses that array's dtype when calling create:

vector_type=input_vectors.dtype

Looking at both code pieces together, there's an important interaction to note:

In your script, you control the dtype of the embedding models using torch_dtype which can indeed be bfloat16 or float16.
However, in the TileDB code, when __from is called from from_texts, it forcibly converts the embeddings to float32 (the np.array(embeddings).astype(np.float32) line shown above).

This means that even if your embedding model is producing vectors in bfloat16 or float16, they are being converted to float32 before being stored in the TileDB database.

So while you may save memory and potentially get better performance during the embedding computation by using lower precision dtypes in your models, the final vectors stored in the database will always be float32.
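
A minimal sketch of why the half-precision dtype is already gone before numpy even sees the data (an assumption on my part: langchain embedding wrappers typically convert the output tensor to Python lists via .tolist()):

    import torch

    # A bfloat16 tensor standing in for a model's raw embedding output.
    emb = torch.randn(1024, dtype=torch.bfloat16)

    # Converting to a Python list (as embedding wrappers commonly do)
    # yields plain Python floats, which are double precision, so the
    # half-precision dtype is lost before TileDB ever sees the data.
    as_list = emb.tolist()
    print(type(as_list[0]))  # <class 'float'>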

You make some excellent points that expose a potential inconsistency in the code. Looking at both pieces together:

  • The create method does indeed expose vector_type as a parameter, suggesting the underlying TileDB library can handle different vector types.
  • However, the __from method used by from_texts forcibly converts everything to float32 before calling create:

input_vectors = np.array(embeddings).astype(np.float32)
cls.create(
    ...
    vector_type=input_vectors.dtype,
    ...
)

This suggests one of three possibilities:

  • This could be a limitation or oversight in the langchain implementation: they might be unnecessarily converting to float32 when the underlying TileDB library could actually handle other types.
  • The vector_type parameter in create might be vestigial: perhaps earlier versions of TileDB supported different types but now standardize on float32.
  • The langchain implementation might be intentionally restricting vector types to float32 for consistency or compatibility reasons, even though the underlying TileDB library supports more.

For some strange reason, despite our conversation and your statements that the langchain source code, as currently written, uses float32, I received this error when trying to create a vector database with my script:

Traceback (most recent call last):
 
[partially omitted for brevity]...

  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\Lib\site-packages\langchain_community\vectorstores\tiledb.py", line 726, in from_texts
    return cls.__from(
           ^^^^^^^^^^^
  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\Lib\site-packages\langchain_community\vectorstores\tiledb.py", line 559, in from
    input_vectors = np.array(embeddings).astype(np.float32)
                    ^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 24.5 GiB for an array with shape (3214248, 1024) and data type float64

It mentions "float64" for some unexplained reason, but perhaps you can explain it based on all of the source code and my script that we've been analyzing?

From Claude 3.5 Sonnet:

This is an interesting situation! The error occurs because of how numpy handles array creation when not explicitly given a dtype.
When you pass the embeddings list to np.array(), numpy has to infer the dtype. By default, when numpy sees Python floating point numbers (which are double precision), it creates a float64 array. This happens before the .astype(np.float32) call.
So the process is:

  1. embeddings is a list of lists containing Python floats.
  2. np.array(embeddings) tries to create a float64 array first (requiring 24.5 GiB).
  3. Only then would it convert to float32 via .astype(np.float32).
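
The 24.5 GiB figure matches that explanation: 3214248 × 1024 × 8 bytes (float64) ≈ 24.5 GiB, where a float32 array would need half that. A minimal sketch of the inference behavior, and of how passing the dtype up front avoids the transient float64 array:

    import numpy as np

    # Python floats are double precision, so numpy infers float64.
    embeddings = [[0.1, 0.2], [0.3, 0.4]]
    arr = np.array(embeddings)
    print(arr.dtype)  # float64

    # The downcast only happens after the float64 array already exists.
    arr32 = arr.astype(np.float32)
    print(arr32.dtype)  # float32

    # Passing the dtype directly skips the transient float64 allocation.
    direct = np.array(embeddings, dtype=np.float32)
    print(direct.dtype)  # float32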

BBC-Esq (Author) commented Jan 4, 2025

If someone could do a sanity check for me, I'd appreciate it. I'm not an expert in these matters and it's just a hobby, but my intuition tells me that the issue has something to do with the #2076 change, which resolved #2128, which mentioned some kind of issue with uint64?

Sorry, this isn't my area of expertise, but hopefully it helps. I'm willing to spend more time on this, but I'd like someone to double-check that I'm not completely off base so I don't waste time.

Claude's conclusion is within the pulldown section of my prior post named "My analysis of Langchain's tiledb.py (work in progress)"

The pull request #2076 I mentioned was the first one after tiledb version 0.32.5, which I indicated did not seem to suffer from the excessive memory usage, which is why I started my troubleshooting there. Thanks!

ihnorton (Member) commented Jan 5, 2025

Hello, if possible, it would be helpful to have a controlled test using the same model and input script, changing only the tiledb version between 0.32.* and 0.33.2.

  • If you've already done that, could you (if feasible) add a debugging line, print(embeddings.shape, embeddings.nbytes, embeddings.dtype), just above the line

input_vectors = np.array(embeddings).astype(np.float32)

and share the output in both cases? That may give us enough to debug.

Also, if you have a minimal example of the steps to reproduce this issue, we'll try to take a look (depending on the complexity of the setup, and RAM requirements).

BBC-Esq (Author) commented Jan 6, 2025

I added the line you suggested to langchain's tiledb.py script and it gave me an error, in relevant part...

  File "D:\Scripts\VectorDB-Plugin-for-LM-Studio\working\Lib\site-packages\langchain_community\vectorstores\tiledb.py", line 559, in __from
    print(embeddings.shape, embeddings.nbytes, embeddings.dtype) # DEBUG
          ^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'shape'

AI says it's because "embeddings is a Python list, which does not have attributes like shape, nbytes, or dtype"
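
A list-safe variant (a sketch; it inspects the structure without materializing the full array, since calling np.asarray on the whole list would itself trigger the large allocation):

    # Assumes `embeddings` is the list of lists passed into __from.
    print(type(embeddings), len(embeddings))
    print(type(embeddings[0]), len(embeddings[0]), type(embeddings[0][0]))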

BBC-Esq (Author) commented Jan 6, 2025

Tested with tiledb==0.32.2, tiledb-vector-search==0.13.2, tiledb-cloud==0.12.31

Instead of your approach, I added this immediately before it:

        # Check if embeddings is a list
        if isinstance(embeddings, list) and embeddings:
            # Determine if it's a list of lists or a flat list
            is_nested = all(isinstance(item, list) for item in embeddings)
            structure = "List of lists" if is_nested else "Flat list"

            # Analyze the first element
            first_item = embeddings[0]
            if is_nested:
                first_item_dtype = type(first_item[0]) if first_item else None
                try:
                    first_item_shape = np.array(first_item).shape
                except Exception as e:
                    first_item_shape = f"Cannot determine shape: {e}"
            else:
                first_item_dtype = type(first_item)
                first_item_shape = ()

            # Print results
            print("Structure:", structure)
            print("First item dtype:", first_item_dtype)
            print("First item shape:", first_item_shape)
        else:
            print("Embeddings is not a non-empty list.")

This is what it printed when using the BGE model:

Structure: List of lists
First item dtype: <class 'float'>
First item shape: (1024,)

Using the "stella" model printed the same exact thing. PLEASE NOTE, this test was done on a single PDF file, not the huge corpus my prior posts were tested regarding. I can't simply test 100,000k+ court cases each time.

With that being said, I don't understand why it's showing as "float," which I think is float32, isn't it? Both BGE and Stella are being run using bfloat16.

In my database_interactions.py script, the "dtype" variable is defined based on the compute device, whether a user has selected half-precision in my GUI, and the model's native precision: torch_dtype = get_appropriate_dtype(compute_device, use_half, model_native_precision). If a model is natively float16, for example, I'm not using bfloat16 because there's no value in converting float16 to bfloat16 given the data loss. There is value in converting float32 down to bfloat16 or float16, though, where the memory savings generally outweigh the precision loss...

[EDIT - apparently the sentence transformers library itself does not allow the actual embeddings to be float16 or bfloat16, even if the embedding model is of that dtype...there's some kind of conversion going on.]
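
A quick way to check that claim might look like the following sketch (assuming a sentence-transformers version that accepts model_kwargs; the model name is the BGE model from this thread):

    import torch
    from sentence_transformers import SentenceTransformer

    # Load the model weights in bfloat16.
    model = SentenceTransformer(
        "BAAI/bge-large-en-v1.5",
        model_kwargs={"torch_dtype": torch.bfloat16},
    )

    # encode() still returns a float32 numpy array by default,
    # regardless of the dtype the model computed in.
    emb = model.encode(["a test sentence"])
    print(emb.dtype)  # float32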

Anyways, I was unable to reproduce the "float64" error on the single file I processed. Do you really need me to spend the 2+ hours processing the 100,000+ files to get a direct apples-to-apples comparison? If so, it'll take a bit.

My program is relatively easy to install if you want, but it does require Windows and an Nvidia GPU. Let me know if I can be of assistance with installing.

BBC-Esq (Author) commented Jan 6, 2025

Is the core problem with outputting embeddings in bfloat16 due to a numpy limitation? I realize that this thread started with the strange "float64" error, but if you could confirm that this is a limitation of numpy itself, that would spare me the further troubleshooting I was doing into why the embeddings have a dtype of float32...

Here is a summary of my research I'm asking you to confirm...

While many modern transformer models can compute in BFLOAT16 internally, this precision is lost due to NumPy's lack of native BFLOAT16 support.
The ml_dtypes library (https://github.com/jax-ml/ml_dtypes) addresses this limitation by adding BFLOAT16 and other ML-focused dtypes to NumPy. Adding optional support for this library would allow users to preserve BFLOAT16 precision when working with embeddings.
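
If that's right, a minimal sketch of what ml_dtypes enables (assuming ml_dtypes is installed alongside numpy):

    import numpy as np
    import ml_dtypes

    # With ml_dtypes installed, bfloat16 is usable as a numpy dtype,
    # so embeddings could in principle stay at half precision end to end.
    x = np.array([0.1, 0.2, 0.3], dtype=ml_dtypes.bfloat16)
    print(x.dtype)               # bfloat16
    print(x.astype(np.float32))  # widening back to float32 is lossless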

ihnorton (Member) commented Jan 6, 2025

> I realize that this thread started with the strange "float64" error

That should be fixed with 0.33.2.

Using the "stella" model printed the same exact thing. PLEASE NOTE, this test was done on a single PDF file, not the huge corpus my prior posts were tested regarding.

The other thing to test in the single-file case would be tiledb==0.33.2, ensuring that the data type of the embeddings variable (which I believe is returned from tiledb?) is the same.

> Anyways, I was unable to reproduce the "float64" error on the single file I processed. Do you really need me to spend the 2+ hours processing the 100,000+ files to get a direct apples-to-apples comparison? If so, it'll take a bit.

Overall, what we need is to determine whether there is a difference in behavior when varying the tiledb package version, with other variables held constant. The only hypothesis I have right now is that the data type is ending up larger than it should be, leading to the request for too much memory when trying to create the numpy array from embeddings (which I assume is returned from tiledb).

We'll run some tests with tiledb-vector-search using updated tiledb-py and the same langchain version, and see if there are any problems we can isolate.

BBC-Esq (Author) commented Jan 6, 2025

Thanks, I'll also try testing the same exact 100,000+ files as before; it just takes a lot of time. I have no explanation for why the single file didn't present the "float64" problem (even when I was using a tiledb version that had that issue), but perhaps it has something to do with the sheer volume of vectors from the large number of files...

Can you please respond to my other hypothesis regarding dtypes, numpy, etc., so I don't waste time? This is more of a courtesy than strictly troubleshooting the "float64" issue.

BBC-Esq (Author) commented Jan 6, 2025

Here's a summary:

  1. Both langchain and sentence transformers allow a model to be run in float32, bfloat16, and float16.
  2. Sentence Transformers' encode method only supports float32 (and certain quantizations), which results in the embeddings themselves being in float32.
  3. Tiledb.py, langchain's integration with the tiledb library, automatically converts embeddings to float32 here:

np.array([np.array(embedding).astype(np.float32)]).astype(np.float32),

  4. Relatively newer versions of sentence-transformers support int8, uint8, binary, and ubinary. Pull request here

  5. Tiledb seems to support int8 and uint8 but not the other two.

  6. Again, langchain's integration of tiledb within tiledb.py doesn't distinguish and converts everything to float32.

Does that succinctly summarize the current state of affairs? Is it possible to at least modify tiledb.py to formally support int8 and uint8, if not the other two? I noticed that @nikolaos did the initial integration in November 2023. Is he still around at the company by chance? lol.
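
For what it's worth, a hypothetical sketch of what a dtype-aware conversion could look like (this is not langchain's actual API; to_input_vectors and its vector_type parameter are illustrative only):

    import numpy as np

    def to_input_vectors(embeddings, vector_type=np.float32):
        # Build the array directly in the target dtype (e.g. np.int8 or
        # np.uint8 for quantized embeddings) instead of hardcoding
        # float32, also avoiding the transient float64 copy.
        return np.array(embeddings, dtype=vector_type)

    # input_vectors = to_input_vectors(embeddings, vector_type=np.uint8)
    # cls.create(..., vector_type=input_vectors.dtype, ...)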
