Diskann Benchmarking Wrapper #260

Open
wants to merge 161 commits into
base: branch-25.04

tarang-jain
Contributor

@tarang-jain tarang-jain commented Jul 29, 2024

Brings DiskANN into cuvs-bench

  • Build and search in-memory DiskANN index
  • Build and search SSD DiskANN index
  • Build a cuVS Vamana index on the GPU and serialize it in DiskANN format. Search on the CPU using the in-memory DiskANN search API.

@tarang-jain tarang-jain added the feature request and non-breaking labels Jul 29, 2024
@tarang-jain tarang-jain self-assigned this Jul 29, 2024
@github-actions github-actions bot added the Python label Aug 3, 2024
@tarang-jain
Contributor Author

/ok to test

@tarang-jain
Contributor Author

/ok to test

@tarang-jain
Contributor Author

/ok to test

@cjnolet
Member

cjnolet commented Feb 7, 2025

/ok to test

achirkin and others added 3 commits February 7, 2025 14:24
… copy beyond 4B elems (rapidsai#671)

ann-bench keeps data dimensions as `uint32_t`. We use `std::fread` to copy the data from a file to host memory and pass `n_rows * n_cols` as the element count, which is cast to `size_t` only after the multiplication. For datasets larger than 4B elements, the 32-bit product overflows and only part of the data is copied.

This PR fixes the bug by casting the dimensions before the multiplication.
The bug only affects the benchmark cases where the data is requested in the host memory not backed by a file.
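
The overflow can be reproduced in isolation. The following is a minimal sketch, not the ann-bench code: the helper names `element_count_narrow` and `element_count_wide` are hypothetical, and it assumes a platform with a 32-bit `unsigned int` and a 64-bit `size_t`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper showing the buggy pattern: the product is computed in
// 32-bit unsigned arithmetic and widened to size_t only afterwards, so it
// wraps modulo 2^32 before the cast can help.
constexpr std::size_t element_count_narrow(std::uint32_t n_rows, std::uint32_t n_cols)
{
  return static_cast<std::size_t>(n_rows * n_cols);
}

// Hypothetical helper showing the fixed pattern: each dimension is widened
// to size_t first, so the multiplication itself is done in 64 bits
// (assuming a 64-bit size_t).
constexpr std::size_t element_count_wide(std::uint32_t n_rows, std::uint32_t n_cols)
{
  return static_cast<std::size_t>(n_rows) * static_cast<std::size_t>(n_cols);
}
```

With 100,000,000 rows of 96 columns, the true count is 9,600,000,000 elements, while the narrow version yields 9,600,000,000 mod 2^32 = 1,010,065,408, so a read sized from it would stop early. For small datasets the two helpers agree, which is why the bug only shows up past 4B elements.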

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#671
@gforsyth
Contributor

gforsyth commented Feb 7, 2025

The CMakeLists.txt from the upstream DiskANN repo explicitly sets CMAKE_CXX_COMPILER to g++, which bypasses the g++ compiler in the conda environment and, as a result, uses the wrong ld.

These lines should be added to the diff so they can be removed when that file gets patched:
https://github.com/tarang-jain/DiskANN/blob/cmake/CMakeLists.txt#L26-L28

My phrasing above is bad. To be clearer, those lines should be removed.

dantegd and others added 4 commits February 7, 2025 19:16
PR does the following:

- [x] Modifies CI to run pytest and e2e test of cuvs-bench
    - [x] We need to test the additional time needed to run the tests. They should be fast, but if they are not, then we can add an additional job to run them in parallel.
- [x] Adds synthetic test-data generation so the CI jobs don't depend on downloading datasets and users can easily run the tests locally.
    - [ ] A few improvements remain for the docs, YAML, and other things to make this easy for users.
- [x] Checks in some additional pytests that hadn't been checked in before.

Authors:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Micka (https://github.com/lowener)

Approvers:
  - James Lamb (https://github.com/jameslamb)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#574
@tarang-jain
Contributor Author

/ok to test

@cjnolet cjnolet changed the base branch from branch-25.02 to branch-25.04 February 8, 2025 02:46
@cjnolet cjnolet requested review from a team as code owners February 8, 2025 02:46
@github-actions github-actions bot added the ci label Feb 8, 2025
@cjnolet
Member

cjnolet commented Feb 8, 2025

/ok to test

Labels
benchmarking, ci, CMake, cpp, feature request, non-breaking, Python
Projects
Status: In Progress