Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use ASCII char and special container for sequences #109

Merged
merged 1 commit into from
Jan 17, 2025

Conversation

ivan-aksamentov
Copy link
Member

This introduces AsciiChar and Seq classes which represent an ASCII-character and ASCII-string. It's a u8 and a Vec underneath, so it combines useful methods which are string-like and array-like for our application.

The main idea is that because Rust String are unicode and char are 32-bit integers, it's wasteful and slow to use that to represent genetic sequences. We should do better with u8 8-bit charters and with ASCII set assumption, e.g. no more unnecessary code splitting preservation, and we can afford more flexible and more efficient methods.

In reality, compared to the same transition in treetime project, where we observed 2x speedup, this did not introduce any significant speedup in my measurements in pangraph. However, memory occupied by sequences should reduce by a factor of 4.

Both of the new classes could use some more unit tests.

This introduces `AsciiChar` and `Seq` classes which represent an ASCII-character and ASCII-string. It's a `u8` and a `Vec` underneath, so it combines useful methods which are string-like and array-like for our application.

The main idea is that because Rust `String` are unicode and `char` are 32-bit integers, it's wasteful and slow to use that to represent genetic sequences. We should do better with `u8` 8-bit charters and with ASCII set assumption, e.g. no more unnecessary code splitting preservation, and we can afford more flexible and more efficient methods.

In reality, compared to the same transition in treetime project, where we observed 2x speedup, this did not introduce any significant speedup in my measurements in pangraph. However, memory occupied by sequences should reduce by a factor of 4.

Both of the new classes could use some more unit tests.
@mmolari
Copy link
Collaborator

mmolari commented Jan 17, 2025

Thank you for this PR! I like this change a lot, I think a 4-fold reduction in memory is a very nice achievement. It might make sense that it does not reduce time a lot, since most of the time is probably still consumed in the alignment step.

@ivan-aksamentov ivan-aksamentov merged commit a31b2ec into rust Jan 17, 2025
14 checks passed
@ivan-aksamentov ivan-aksamentov deleted the refactor/ascii-seq branch January 17, 2025 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants