Investigate potential x86 varint optimization #279

Open
danburkert opened this issue Feb 14, 2020 · 4 comments
Comments

@danburkert
Collaborator

https://www.reddit.com/r/rust/comments/f36j05/comment/fhhwqp9

@danburkert
Collaborator Author

see also https://github.com/gnzlbg/bitintr for safe and cross platform wrappers over the intrinsics

@as-com

as-com commented Jan 2, 2021

I did some quick-and-dirty prototyping with varint-simd v0.3.0, and here's what I found:

  • Microbenchmark varint performance improves only when encoding and decoding larger numbers
  • Encoding generally benefits more than decoding, and the results depend on where I place the branch that uses varint-simd
  • Macrobenchmark performance is mostly a wash (tested on Coffee Lake), with some larger wins and some smaller losses

This is probably because the only varint encode/decode function in this codebase handles a single u64 at a time, which is currently a weak point for varint-simd: it's not that much faster than other implementations when encoding/decoding tiny u64s.
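For reference, the single-u64 path being discussed looks roughly like the scalar sketch below. This is a generic base-128 varint encoder and decoder written for illustration; the function names are hypothetical, and it is neither this project's code nor varint-simd's. It does show why tiny values are hard to beat: a one-byte value takes a single loop iteration.

```rust
/// Encode `value` as a protobuf-style base-128 varint, appending to `out`.
/// Returns the number of bytes written (1 to 10 for a u64).
/// Hypothetical helper for illustration only.
fn encode_varint_u64(mut value: u64, out: &mut Vec<u8>) -> usize {
    let start = out.len();
    while value >= 0x80 {
        // Emit the low 7 bits with the continuation bit set.
        out.push(((value & 0x7F) as u8) | 0x80);
        value >>= 7;
    }
    // Final byte: continuation bit clear.
    out.push(value as u8);
    out.len() - start
}

/// Decode one varint from the front of `input`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input is truncated or does not fit in 64 bits.
fn decode_varint_u64(input: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    // A u64 needs at most 10 varint bytes (9 * 7 + 1 = 64 bits).
    for (i, &byte) in input.iter().enumerate().take(10) {
        let payload = u64::from(byte & 0x7F);
        // The tenth byte may only contribute the top bit.
        if i == 9 && payload > 1 {
            return None;
        }
        value |= payload << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

Small values exit the loop after one or two iterations, so any SIMD setup cost has little work to amortize against.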

I suspect some larger-scale refactoring will be needed to take full advantage of varint-simd. For example, protobuf tags are at most 32 bits long, so they never need the full u64 path, and a lot of cycles can be saved when encoding/decoding them.

My library also just added support for quickly decoding two, four, or eight adjacent varints in parallel (subject to size limitations), with some really good throughput figures. Most of the time, a protobuf field is a 32-bit tag followed by a 32-bit number or length, and decode requests can be narrowed based on how large the data field is in the .proto file. So there are likely a lot more gains to be had.
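To make the "tag followed by a number or length" pattern concrete, here is a scalar sketch that reads two adjacent varints, each bounded to 32 bits (at most 5 bytes). The names `decode_varint_u32`, `FieldHeader`, and `decode_field_header` are hypothetical, and this sequential version only illustrates the data layout; it is not varint-simd's parallel decoder or its API. The split of the key into a field number (upper bits) and a wire type (low 3 bits) follows the protobuf wire format.

```rust
/// Decode one varint that must fit in 32 bits (at most 5 bytes).
/// Hypothetical helper for illustration only.
fn decode_varint_u32(input: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &byte) in input.iter().enumerate().take(5) {
        let payload = u32::from(byte & 0x7F);
        // The fifth byte may only contribute the top 4 bits.
        if i == 4 && payload > 0x0F {
            return None;
        }
        value |= payload << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// The two adjacent varints at the start of a field.
struct FieldHeader {
    field_number: u32,
    wire_type: u8,
    /// A varint value (wire type 0) or a byte length (wire type 2).
    value_or_length: u32,
    /// Total bytes consumed by both varints.
    consumed: usize,
}

/// Decode a key varint followed immediately by a second 32-bit varint.
fn decode_field_header(input: &[u8]) -> Option<FieldHeader> {
    let (key, key_len) = decode_varint_u32(input)?;
    let (second, second_len) = decode_varint_u32(&input[key_len..])?;
    Some(FieldHeader {
        field_number: key >> 3,
        wire_type: (key & 0x7) as u8,
        value_or_length: second,
        consumed: key_len + second_len,
    })
}
```

Here the second decode cannot start until the first has found its terminating byte. A decoder that handles both varints at once removes that dependency, which is presumably where the extra throughput comes from.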
