avoid string copy in sorensen dice #64

Draft: wants to merge 1 commit into main

maxbachmann
Member

Since we only need to iterate over the bigrams of each string once, we can create them lazily instead of first collecting the characters into a new string. This reduces the size of strsim in the binary by around 7% (4.6KiB to 4.3KiB in the output below). In addition, it reduces runtime in our current benchmark by around 11%.

For reference, in my example binary this gives:

File  .text     Size Crate
6.0%  96.7% 275.8KiB std
0.1%   1.5%   4.3KiB strsim
0.0%   0.0%     124B rf_test
0.0%   0.0%     102B [Unknown]
6.3% 100.0% 285.3KiB .text section size, the file size is 4.5MiB

while previously it was:

File  .text     Size Crate
6.1%  96.6% 276.5KiB std
0.1%   1.6%   4.6KiB strsim
0.0%   0.0%     124B rf_test
0.0%   0.0%     102B [Unknown]
6.3% 100.0% 286.3KiB .text section size, the file size is 4.5MiB

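
To illustrate the idea of producing bigrams lazily, here is a minimal sketch. It is not the code from this PR: the names `non_ws_chars`, `bigrams` and `dice_lazy` are hypothetical, and skipping whitespace as well as the handling of inputs without any bigrams are assumptions made for the example. It zips the filtered character iterator with a copy of itself shifted by one, so no intermediate `String` is allocated.

```rust
use std::collections::HashMap;

// Hypothetical helper: iterate over the characters of `s`, skipping whitespace.
// Nothing is collected; the iterator borrows from `s`.
fn non_ws_chars(s: &str) -> impl Iterator<Item = char> + '_ {
    s.chars().filter(|c| !c.is_whitespace())
}

// Hypothetical helper: lazily yield adjacent character pairs by zipping the
// character iterator with a second one that is shifted by one position.
fn bigrams(s: &str) -> impl Iterator<Item = (char, char)> + '_ {
    non_ws_chars(s).zip(non_ws_chars(s).skip(1))
}

// Sketch of a Sorensen-Dice coefficient on top of the lazy bigram iterator.
// Each input is traversed once for its bigrams; only the bigram counts of `a`
// are stored.
fn dice_lazy(a: &str, b: &str) -> f64 {
    let mut counts: HashMap<(char, char), usize> = HashMap::new();
    let mut total = 0usize; // number of bigrams in `a` plus number in `b`

    for bigram in bigrams(a) {
        *counts.entry(bigram).or_insert(0) += 1;
        total += 1;
    }

    let mut shared = 0usize; // size of the multiset intersection
    for bigram in bigrams(b) {
        total += 1;
        if let Some(n) = counts.get_mut(&bigram) {
            if *n > 0 {
                *n -= 1;
                shared += 1;
            }
        }
    }

    if total == 0 {
        // Assumption for this sketch: inputs without any bigrams score 1.0
        // when their non-whitespace characters are equal and 0.0 otherwise.
        return if non_ws_chars(a).eq(non_ws_chars(b)) { 1.0 } else { 0.0 };
    }

    2.0 * shared as f64 / total as f64
}
```

With these definitions, `dice_lazy("night", "nacht")` shares only the bigram `ht` out of eight bigrams in total, which gives 2 * 1 / 8 = 0.25. The trade-off is that each string's characters are decoded twice (once per zipped iterator) in exchange for skipping the allocation and copy.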
maxbachmann marked this pull request as draft on January 4, 2024 11:57
@maxbachmann
Member Author

It should be possible to improve this quite a bit further by using the same hashmap that is used for the Damerau-Levenshtein implementation. I made a quick experiment which reduced runtime by another 64% while reducing binary size by another 38%. This version was just a quick experiment and doesn't calculate the correct score yet, so it could just be faster + smaller because it's broken 🤷‍♂️
