reimplement damerau levenshtein distance #61

Merged 1 commit on Jan 4, 2024

maxbachmann
Member

This reimplements the Damerau-Levenshtein distance based on the paper "Linear space string correction algorithm using the Damerau-Levenshtein distance" by Chunchun Zhao and Sartaj Sahni. In addition, it uses a custom hashmap that is both smaller and faster for this workload.

This leads to the following improvements:

  • reduces memory usage from O(N*M) to O(N+M)
  • reduces runtime in our own benchmark by more than 70%
  • reduces binary size by more than 25%
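
To illustrate where the O(N+M) memory bound comes from, here is a minimal sketch of a linear-space Damerau-Levenshtein distance in the style of the Zhao and Sahni scheme. It is not the code from this PR: the function name `damerau_levenshtein_linear` is hypothetical, and it uses the standard `HashMap` rather than the custom hashmap. Only three rows of length M+2 are kept instead of the full N*M matrix.

```rust
use std::collections::HashMap;

/// Unrestricted Damerau-Levenshtein distance using three rows of memory.
/// Sketch only: hypothetical name, std HashMap instead of a custom map.
fn damerau_levenshtein_linear(a: &str, b: &str) -> usize {
    let s1: Vec<char> = a.chars().collect();
    let s2: Vec<char> = b.chars().collect();
    let (len1, len2) = (s1.len(), s2.len());
    // Sentinel that is larger than any possible distance.
    let max_val = (len1.max(len2) + 1) as isize;

    // Last row (1-based) in which each character of s1 was seen.
    let mut last_row_id: HashMap<char, isize> = HashMap::new();

    // Every row is stored with an offset of 1, so logical column -1 is index 0.
    let size = len2 + 2;
    let mut fr = vec![max_val; size]; // saved cells for transpositions
    let mut r1 = vec![max_val; size]; // previous row
    let mut r: Vec<isize> = Vec::with_capacity(size); // current row
    r.push(max_val);
    r.extend(0..=(len2 as isize));

    for i in 1..=len1 {
        // After the swap, r1 is row i-1 and r still holds row i-2.
        std::mem::swap(&mut r, &mut r1);
        let mut last_col_id: isize = -1;
        let mut last_i2l1 = r[1]; // cell (i-2, 0)
        r[1] = i as isize;
        let mut t = max_val;

        for j in 1..=len2 {
            let cost = (s1[i - 1] != s2[j - 1]) as isize;
            let diag = r1[j] + cost; // cell (i-1, j-1)
            let left = r[j] + 1; // cell (i, j-1)
            let up = r1[j + 1] + 1; // cell (i-1, j)
            let mut temp = diag.min(left).min(up);

            if s1[i - 1] == s2[j - 1] {
                last_col_id = j as isize; // last column matching s1[i-1]
                fr[j + 1] = r1[j - 1]; // save cell (i-1, j-2)
                t = last_i2l1; // save cell (i-2, j-1)
            } else {
                let k = last_row_id.get(&s2[j - 1]).copied().unwrap_or(-1);
                let l = last_col_id;
                if j as isize - l == 1 {
                    temp = temp.min(fr[j + 1] + (i as isize - k));
                } else if i as isize - k == 1 {
                    temp = temp.min(t + (j as isize - l));
                }
            }

            last_i2l1 = r[j + 1]; // cell (i-2, j) before it is overwritten
            r[j + 1] = temp;
        }
        last_row_id.insert(s1[i - 1], i as isize);
    }

    r[len2 + 1] as usize
}
```

As a sanity check, `damerau_levenshtein_linear("ca", "abc")` evaluates to 2, which is the case that separates the unrestricted distance from the optimal string alignment variant.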

This is based on #57 and only implements the new algorithm for strings, while the generic version continues to use the old implementation. For strings this is better in all regards, while for the generic version it's not as simple:

  • when using the same custom hashmap we would no longer support arbitrary hashable items
  • when using it with the standard hashmap the performance is not drastically better, but the binary size is larger.
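
The trade-off in the first bullet can be sketched as follows. This is a hypothetical map, not the one used in the PR: the name `CharRowMap`, its layout, and the hash constant are all assumptions made for illustration. It shows why a map specialised for `char` keys can be small and fast (direct indexing for low code points, a plain open-addressing table for the rest), and why it cannot accept arbitrary hashable items: the key has to be convertible to an integer.

```rust
/// Marker for "no row stored yet". Stored rows are assumed to be non-negative.
const EMPTY: isize = -1;

/// Hypothetical char-keyed map from a character to its last row index.
struct CharRowMap {
    /// Direct-indexed table for code points below 256.
    small: [isize; 256],
    /// Open-addressing table (linear probing) for all other code points.
    slots: Vec<(u32, isize)>,
    /// Number of occupied entries in `slots`.
    used: usize,
}

impl CharRowMap {
    fn new() -> Self {
        CharRowMap { small: [EMPTY; 256], slots: vec![(0, EMPTY); 8], used: 0 }
    }

    /// Index of the slot holding `key`, or of the empty slot where it belongs.
    fn find(slots: &[(u32, isize)], key: u32) -> usize {
        let mask = slots.len() - 1; // capacity is always a power of two
        let mut i = (key as usize).wrapping_mul(0x9E37_79B9) & mask;
        while slots[i].1 != EMPTY && slots[i].0 != key {
            i = (i + 1) & mask;
        }
        i
    }

    /// Returns the stored row, or EMPTY if the character was never inserted.
    fn get(&self, c: char) -> isize {
        let key = c as u32;
        if key < 256 {
            self.small[key as usize]
        } else {
            self.slots[Self::find(&self.slots, key)].1
        }
    }

    fn insert(&mut self, c: char, row: isize) {
        debug_assert!(row >= 0);
        let key = c as u32;
        if key < 256 {
            self.small[key as usize] = row;
            return;
        }
        let i = Self::find(&self.slots, key);
        if self.slots[i].1 == EMPTY {
            self.used += 1;
        }
        self.slots[i] = (key, row);
        // Keep the load factor below 2/3 so probing always finds a free slot.
        if self.used * 3 >= self.slots.len() * 2 {
            self.grow();
        }
    }

    /// Doubles the table and reinserts every occupied entry.
    fn grow(&mut self) {
        let old = std::mem::take(&mut self.slots);
        let mut new = vec![(0u32, EMPTY); old.len() * 2];
        for (key, row) in old {
            if row != EMPTY {
                let i = Self::find(&new, key);
                new[i] = (key, row);
            }
        }
        self.slots = new;
    }
}
```

A generic `HashMap<T, isize>` needs `T: Hash + Eq` and a general-purpose hasher, which is where the extra binary size in the second bullet would come from.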

Since we want to focus on binary size here, I would just recommend rapidfuzz-rs for users who really care about performance for now. I will probably take another stab at this at some point, but the string version should be more important for users anyway.

For reference here is a binary size report for the string version:
Before the change:

File  .text     Size Crate
5.7%  95.7% 258.0KiB std
0.1%   2.4%   6.5KiB strsim
0.0%   0.0%     119B strsim_test
0.0%   0.0%     102B [Unknown]
5.9% 100.0% 269.6KiB .text section size, the file size is 4.4MiB

After the change:

File  .text     Size Crate
5.6%  96.2% 255.6KiB std
0.1%   1.9%   5.0KiB strsim
0.0%   0.0%     119B strsim_test
0.0%   0.0%     102B [Unknown]
5.9% 100.0% 265.6KiB .text section size, the file size is 4.4MiB

@maxbachmann
Member Author

@dguo this is no longer a breaking change, so we could get this in for this release as well.

Member

@dguo dguo left a comment

Thank you! I don't know of a good way to avoid the changelog merge conflicts besides just dealing with them one at a time as we merge.

@maxbachmann force-pushed the damerau_levenshtein2 branch from c610e9f to 69f49ad on January 4, 2024 07:38
@maxbachmann merged commit 9d0a950 into main on Jan 4, 2024
12 checks passed
@maxbachmann deleted the damerau_levenshtein2 branch on January 4, 2024 07:55