reimplement damerau levenshtein distance #61
Merged
This reimplements the Damerau-Levenshtein distance based on the paper
"Linear space string correction algorithm using the Damerau-Levenshtein distance"
by Chunchun Zhao and Sartaj Sahni. In addition, it uses a custom hashmap which is both smaller and faster for this workload. This leads to the following improvements:
This is based on #57 and only implements the new algorithm for strings, while the generic version continues to use the old implementation. For strings the new algorithm is better in all regards, while for the generic version it's not as simple:
Since we want to focus on binary size here, for now I would just recommend rapidfuzz-rs to users who really care about performance. I will probably take another stab at this at some point, but the string version should be more important for users anyway.
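For readers who have not seen the linear-space approach, here is a minimal sketch of the idea as I understand it from the Zhao and Sahni paper. It is not the code in this PR: the function name `damerau_levenshtein_linear` is hypothetical, and it uses `std::collections::HashMap` as a stand-in for the custom hashmap. Instead of a full matrix it keeps only two DP rows, plus one extra row (`fr`) and a scalar (`t`) that hold the values needed for transpositions.

```rust
use std::collections::HashMap;

/// Sketch of a linear-space (unrestricted) Damerau-Levenshtein distance.
/// Hypothetical helper for illustration; uses std HashMap, not a custom map.
fn damerau_levenshtein_linear(a: &str, b: &str) -> usize {
    let s1: Vec<char> = a.chars().collect();
    let s2: Vec<char> = b.chars().collect();
    let (len1, len2) = (s1.len(), s2.len());
    // Sentinel larger than any possible distance.
    let max_val = (len1.max(len2) + 1) as isize;

    // Last row (1-based) in which each character of `a` was seen; -1 if never.
    let mut last_row_id: HashMap<char, isize> = HashMap::new();

    // All rows are shifted by one so that logical column -1 is index 0.
    let mut fr = vec![max_val; len2 + 2]; // saved H[k-1][j-2] per column
    let mut r1 = vec![max_val; len2 + 2]; // previous row
    let mut r: Vec<isize> = Vec::with_capacity(len2 + 2); // current row
    r.push(max_val);
    r.extend(0..=len2 as isize);

    for i in 1..=len1 {
        // After the swap, `r1` is row i-1 and `r` still holds row i-2.
        std::mem::swap(&mut r, &mut r1);
        let mut last_col_id: isize = -1; // last column matching s1[i-1]
        let mut last_i2l1 = r[1]; // H[i-2][j-1] while scanning
        r[1] = i as isize;
        let mut t = max_val; // H[i-2][l-1] for the last match column l

        for j in 1..=len2 {
            let cost = (s1[i - 1] != s2[j - 1]) as isize;
            let diag = r1[j] + cost; // substitution or match
            let left = r[j] + 1; // insertion
            let up = r1[j + 1] + 1; // deletion
            let mut temp = diag.min(left).min(up);

            if s1[i - 1] == s2[j - 1] {
                last_col_id = j as isize;
                fr[j + 1] = r1[j - 1]; // remember H[i-1][j-2]
                t = last_i2l1; // remember H[i-2][j-1]
            } else {
                let k = last_row_id.get(&s2[j - 1]).copied().unwrap_or(-1);
                let l = last_col_id;
                if j as isize - l == 1 {
                    // Transposition with adjacent columns.
                    temp = temp.min(fr[j + 1] + (i as isize - k));
                } else if i as isize - k == 1 {
                    // Transposition with adjacent rows.
                    temp = temp.min(t + (j as isize - l));
                }
            }

            last_i2l1 = r[j + 1];
            r[j + 1] = temp;
        }
        last_row_id.insert(s1[i - 1], i as isize);
    }

    r[len2 + 1] as usize
}
```

With this sketch, `damerau_levenshtein_linear("ca", "abc")` evaluates to 2, which distinguishes the unrestricted distance from the optimal string alignment variant. Memory use grows with the length of `b` only, rather than with the product of both lengths.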
For reference here is a binary size report for the string version:
Before the change:
After the change: