various improvements #31
Conversation
not everything needs to be in one file; these test public interfaces, so there is no need for them to be within lib
redundant now with big test block moved out of lib
- shorter first para gives tidier documentation
- hyperlinks added that can be clicked on when reading documentation
- quoted literals
utterly pointless to be doing the matching
use a helper function for doing the comparison; this also gives a nice `assert_eq` style output on failure, enabling you to see the actual value generated, if ever of interest
much clearer to have opening brace on next line with multi-line conditions
better clarity of purpose - these are char counts, not to be confused with a string's len() value
can return early if one is an empty string, no need to waste effort counting chars
only count chars where needed; this could be written in fewer lines, but this seems much more readable
char counts are not used if strings are equal, so count after comparison
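A minimal sketch (not the crate's actual code) of how the last three points might combine at the top of a levenshtein-style function; the function name and the particular two-row implementation below are assumptions:

```rust
/// Illustrative sketch only: the ordering of the cheap checks is the point
/// here, not the particular DP implementation that follows them.
pub fn levenshtein(a: &str, b: &str) -> usize {
    // char counts are not needed when the strings are equal, so compare first
    if a == b {
        return 0;
    }
    // return early if either string is empty; the distance is then simply the
    // char count of the other string, so there is nothing more to do
    if a.is_empty() {
        return b.chars().count();
    }
    if b.is_empty() {
        return a.chars().count();
    }
    // chars are only collected/counted once the cheap cases have been handled
    let b_chars: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    let mut curr = vec![0usize; b_chars.len() + 1];
    for (i, ca) in a.chars().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b_chars.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost)
                .min(prev[j + 1] + 1)
                .min(curr[j] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b_chars.len()]
}
```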
as just done with levenshtein
as per levenshtein
to be used shortly for optimisation
use new split_on_common_prefix() helper. In cases of two non-equal strings with equal byte length, some portion of them would be compared in the equality check before the check fails and it moves on to a more thorough calculation. Unfortunately, the information about how much of the start of the strings did match was thrown away (not made available from the basic equality check), which would have been useful, since we could have skipped past it (the common prefix contributes nothing to the distance). The new helper allows precisely this. It finds the common prefix, splitting the strings after it. We can then ignore it and proceed with just the suffixes that remain (if any).
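A sketch of what such a helper might look like; only the name `split_on_common_prefix` comes from the commit message, the body is a plausible reconstruction:

```rust
/// Plausible reconstruction only: finds the common prefix of `a` and `b` and
/// returns the two remaining suffixes, i.e. exactly the information that a
/// plain `a == b` check throws away.
fn split_on_common_prefix<'a>(a: &'a str, b: &'a str) -> (&'a str, &'a str) {
    let mut split = 0;
    for (ca, cb) in a.chars().zip(b.chars()) {
        if ca != cb {
            break;
        }
        split += ca.len_utf8();
    }
    // `split` is a char boundary in both strings, since the prefixes are
    // byte-for-byte identical up to this point
    (&a[split..], &b[split..])
}
```

Comparing, say, "prefix_foo" and "prefix_bar" can then proceed with just "foo" and "bar".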
The new 'normalised' form does char counting on top of the standard algorithm it calls. This change avoids unnecessary recalculation by moving the main algorithm to a private function which takes `Option` wrapped counts, allowing the normalised form to pass in the values it calculates, thus letting this be done only once.
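A sketch of the `Option`-wrapped-counts arrangement; the names, signatures and the placeholder distance below are assumptions rather than the crate's actual code:

```rust
// Illustrative only: the main algorithm lives in a private function that can
// accept caller-supplied char counts, so counting happens at most once.
fn levenshtein_impl(a: &str, b: &str, counts: Option<(usize, usize)>) -> usize {
    // use caller-supplied counts when given, otherwise compute them here
    let (a_count, b_count) =
        counts.unwrap_or_else(|| (a.chars().count(), b.chars().count()));
    // ... the main algorithm would follow, using a_count / b_count ...
    a_count.abs_diff(b_count) // placeholder result so the sketch compiles
}

pub fn levenshtein(a: &str, b: &str) -> usize {
    levenshtein_impl(a, b, None)
}

pub fn normalized_levenshtein(a: &str, b: &str) -> f64 {
    // the normalised form needs the counts itself, so it computes them once
    // and passes them down rather than having the inner function recount
    let a_count = a.chars().count();
    let b_count = b.chars().count();
    if a_count == 0 && b_count == 0 {
        return 1.0;
    }
    let distance = levenshtein_impl(a, b, Some((a_count, b_count)));
    1.0 - (distance as f64 / a_count.max(b_count) as f64)
}
```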
avoid floats for simple counting; convert to float at the end, where it actually needs to be a float for the final calculation
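A tiny illustration (made-up signature) of the integer-tallies pattern, using the standard Jaro formula for the final float calculation:

```rust
// Illustrative only: `matches` and `transpositions` are accumulated as plain
// usize values during the loops; the conversion to f64 happens only here.
// `transpositions` is taken to be the already-halved count, as in the usual
// Jaro formula.
fn jaro_score(matches: usize, transpositions: usize, a_count: usize, b_count: usize) -> f64 {
    if matches == 0 {
        return 0.0;
    }
    let m = matches as f64;
    ((m / a_count as f64) + (m / b_count as f64) + ((m - transpositions as f64) / m)) / 3.0
}
```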
the strings are already `&str`
can't expect users to have read the readme
as per levenshtein optimisation rapidfuzz#2
as per levenshtein optimisation rapidfuzz#2
`usize` implements `Copy`, thus we should prefer copying rather than cloning, as recommended by the std documentation
rather than copy prev → prev2 then curr → prev, we can simplify and optimise by just switching the prev & prev2 vecs, and copying curr → prev.
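A sketch of that row rotation (variable names assumed), using `std::mem::swap` so that only the `curr → prev` copy remains:

```rust
// Illustrative only: all three DP rows have the same length, so the copy at
// the end is safe.
fn rotate_rows(prev2: &mut Vec<usize>, prev: &mut Vec<usize>, curr: &[usize]) {
    // before: copy prev into prev2, then copy curr into prev (two O(n) copies)
    // after:  prev2 simply takes over prev's buffer, then curr is copied in
    std::mem::swap(prev2, prev);
    prev.copy_from_slice(curr);
}
```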
avoid clone() on a usize, it's copy-able so let's just copy it
simpler, and construction with vec! here should be more efficient than a push() loop (see the implementation, which uses the private extend_with() function)
use vec! for `curr_distances` construction, as with jaro optimisation rapidfuzz#3
it's a little hard to walk through all of the std code, but I'm pretty certain two separate range collect()'s are more efficient than the push() loop. It's also simpler
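A small illustration of both construction styles mentioned in the last few commits; the names are made up:

```rust
// Illustrative only: building the rows directly rather than via push() loops.
fn build_initial_rows(len: usize) -> (Vec<usize>, Vec<usize>) {
    // a zero-filled working row, built directly with `vec!`
    let curr_distances = vec![0usize; len + 1];
    // an "initial distances" row 0, 1, 2, ..., len built with a range collect()
    let prev_distances: Vec<usize> = (0..=len).collect();
    (curr_distances, prev_distances)
}
```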
same as with levenshtein optimisation rapidfuzz#3 - avoid unnecessary repeated char counting with the 'normalised' form.
`RangeInclusive` would have been most appropriate, but no direct access to inner attributes
the benefit of being a macro is that test failure output will show the file and line number of the assertion that failed, within the test, rather than pointing within the helper, which helps pinpoint the location better than just having the name of the test function
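A sketch of such an assertion macro (the name and tolerance are assumptions); because a macro expands at the call site, the panic on failure points at the line inside the test rather than inside a shared helper function:

```rust
// Illustrative only: a float-comparison assertion that expands in place.
macro_rules! assert_delta {
    ($actual:expr, $expected:expr) => {{
        let (actual, expected): (f64, f64) = ($actual, $expected);
        assert!(
            (actual - expected).abs() < 1e-4,
            "actual: {}, expected: {}",
            actual,
            expected
        );
    }};
}

#[test]
fn example() {
    // if this fails, the panic message points at this line in this test
    assert_delta!(2.0_f64.sqrt(), 1.4142);
}
```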
This optimisation addresses the fact that in the Jaro-Winkler implementation, any common prefix is processed to determine its char count, in addition to this portion of the strings also being processed within the standard Jaro algorithm. Here we move the Jaro implementation to an 'inner' function, which takes an additional parameter: the char count of a common prefix that has been removed from the provided pair of strings. The algorithm can then take this into account in calculating the metric. The standard Jaro public API function calls this with both full strings and a value of zero for this new param. The Jaro-Winkler implementation now uses the prefix-splitting helper, passing the pair of suffixes to the 'inner' Jaro algorithm, along with the char count of the prefix. The adjustment is then performed as before. This avoids wasting unnecessary effort on the prefix portion within the Jaro algorithm.
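A sketch of that restructuring; the function names, the fold-based prefix scan and the placeholder body are assumptions, with the actual Jaro matching/transposition work elided:

```rust
// Sketch only: names and bodies are assumed.
fn jaro_inner(a_suffix: &str, b_suffix: &str, prefix_count: usize) -> f64 {
    // the stripped prefix still counts towards the strings' char counts, so
    // the metric comes out as if the full strings had been processed
    let a_count = prefix_count + a_suffix.chars().count();
    let b_count = prefix_count + b_suffix.chars().count();
    if a_count == 0 && b_count == 0 {
        return 1.0;
    }
    // ... matching / transposition work over the suffixes would go here ...
    0.0 // placeholder
}

pub fn jaro(a: &str, b: &str) -> f64 {
    // public API unchanged: full strings, and zero for the new parameter
    jaro_inner(a, b, 0)
}

pub fn jaro_winkler(a: &str, b: &str) -> f64 {
    // one pass over the common prefix gives its byte length and char count
    let (prefix_bytes, prefix_count) = a
        .chars()
        .zip(b.chars())
        .take_while(|(ca, cb)| ca == cb)
        .fold((0, 0), |(bytes, count), (ca, _)| (bytes + ca.len_utf8(), count + 1));
    // the inner algorithm only has to process the suffixes
    let sim = jaro_inner(&a[prefix_bytes..], &b[prefix_bytes..], prefix_count);
    // Winkler adjustment as before (weight 0.1, prefix capped at 4 chars)
    sim + (prefix_count.min(4) as f64) * 0.1 * (1.0 - sim)
}
```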
While `get_diverge_indice` retrieves the index as before, it now keeps track of the char count while it is looping over chars, so this can be used where wanted (jaro-winkler) without a separate `.chars().count()`. Hopefully, with `inline(always)` the counting will be optimised away in the implementations that do not use the count.
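A sketch of the updated helper; only the name `get_diverge_indice` comes from the commit message, and the body is a plausible reconstruction:

```rust
// Illustrative only: returns (byte index where the strings diverge, char
// count of the common prefix). Callers that ignore the count should see it
// optimised away thanks to `inline(always)`.
#[inline(always)]
fn get_diverge_indice(a: &str, b: &str) -> (usize, usize) {
    let mut indice = 0;
    let mut char_count = 0;
    for (ca, cb) in a.chars().zip(b.chars()) {
        if ca != cb {
            break;
        }
        indice += ca.len_utf8();
        char_count += 1;
    }
    (indice, char_count)
}
```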
use the char count returned from the updated helper
Taking the j-w optimisations further, this makes use of the prefix splitting helper within the inner Jaro algorithm. The function has been modified such that instead of taking the char count of the common prefix removed from the pair of strings, it now optionally takes a pointer through which to return the count, obtaining it within the function by using the helper internally. Using the prefix splitting helper within the function means that we avoid doing a `.chars().count()` iteration over the prefix twice, once going over `a` and once going over `b`. It also then allows the main part of the algorithm to completely avoid processing the common prefix portion of the strings.
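A compact sketch of that further change (names assumed), showing the optional out-parameter through which the prefix char count is reported back:

```rust
// Illustrative only: the inner function now does the prefix splitting itself.
fn jaro_inner(a: &str, b: &str, prefix_chars_out: Option<&mut usize>) -> f64 {
    // one pass over the common prefix: byte length and char count together
    let (mut bytes, mut chars) = (0usize, 0usize);
    for (ca, cb) in a.chars().zip(b.chars()) {
        if ca != cb {
            break;
        }
        bytes += ca.len_utf8();
        chars += 1;
    }
    if let Some(out) = prefix_chars_out {
        // jaro_winkler asks for the count; plain jaro passes `None`
        *out = chars;
    }
    let (a_suffix, b_suffix) = (&a[bytes..], &b[bytes..]);
    // ... the main Jaro work then runs over the suffixes only ...
    let _ = (a_suffix, b_suffix);
    0.0 // placeholder so the sketch compiles
}

// usage from the Jaro-Winkler side, roughly:
pub fn jaro_winkler(a: &str, b: &str) -> f64 {
    let mut prefix_chars = 0usize;
    let sim = jaro_inner(a, b, Some(&mut prefix_chars));
    sim + (prefix_chars.min(4) as f64) * 0.1 * (1.0 - sim)
}
```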
Hey, I'm sorry for taking so long to get to this. I do appreciate all the work you put into it. I know you requested that I not ask you to break it up into smaller pull requests, but it is just too big. Would you be open to me helping you do so? I could go through the commits and make a checklist of discrete changes. So at least you wouldn't have to do the work of figuring out how to split it up.
Sure, a plan would be great :)
to be used with suggestion matching. this implementation is based upon the implementation from the `strsim` Rust crate, authored by Danny Guo, available at [1]; more specifically the (as yet un-merged) optimised copy authored by myself, available at [2]. the code is available under the MIT license. one implementation difference is that we use floats rather than doubles since we only intend to use this for suggestion matching in unknown option error messages and we don't need the extra precision.

[1]: https://github.com/dguo/strsim-rs
[2]: rapidfuzz/strsim-rs#31
It has been quite a while since back then and a few things have changed in the meantime. One of the major changes is that going forward we will focus on implementing algorithms with reasonable performance, while keeping the binary size small. People who care about the maximum performance will be recommended to use https://github.com/rapidfuzz/rapidfuzz-rs which focuses more on performance, but has a larger binary size. From a quick look over the changes they can be grouped into a couple of buckets:
I agree with @dguo that if we want to have even part of this in, we would need to split this up into multiple separate PRs. In the current form it's simply not possible to review this properly. @jnqnfe are you still interested in working on these things?
Currently I have far too much on my plate to do any further work on this and I don't see that changing any time soon, sorry. It's taken me three days to even find time to provide this reply.
That's completely understandable. I just wanted to give you the option to continue working on this, since I know you did put quite a bit of work into this back then. I did look through the changes a bit more:
Not sure whether I missed anything small, but finding out would require this to be rebased, which would be a significant amount of work. For this reason I will close this PR. Thanks again for the work you did put into this. I know the review has taken us quite a while 😅
please don't request breaking them into separate pull requests, that would be horrendous :)
I have not yet found time to review with respect to #30