Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

various improvements #31

Closed
wants to merge 53 commits into from
Closed

various improvements #31

wants to merge 53 commits into from

Conversation

jnqnfe
Copy link

@jnqnfe jnqnfe commented Nov 4, 2018

please don't request breaking them into separate pull requests, that would be horrendous :)

I have no yet found time to review with respect to #30

jnqnfe added 30 commits November 3, 2018 03:23
not everything needs to be in one file
testing public interfaces, so no need to be within lib
redundant now with big test block moved out of lib
 - shorter first para gives tidier documentation
 - hyperlinks added that can be clicked on when reading documentation
 - quoted literals
utterly pointless to be doing the matching
use helper function for doing the comparison

this also gives a nice `assert_eq` style output on failure, enabling you to
see the actual value generated, if ever of interest
much clearer to have opening brace on next line with multi-line conditions
better clarity of purpose - these are char counts, not to be confused with
a string's len() value
can return early if one is an empty string, no need to waste effort
counting chars
only count chars where needed

this could be written in fewer lines, but this seems much more readable
char counts are not used if strings are equal, so count after comparison
as just done with levenshtein
as per levenshtein
to be used shortly for optimisation
use new split_on_common_prefix() helper

In cases of two non-equal strings with equal byte length, some portion of
them would be compared in the equality check, before the check fails and
it moved on to a more thorough calculation. Unfortunately, the information
about what length of the start of the strings did match was thrown away
(not made available from the basic equality check), which would have been
useful, since we could skip past it (the common prefix has zero value).

The new helper allows precisely this. It finds the common prefix, splitting
the strings after it. We can then ignore it, and proceed on with just the
suffixes that remain (if any).
The new 'normalised' form does char counting on top of the standard
algorithm it calls. This change avoids unnecessary recalculation by
moving the main algorithm to a private function which takes `Option`
wrapped counts, allowing the normalised form to pass in the values it
calculates, thus letting this be done only once.
avoid floats for simply counting, convert to float at end where it needs
to actually be a float for the final calculation
the strings are already `&str`
can't expect users to have read the readme
as per levenshtein optimisation rapidfuzz#2
as per levenshtein optimisation rapidfuzz#2
jnqnfe added 13 commits November 4, 2018 05:39
`usize` implements copy, thus we should prefer copying rather than cloning,
as recommended by the std documentation
rather than copy prev → prev2 then curr → prev, we can simplify and
optimise by just switching the prev & prev2 vecs, and copying curr → prev.
avoid clone() on a usize, it's copy-able so let's just copy it
more simple, and construction with vec! here should be more optimal over
a push() loop (see the implementation which uses the private extend_with()
function)
use vec! for `curr_distances` construction, as with jaro optimisation rapidfuzz#3
little hard to walk through all of the std code, but I'm pretty certain
two separate range collect()'s are more efficient than the push() loop.

it's also more simple
same as with levenstein optimisation rapidfuzz#3 - avoid unnecessary repeated char
counting with 'normalised' form.
`RangeInclusive` would have been most appropriate, but no direct access to
inner attributes
the benefit of being a macro is that test failure output will show the
file and line number of the assertion that failed, within the test, rather
than pointing within the helper, which helps pinpoint the location better
than just having the name of the test function
This optimisation addresses the fact that in the Jaro-Winkler
implementation, any common prefix is processed to determine its
char count in addition to this portion of the stirngs also being
processed within the use of the standard Jaro algorithm.

Here we move the Jaro implementation to an 'inner' function, which
takes an additional parameter - the char count of a common prefix
that has been removed from the provided pair of strings. The
algorithm can then take this into account in calculating the metric.

The standard Jaro public API function calls this with both full
strings and a value of zero for this new param.

The Jaro-Winkler implementation now uses the prefix-splitting
helper, passing the pair of suffixes in to the 'inner' Jaro algorithm,
along with the char-count of the prefix. The adjustment is then
performed as before. This avoids wasting unnecessary effort on
the prefix portion within the use of the Jaro algorithm.
While `get_diverge_indice` retrieves the indice as before, it now keeps
track of the char count while it is looping over chars, so this can be
used where wanted (jaro-winkler) without a separate `.chars().count()`.

Hopefully, with `inline(always)` the counting will be optimised away in
the implementations that do not use the count.
use the char count returned from the updated helper
Taking the j-w optimisations further, this makes use of the prefix
splitting helper within the inner Jaro algorithm.

The function has been modified such that instead of taking a char-count of
the size of the common prefix removed from the pair of strings, it now
optionally takes a pointer to return the count, obtaining it within the
function through use of the helper internally.

Using the prefix splitting helper within the function means that we avoid
doing a `.chars().count()` iteration over the prefix twice, once going
over `a` and once going over `b`. It also then allows the main part of the
algorithm to completely avoid processing the common prefix portion of the
strings.
@dguo
Copy link
Member

dguo commented Apr 8, 2019

Hey, I'm sorry for taking so long to get to this. I do appreciate all the work you put into it. I know you requested that I not ask you to break it up into smaller pull requests, but it is just too big. Would you be open to me helping you do so? I could go through the commits and make a checklist of discrete changes. So at least you wouldn't have to do the work of figuring out how to split it up.

@jnqnfe
Copy link
Author

jnqnfe commented Apr 11, 2019

Sure, a plan would be great :)

vlc-mirrorer pushed a commit to videolan/vlc that referenced this pull request Dec 18, 2021
to be used with suggestion matching.

this implementation is based upon the implementation from the `strsim` Rust
crate, authored by Danny Guo, available at [1]; more specifically the (as
yet un-merged) optimised copy authored by myself, available at [2]. the
code is available under the MIT license.

one implementation difference is that we use floats rather than doubles
since we only intend to use this for suggestion matching in unknown option
error messages and we don't need the extra precision.

[1]: https://github.com/dguo/strsim-rs
[2]: rapidfuzz/strsim-rs#31
@maxbachmann
Copy link
Member

It has been quite a while since back then and a few things have changed in the meantime. One of the major changes is that going forward we will focus on implementing algorithms with reasonable performance, while keeping the binary size small. People who care about the maximum performance will be recommended to use https://github.com/rapidfuzz/rapidfuzz-rs which focuses more on performance, but has a larger binary size.

From a quick look over the changes they can be grouped into a couple of buckets:

  • update formatting: we do now use cargo fmt to keep consistent formatting

  • move tests: in rust it appears to be common to have the tests next to the code, while only placing integration tests outside. We will need to look into where we want which parts in the future

  • improve tests: We certainly need to improve them. Currently the test output is not particularly helpful

  • improve algorithms:

    • avoid calculating lengths multiple times: that probably makes sense and is already done e.g. in reimplement damerau levenshtein distance #61
    • remove common prefix: This is generally an improvement and for this reason is done in https://github.com/rapidfuzz/rapidfuzz-rs. In fact it is often possible to remove both common prefix and postfix. However this is likely nothing we want here, since it can affect the binary size quite a bit.
    • use inclusive ranges: I opened evaluate inclusive ranges #59 for this, since while this improves readability a bit, it can negatively affect the binary size.
  • some documentation improvements like mentioning that the algorithms are not based on grapheme clusters. This could be extended to mention how to match using grapheme cluster which should be possible when using the generic versions of the metrics

I agree with @dguo that if we want to have even part of this in, we would need to split this up into multiple separate PRs. In the current form it's simply not possible to review this properly.

@jnqnfe are you still interested in working on these things?

@jnqnfe
Copy link
Author

jnqnfe commented Jan 5, 2024

Currently I have far too much on my plate to do any further work on this and I don't see that changing any time soon, sorry. It's taken me three days to even find time to provide this reply.

@maxbachmann
Copy link
Member

That's completely understandable. I just wanted to give you the option to continue working on this, since I know you did put quite a bit of work into this back then.

I did look through the changes a bit more:

  • The algorithmic improvements are probably nothing we would want. Not because they are useless, but because the project does focus on binary size nowadays and they just do not provide enough of an improvement to warrant the binary size increase.
  • The tests move is nothing we would want. I opened improve error message on test failure #67 which just improves the failure message, since the current one is not really helpful
  • the benchmark cleanup is something we would want to change. I opened remove superfluous module around benchmarks #66 just cleaning up the unnecessary bits. I will probably change this to use criterion soon though.
  • the documentation will need to be tackled separately, since we certainly need to improve it.

Not sure I missed anything small, but finding it would require this to be rebased which would be a significant amount of work. For this reason I will close this PR.

Thanks again for the work you did put into this. I know the review has taken us quite a while 😅

@maxbachmann maxbachmann closed this Jan 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants