Modularizing the crate #50
One interesting thing this would enable is using more sophisticated ML approaches (e.g. modern Transformers) as suggestion producers. This is not possible at the moment because it fundamentally clashes with the portability goals of nlprule.
I like the writeup and the direction this points in! If some crates don't make sense without others, re-export them behind a feature flag. Feel free to ping me for feedback on such a PR :)
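A minimal sketch of what such feature-gated re-exports could look like in the top-level crate; the sub-crate and feature names here are hypothetical, not existing crates:

```rust
// Minimal sketch of feature-gated re-exports in a top-level `nlprule` crate.
// `nlprule_grammar`, `nlprule_spellcheck` and the feature names are hypothetical.

#[cfg(feature = "grammar")]
pub use nlprule_grammar as grammar;

#[cfg(feature = "spellcheck")]
pub use nlprule_spellcheck as spellcheck;
```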
I think the spaCy Python library also deals with these modularization questions. They have a layered pipeline architecture which could be worth checking out: https://spacy.io/usage/processing-pipelines
Thanks @sai-prasanna, SpaCy is definitely a good example of where this was solved well.

I will move forward with some of the higher-level changes mentioned here before merging #51 since, as @drahnr pointed out, spellchecking would benefit a lot from having better modularization beforehand (also, I myself do not need nlprule in a project right now, so I'm not in a rush to add spellchecking). In particular, that means properly addressing the complexity of combining multiple suggestion providers which may or may not need some preprocessing (e.g. in the form of the tokenizer):

```rust
let correcter = Pipeline::new((
    tokenizer!("en"),
    Union::new((rules!("en"), spell!("en", "en_GB"))),
));
```

where the `Union` combines the suggestions from multiple providers.

Once spellchecking is implemented, this would add a third binary. I'm not happy with the way binaries are currently included for a couple of reasons; writing

```rust
let mut tokenizer_bytes: &'static [u8] = include_bytes!(concat!(
    env!("OUT_DIR"),
    "/",
    tokenizer_filename!("en")
));
```

for one binary is way too unergonomic, so to fix this I want to deprecate the current way of including the binaries. The key tradeoff here is convenience for less customizability. In particular, one change I believe I'll have to make is that the binaries will be internally gzipped (together with #20 they'll be a lot smaller).

I would as always appreciate feedback on both of these things. They are not set in stone, just what I currently think is the best solution. I'll open PRs once I have an implementation; in the meantime I think it makes sense to keep discussion in this issue even though the two parts are only tangentially related.

Also note that part one also implies changes for the Python bindings, with the only difference that correct composition is not checked at compile time (obviously). This would look very similar to the pipelines in sklearn.
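As a rough illustration of the "internally gzipped" idea (this is not how nlprule works today): assuming the binaries stay loadable through `Rules::from_reader` or a similar constructor taking any `std::io::Read`, the decompression step could look roughly like this, shown explicitly with `flate2` even though it would presumably live inside the library:

```rust
// Rough sketch only: decompressing a gzipped rules binary before handing it
// to a `Rules::from_reader`-style constructor. In practice the decompression
// would be an internal detail of nlprule rather than user code.
use std::fs::File;
use std::io::BufReader;

use flate2::read::GzDecoder;
use nlprule::Rules;

fn load_gzipped_rules(path: &str) -> Rules {
    let file = BufReader::new(File::open(path).expect("rules binary not found"));
    let decoder = GzDecoder::new(file);
    Rules::from_reader(decoder).expect("failed to deserialize rules binary")
}
```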
As suggested originally by @drahnr (#2 (comment)), we should consider splitting nlprule into multiple sub-crates.
What's certain is that there will be one `nlprule` crate combining the sub-crates into one higher-level API which is equivalent to the Python API.

There are multiple reasons for doing a split: modularizing the crate would primarily benefit the size of the binaries and the needed dependencies.
There is a distinction between:

1. sentence and token segmentation,
2. setting information on the tokens (what is currently part of the `Tokenizer`), and
3. entities which operate on the tokens to produce suggestions.
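To make the boundary between these three layers concrete, here is a rough, hypothetical sketch of how they could map onto Rust traits; none of these traits or types exist in nlprule, the names and signatures are purely illustrative:

```rust
// Hypothetical trait boundaries for the proposed split. Nothing here exists
// in nlprule today; names and signatures are illustrative only.

/// Placeholder types standing in for whatever shared data structures the
/// sub-crates would agree on.
struct Token;
struct Suggestion;

/// (1.) sentence and token segmentation (e.g. a future `nlprule-tokenize`).
trait Segmenter {
    fn segment(&self, text: &str) -> Vec<Vec<Token>>;
}

/// (2.) setting information on the tokens
/// (e.g. `nlprule-disambiguate`, `nlprule-multiword`, `nlprule-crf`).
trait Annotator {
    fn annotate(&self, sentence: &mut [Token]);
}

/// (3.) operating on the annotated tokens to produce suggestions
/// (e.g. `nlprule-grammar`, `nlprule-spellcheck`).
trait SuggestionProvider {
    fn suggest(&self, sentence: &[Token]) -> Vec<Suggestion>;
}
```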
Splitting (3.) into modules is easy: there should be one module for each separate entity which operates on the tokens, i.e. one for spellchecking (`nlprule-spellcheck`) and one for grammar rules (`nlprule-grammar` or something similar).

Splitting (2.) and (1.) is harder. I think having (1.) as a separate crate which only does sentence segmentation and token segmentation (`nlprule-tokenize`) would make sense. Then there could be multiple crates which set different information on the tokens, for example `nlprule-disambiguate` (disambiguation rules), `nlprule-multiword` (multiword tagging) and `nlprule-crf` (chunking and possibly other CRF models behind feature flags).

There are a number of open issues, like:
- How to handle the binaries for each sub-crate: something like `nlprule-build` might be an option.
- What to do about crates which only make sense in combination: `nlprule-grammar` does not make sense without `nlprule-disambiguate`. Panicking as early as possible is probably best.
- Whether the name `nlprule` still makes sense, since it is possible to combine modules in a way that does not use rules at all. But that's not the biggest issue 🙂

Implementing this is not a near-term goal right now (see #42) and I am not sure whether it is worth it from a practical point of view, but I wanted to open this issue for discussion. Also, thanks @drahnr for the original suggestion.