- Decrease the time it takes to load the `Tokenizer` by ~40% (#70).
- Tag lookup is backed by a vector instead of a hashmap now.
- The tagger now returns iterators over tags instead of allocating a vector.
- Remove the `get_group_members` function.
- Fix a bug where calling `Rule::suggest` in parallel across threads would cause a panic (#68, thanks @drahnr!).
- Speed up loading the `Tokenizer` by ~25% (#66).
- Build Python wheels in a container for full manylinux2014 compliance; they now work for glibc 2.17 and above (thanks @dvwright!).
- Speed up loading the `Tokenizer` by avoiding an allocation (thanks @drahnr!).
- Fix a significant bug where text with multiple sentences would sometimes cause an error if one of the later sentences matched some pattern (#61, #63, thanks @drahnr!).
- Remove `multiword_tags` on tokens (now part of the regular tags).
- Make the fields of `Word` private and add getter methods. The `Word` constructor is now called `new` instead of `new_with_tags`.
- Adds an `as_str` convenience method to multiple structs (`WordId`, `PosId`, `Word`).
- Restore the `FromIterator` and `IntoIterator` impls on `Rules` (#58, thanks @drahnr!).
- Add `Clone` derives on `Tokenizer` and `Rules` (and, accordingly, on their fields).
- Changes the focus from `Vec<Token>` to `Sentence` (#54). `pipe` and `sentencize` now return iterators over `Sentence` / `IncompleteSentence`.
- Removes the special `SENT_START` token (it is now only used internally). Each token now corresponds to at least one character in the input text.
- Makes the fields of `Token` and `IncompleteToken` private and adds getter methods (#54). `char_span` and `byte_span` are replaced by a `Span` struct which keeps track of char and byte indices at the same time (#54). To e.g. get the byte range, use `token.span().byte()`.
- Spans are now relative to the input text, not to sentence boundaries anymore (#53, thanks @drahnr!).
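
A minimal sketch of the new flow (the binary path is a placeholder; the `tokens()` accessor and the `char()` getter are assumptions based on the items above, not guaranteed API):

```rust
use nlprule::Tokenizer;

fn main() -> Result<(), nlprule::Error> {
    let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;

    // `pipe` yields one `Sentence` per sentence in the input text.
    for sentence in tokenizer.pipe("A short example. And another sentence.") {
        for token in sentence.tokens() {
            // A `Span` tracks byte and char indices at the same time,
            // both relative to the full input text.
            println!("{:?} / {:?}", token.span().byte(), token.span().char());
        }
    }

    Ok(())
}
```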
- The regex backend can now be chosen between Oniguruma and fancy-regex with the features `regex-onig` and `regex-fancy`. `regex-onig` is the default.
- nlprule now compiles to WebAssembly. WebAssembly support is guaranteed for future versions and tested in CI.
- A new selector API to select individual rules (details documented in `nlprule::rule::id`). For example:

  ```rust
  use nlprule::{Tokenizer, Rules, rule::id::Category};
  use std::convert::TryInto;

  let mut rules = Rules::new("path/to/en_rules.bin")?;

  // disable rules named "confusion_due_do" in category "confused_words"
  rules
      .select_mut(
          &Category::new("confused_words")
              .join("confusion_due_do")
              .into(),
      )
      .for_each(|rule| rule.disable());

  // disable all grammar rules
  rules
      .select_mut(&Category::new("grammar").into())
      .for_each(|rule| rule.disable());

  // a string syntax where slashes are the separator is also supported
  rules
      .select_mut(&"confused_words/confusion_due_do".try_into()?)
      .for_each(|rule| rule.enable());
  ```
- `.validate()` in `nlprule-build` now returns a `Result<()>` to encourage calling it after `.postprocess()`.
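
A sketch of how this fits into a `build.rs`, assuming the `BinaryBuilder` entry point of `nlprule-build` (language list, error type, and binary names are illustrative):

```rust
// build.rs
fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("cargo:rerun-if-changed=build.rs");

    nlprule_build::BinaryBuilder::new(
        &["en"],
        std::env::var("OUT_DIR").expect("OUT_DIR is set when build.rs runs"),
    )
    .build()?
    // `.validate()` returns `Result<()>`, so it chains naturally after `.build()`
    // (or after `.postprocess(...)`, if the binaries are transformed).
    .validate()?;

    Ok(())
}
```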
- Fixes an error where the `Cursor` position in `nlprule-build` was not reset appropriately.
- Use `fs_err` everywhere for better error messages.
- A `transform` function in `nlprule-build` to transform binaries immediately after acquiring them, suited e.g. for compressing the binaries before caching them.
- Require `srx=^0.1.2` to include a patch for an out-of-bounds access.
This is a patch release, but there are some small breaking changes to the public API:
- The `from_reader` and `new` methods of `Tokenizer` and `Rules` now return an `nlprule::Error` instead of a `bincode::Error`.
- The `tag_store` and `word_store` methods of the `Tagger` are now private.
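
A sketch of what loading looks like after this change (binary paths are placeholders):

```rust
use nlprule::{Rules, Tokenizer};

// Both constructors now surface an `nlprule::Error`, so loading can use `?`
// against a single error type.
fn load() -> Result<(Tokenizer, Rules), nlprule::Error> {
    let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin")?;
    let rules = Rules::new("path/to/en_rules.bin")?;
    Ok((tokenizer, rules))
}
```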
- The `nlprule-build` crate now has a `postprocess` method to allow e.g. compression of the produced binaries (#32, thanks @drahnr!).
- Newtypes for `PosIdInt` and `WordIdInt` to clarify the use of ids in the tagger (#31).
- Newtype for indices into the match graph (`GraphId`). All graph ids are validated at build time now; this also fixed an error where invalid graph ids in the XML files were ignored (#31).
- Reduced the size of the English tokenizer through better serialization of the chunker: from 15MB (7.7MB gzipped) to 11MB (6.9MB gzipped).
- Reduce allocations by making more use of iterators internally (#30). This improves speed, but there is no significant benchmark improvement on my machine.
- Improve error handling by propagating more errors in the `compile` module instead of panicking, and through better build-time validation. Reduces `unwrap`s from ~80 to ~40.
- `nlprule` now does sentence segmentation internally using srx. The Python API has changed, removing the `SplitOn` class and the `*_sentence` methods:

  ```python
  tokenizer = Tokenizer.load("en")
  rules = Rules.load("en", tokenizer)

  rules.correct("He wants that you send him an email.")  # this takes an arbitrary text
  ```
- `new_from` is now called `from_reader` in the Rust API (thanks @drahnr!).
- `Token.text` and `IncompleteToken.text` are now called `Token.sentence` / `IncompleteToken.sentence` to avoid confusion with `Token.word.text`.
- `Tokenizer.tokenize` is now private. Use `Tokenizer.pipe` instead (it also does sentence segmentation).
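
A sketch of the renamed loader, assuming `from_reader` accepts any `std::io::Read` (binary paths are placeholders):

```rust
use std::fs::File;
use std::io::BufReader;

use nlprule::{Rules, Tokenizer};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Binaries can be loaded from any reader, not only from a filesystem path.
    let tokenizer = Tokenizer::from_reader(BufReader::new(File::open("path/to/en_tokenizer.bin")?))?;
    let rules = Rules::from_reader(BufReader::new(File::open("path/to/en_rules.bin")?))?;

    let _ = (tokenizer, rules); // ready for `pipe`, `suggest`, `correct`, ...
    Ok(())
}
```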
- Support for Spanish (experimental).
- A new multiword tagger improves tagging of e.g. named entities for English and Spanish.
- Adds the `nlprule-build` crate, which makes using the correct binaries in Rust easier (thanks @drahnr for the suggestion and discussion!).
- Scripts and docs in `build/README.md` to make creating the nlprule build directories easier and more reproducible.
- Full support for LanguageTool unifications.
- The binary size of the `Tokenizer` improved a lot: it is now roughly 6x smaller for German and 2x smaller for English.
- New iterator helpers for `Rules` (thanks @drahnr!).
- A method `.sentencize` on the `Tokenizer` which only does sentence segmentation and nothing else.
- BREAKING: `suggestion.text` is now more accurately called `suggestion.replacements`.
- Lots of speed improvements: NLPRule is now roughly 2.5x faster for German and 5x faster for English.
- Rules have more information in the public API now; see #5.
- Python 3.9 support (fixes #7)
- Fix precedence of Rule IDs over Rule Group IDs.
- Updated to LT version 5.2.
- Suggestions now have a `message` and `source` attribute (#5):

  ```python
  suggestions = rules.suggest_sentence("She was not been here since Monday.")
  for s in suggestions:
      print(s.start, s.end, s.text, s.source, s.message)

  # prints:
  # 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?
  ```
- NLPRule is parallelized by default now. Parallelism can be turned off by setting the `NLPRULE_PARALLELISM` environment variable to `false`.
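
A sketch of disabling parallelism programmatically; this assumes the variable is read lazily when nlprule first needs it, and setting `NLPRULE_PARALLELISM=false` in the shell before launching the process works just as well:

```rust
fn main() {
    // Must run before the first call into nlprule that would spin up worker threads.
    std::env::set_var("NLPRULE_PARALLELISM", "false");

    // ... load the Tokenizer / Rules and run checks single-threaded ...
}
```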