Yes, I see your point. I think you'd want to add the rule
r"(?<=[{au}]&[{au}])\.".format(au=ALPHA_UPPER),
to the _suffixes. Unfortunately you can't just put an optional & in the existing rule, because the look-behind can't be variable-width.
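To illustrate the fixed-width constraint, here is a minimal sketch using only Python's standard `re` module. It makes two assumptions: `ALPHA_UPPER` is replaced by the stand-in `"A-Z"` (spaCy's real class covers many more scripts), and the existing rule at the referenced line is taken to be the double-uppercase suffix rule. The helper `compiles` is hypothetical and exists only for this example.

```python
import re

# Assumption: simplified stand-in for spacy.lang.char_classes.ALPHA_UPPER.
ALPHA_UPPER = "A-Z"


def compiles(pattern):
    """Return True if the standard-library `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Assumed form of the existing rule: a period after two uppercase letters.
existing_rule = r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER)
# The additional rule suggested above: uppercase, ampersand, uppercase.
ampersand_rule = r"(?<=[{au}]&[{au}])\.".format(au=ALPHA_UPPER)
# An optional "&" makes the look-behind 2 or 3 characters wide, which `re` rejects.
optional_ampersand_rule = r"(?<=[{au}]&?[{au}])\.".format(au=ALPHA_UPPER)

print(compiles(existing_rule))            # fixed width of 2
print(compiles(ampersand_rule))           # fixed width of 3
print(compiles(optional_ampersand_rule))  # variable width

# Suffix rules only apply at the end of a string, so anchor each with "$".
suffix_re = re.compile("|".join(rule + "$" for rule in (existing_rule, ampersand_rule)))
for text in ("USA.", "P&L.", "Mr."):
    print(text, bool(suffix_re.search(text)))
```

Keeping the two fixed-width rules side by side gives the same coverage an optional `&` would have given.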
If you're training a custom model, you could modify this behaviour for your own custom tokenizer, cf. https://spacy.io/usage/linguistic-features#native-tokenizer-additions. You could also replace the tokenizer of a pretrained model with your own custom tokenizer. That may impact accuracy slightly, though probably not much in this case.
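A minimal sketch of that kind of tokenizer customization, following the pattern in the linked documentation. It assumes spaCy is installed and uses a blank English pipeline so that no pretrained model is required; the example sentence is made up for illustration.

```python
import spacy
from spacy.lang.char_classes import ALPHA_UPPER
from spacy.util import compile_suffix_regex

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Extra suffix rule: split a trailing "." after uppercase-&-uppercase, e.g. "P&L.".
ampersand_rule = r"(?<=[{au}]&[{au}])\.".format(au=ALPHA_UPPER)

# Extend the default suffixes and rebuild the tokenizer's suffix matcher.
suffixes = list(nlp.Defaults.suffixes) + [ampersand_rule]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

tokens = [token.text for token in nlp("Please check the P&L.")]
print(tokens)
```

The same assignment to `nlp.tokenizer.suffix_search` should work on a loaded pretrained pipeline, with the accuracy caveat mentioned above.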
We're typically hesitant to change the punctuation rules in the core library though, because there may be unwanted side effects, especially when changing the lang/punctuation.py file that is used as base for many other languages. On spaCy's develop branch, we have a specific punctuation file for English, https://github.com/explosion/spaCy/blob/develop/spacy/lang/en/punctuation.py, where we could consider adding this change for English only.
I've been trying to think of "bad" consequences of adding your proposed "ampersand" rule to the English tokenizer and can't immediately think of one. I'm less sure about other languages. Would be interested to hear what my colleagues think - e.g. @adrianeboyd ?
How to reproduce the behaviour
I get
The . should be separated from P&L here.
This behaviour comes from spaCy/spacy/lang/punctuation.py, line 33 in bf778f5.
The requirement for two uppercase letters is likely there for acronyms, but perhaps an ampersand is acceptable too, e.g.
r"(?<=&[{au}])\.".format(au=ALPHA_UPPER)
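For comparison, a small sketch of how this rule differs from the stricter variant `(?<=[{au}]&[{au}])\.` that also requires an uppercase letter before the ampersand. It uses only the standard `re` module and assumes `"A-Z"` as a simplified stand-in for `ALPHA_UPPER`; the sample strings and the helper `splits_period` are made up for illustration.

```python
import re

# Assumption: simplified stand-in for spacy.lang.char_classes.ALPHA_UPPER.
ALPHA_UPPER = "A-Z"

# Rule as proposed here: "." preceded by "&" and one uppercase letter.
loose_rule = re.compile(r"(?<=&[{au}])\.$".format(au=ALPHA_UPPER))
# Stricter variant: also require an uppercase letter before the "&".
strict_rule = re.compile(r"(?<=[{au}]&[{au}])\.$".format(au=ALPHA_UPPER))


def splits_period(rule, text):
    """Return True if the rule would treat the final '.' as a separate suffix."""
    return rule.search(text) is not None


for text in ("P&L.", "x&Y.", "&C.", "R&d."):
    print(text, splits_period(loose_rule, text), splits_period(strict_rule, text))
```

Both rules split the period in `P&L.`, but only the looser one fires when the character before the ampersand is not an uppercase letter.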
Your Environment