Suffix doesn't match for sentence ending in uppercase. #6695

jdupl123 · 2021-01-08T04:15:47Z

How to reproduce the behaviour

import spacy
nlp = spacy.load("en_core_web_sm")
list(nlp.tokenizer("about the P&L."))

I get

[about, the, P&L.]

The . should be separated from P&L here.

This behaviour comes from the following rule:

spaCy/spacy/lang/punctuation.py

Line 33 in bf778f5

r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),

The requirement for two uppercase letters before the period is likely there for acronyms, but perhaps an ampersand followed by an uppercase letter would be acceptable too,

e.g. r"(?<=&[{au}])\.".format(au=ALPHA_UPPER)

Your Environment

spaCy version: 2.3.2
Platform: Darwin-19.6.0-x86_64-i386-64bit
Python version: 3.6.12


svlandeg · 2021-01-11T19:33:21Z

Yes, I see your point. I think you'd want to add the rule

r"(?<=[{au}]&[{au}])\.".format(au=ALPHA_UPPER),

to the _suffixes. Unfortunately you can't just put an optional & in the existing rule, because the look-behind can't be variable-width.
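To illustrate the fixed-width constraint, here is a minimal sketch using only Python's `re` module. It assumes plain `A-Z` as a stand-in for spaCy's much larger `ALPHA_UPPER` class, and appends `$` to each pattern by hand to mimic how suffix rules are anchored at the end of a token; the variable names are made up for this example.

```python
import re

# Assumption: "A-Z" stands in for spaCy's ALPHA_UPPER character class.
AU = "A-Z"

# Existing rule: a period preceded by two uppercase letters.
existing = re.compile(r"(?<=[{au}][{au}])\.$".format(au=AU))

# Proposed additional rule: a period preceded by upper, "&", upper.
proposed = re.compile(r"(?<=[{au}]&[{au}])\.$".format(au=AU))

# Folding both into one rule with an optional "&" makes the look-behind
# variable-width, which Python's re module rejects at compile time.
try:
    re.compile(r"(?<=[{au}]&?[{au}])\.$".format(au=AU))
    variable_width_ok = True
except re.error:
    variable_width_ok = False

existing_hits_pl = existing.search("P&L.") is not None
proposed_hits_pl = proposed.search("P&L.") is not None
existing_hits_usa = existing.search("USA.") is not None
```

With these stand-in patterns, only the proposed rule matches the final period of "P&L.", while the existing rule still handles plain acronyms such as "USA.".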

If you're training a custom model, you could modify this behaviour for your own custom tokenizer, cf https://spacy.io/usage/linguistic-features#native-tokenizer-additions. You could also replace the tokenizer of a pretrained model with your own custom tokenizer, though that may impact accuracy slightly (though maybe not so much in this case).
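A rough sketch of that customization, following the pattern shown in the linked usage docs: extend the default suffixes, recompile the suffix regex, and assign it to the tokenizer. It assumes a blank English pipeline (`spacy.blank("en")`) rather than `en_core_web_sm`, so no model download is needed; `amp_rule` is just an illustrative name, and behaviour may differ slightly between spaCy versions.

```python
import spacy
from spacy.lang.char_classes import ALPHA_UPPER
from spacy.util import compile_suffix_regex

# Assumption: a blank English pipeline is enough here, since only the
# rule-based tokenizer is involved.
nlp = spacy.blank("en")

# The extra suffix rule suggested above: split a final period that
# follows upper, "&", upper (as in "P&L.").
amp_rule = r"(?<=[{au}]&[{au}])\.".format(au=ALPHA_UPPER)

# Defaults.suffixes is a tuple in some versions and a list in others,
# so convert to a list before appending.
suffixes = list(nlp.Defaults.suffixes) + [amp_rule]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

tokens = [t.text for t in nlp.tokenizer("about the P&L.")]
print(tokens)
```

This only changes the tokenizer of the `nlp` object it is applied to; the shared rules in lang/punctuation.py stay untouched.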

We're typically hesitant to change the punctuation rules in the core library though, because there may be unwanted side effects, especially when changing the lang/punctuation.py file that is used as base for many other languages. On spaCy's develop branch, we have a specific punctuation file for English, https://github.com/explosion/spaCy/blob/develop/spacy/lang/en/punctuation.py, where we could consider adding this change for English only.

I've been trying to think of "bad" consequences of adding your proposed "ampersand" rule to the English tokenizer and can't immediately think of one. I'm less sure about other languages. Would be interested to hear what my colleagues think - e.g. @adrianeboyd ?

adrianeboyd · 2021-01-15T14:09:03Z

I can't think of anything major, but to be on the safe side we should test it with all the internal training corpora. Let me see...

MucAlex · 2021-01-27T13:29:26Z

I am experiencing a similar behavior with the German word "GmbH".

from spacy.lang.de import German
nlp = German()
[tok for tok in nlp("Herr Bert ist Geschäftsführer der Ernie GmbH.")]

Results in

[Herr, Bert, ist, Geschäftsführer, der, Ernie, GmbH.]

I followed the example above and added a specific rule to _suffixes

svlandeg added feat / tokenizer Feature: Tokenizer lang / en English language data and models labels Jan 8, 2021
