
Tokenizer special cases do not work around infix punctuation #5598

Open · cassidylaidlaw opened this issue Jun 16, 2020 · 3 comments
Labels: enhancement (Feature requests and improvements), feat / tokenizer (Feature: Tokenizer), lang / en (English language data and models)

Comments

@cassidylaidlaw

How to reproduce the behaviour

I would expect the two sentences below to be tokenized the same way; they differ only in whether there are spaces around the slash. However, in the second, the special cases for "can't" and "won't" are not applied.

>>> import en_core_web_sm
>>> nlp = en_core_web_sm.load()
>>> [token.text for token in nlp("I can't / won't tolerate that.")]
['I', 'ca', "n't", '/', 'wo', "n't", 'tolerate', 'that', '.']
>>> [token.text for token in nlp("I can't/won't tolerate that.")] 
['I', "can't", '/', "won't", 'tolerate', 'that', '.']

Your Environment

  • spaCy version: 2.3.0
  • Platform: Darwin-18.7.0-x86_64-i386-64bit
  • Python version: 3.7.4
@svlandeg added the feat / tokenizer and lang / en labels on Jun 16, 2020
@adrianeboyd (Contributor)

Thanks for the report!

There are a number of changes coming in spaCy v3 that make the tokenizer more consistent, in particular for special cases containing prefix/suffix/infix punctuation, which don't work consistently in v2. However, checking special cases for the parts produced by infix splitting isn't one of them.

Checking for special cases around infixes is relatively simple to add, but I'd need to check whether it slows the tokenizer down too much. If it is a lot slower, I think we can consider adding an option that enables more thorough special case handling, which would be off by default.
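
In the meantime, one possible workaround (my sketch, not something suggested by the maintainers) is to register the merged forms you care about as tokenizer special cases yourself: each whitespace-delimited substring is checked against the special cases before prefix/suffix/infix processing, so an entry for the exact string "can't/won't" should apply. Assuming spaCy v2.3 and en_core_web_sm:

import en_core_web_sm
from spacy.symbols import ORTH

nlp = en_core_web_sm.load()

# Register the merged form as a special case; the ORTH values must
# concatenate back to the original string exactly.
nlp.tokenizer.add_special_case(
    "can't/won't",
    [{ORTH: "ca"}, {ORTH: "n't"}, {ORTH: "/"}, {ORTH: "wo"}, {ORTH: "n't"}],
)

print([t.text for t in nlp("I can't/won't tolerate that.")])
# ['I', 'ca', "n't", '/', 'wo', "n't", 'tolerate', 'that', '.']

This only covers the exact strings you add, and, as noted above, special cases containing punctuation aren't handled fully consistently in v2.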

@adrianeboyd (Contributor)

It turned out that this has too much of an effect on the existing tokenizer settings, which were designed without this infix special case checking. It might be possible to add an option to the tokenizer to allow this, but we're wary of adding even more options, so for now we're going to put the idea on hold.

@svlandeg added the enhancement label on Dec 22, 2020
@veonua commented Mar 6, 2021

The tokenizer is built around whitespace, so it has trouble splitting strings that contain no spaces. I ran into this with the string "($10/$20)". By default it is tokenized as

(, $, 10/$20, )

and after adding "/" to the infixes it becomes

(, $, 10, /, $20, )

rather than also splitting "$20", because the infix patterns aren't powerful enough to re-run the prefix/suffix behaviour on the resulting pieces.
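
For reference, a rough sketch of the infix tweak described above (adding "/" to the infix patterns is my assumption about how the second result was produced, and splitting on every slash may be too aggressive for other text):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
print([t.text for t in nlp("($10/$20)")])
# default: ['(', '$', '10/$20', ')']

# Add "/" as an extra infix pattern.
infixes = list(nlp.Defaults.infixes) + [r"/"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("($10/$20)")])
# with the extra infix: ['(', '$', '10', '/', '$20', ')']
# The second "$" still isn't split off, because prefix/suffix handling
# isn't re-applied to the pieces produced by infix splitting.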
