
Tokenizer special cases do not work around infix punctuation #5598

Open · cassidylaidlaw opened this issue Jun 16, 2020 · 3 comments
Labels: enhancement (Feature requests and improvements), feat / tokenizer (Feature: Tokenizer), lang / en (English language data and models)

Comments

@cassidylaidlaw

How to reproduce the behaviour

I would expect the two sentences below to be tokenized the same way; they differ only in whether there are spaces around the slash. However, in the second, the special cases for "can't" and "won't" are not applied.

>>> import en_core_web_sm
>>> nlp = en_core_web_sm.load()
>>> [token.text for token in nlp("I can't / won't tolerate that.")]
['I', 'ca', "n't", '/', 'wo', "n't", 'tolerate', 'that', '.']
>>> [token.text for token in nlp("I can't/won't tolerate that.")] 
['I', "can't", '/', "won't", 'tolerate', 'that', '.']

Your Environment

  • spaCy version: 2.3.0
  • Platform: Darwin-18.7.0-x86_64-i386-64bit
  • Python version: 3.7.4
@svlandeg added the feat / tokenizer and lang / en labels on Jun 16, 2020
@adrianeboyd (Contributor)

Thanks for the report!

There are a number of changes coming in spaCy v3 that make the tokenizer more consistent, in particular for special cases containing prefix/suffix/infix punctuation, which don't work consistently in v2. However, checking special cases for the parts produced by infix splitting isn't one of them.

Checking for special cases around infixes is relatively simple to add, but I'd need to check whether it slows the tokenizer down too much. If it is a lot slower, I think we can consider adding an option that enables more thorough special case handling, which would be off by default.
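
In the meantime, one possible workaround (my sketch, not something suggested by the maintainers) is to register the merged forms you care about as tokenizer special cases yourself: each whitespace-delimited substring is checked against the special cases before prefix/suffix/infix processing, so an entry for the exact string "can't/won't" should apply. Assuming spaCy v2.3 and en_core_web_sm:

import en_core_web_sm
from spacy.symbols import ORTH

nlp = en_core_web_sm.load()

# Register the merged form as a special case; the ORTH values must
# concatenate back to the original string exactly.
nlp.tokenizer.add_special_case(
    "can't/won't",
    [{ORTH: "ca"}, {ORTH: "n't"}, {ORTH: "/"}, {ORTH: "wo"}, {ORTH: "n't"}],
)

print([t.text for t in nlp("I can't/won't tolerate that.")])
# ['I', 'ca', "n't", '/', 'wo', "n't", 'tolerate', 'that', '.']

This only covers the exact strings you add, and, as noted above, special cases containing punctuation aren't handled fully consistently in v2.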

@adrianeboyd (Contributor)

It turned out that this has too much of an effect on the existing tokenizer settings, which were designed without this infix special case checking. It might be possible to add an option to the tokenizer to allow this, but we're wary of adding even more options, so for now we're going to put the idea on hold.

@svlandeg added the enhancement label on Dec 22, 2020
@veonua commented Mar 6, 2021

The tokenizer is built around whitespace, so it has trouble splitting strings that contain no spaces. I ran into this with the string "($10/$20)". By default it is tokenized as

(, $, 10/$20, )

and after adding "/" to the infixes it becomes

(, $, 10, /, $20, )

rather than also splitting "$20", because the infix patterns aren't powerful enough to re-run the prefix/suffix behaviour on the resulting pieces.
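
For reference, a rough sketch of the infix tweak described above (adding "/" to the infix patterns is my assumption about how the second result was produced, and splitting on every slash may be too aggressive for other text):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
print([t.text for t in nlp("($10/$20)")])
# default: ['(', '$', '10/$20', ')']

# Add "/" as an extra infix pattern.
infixes = list(nlp.Defaults.infixes) + [r"/"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("($10/$20)")])
# with the extra infix: ['(', '$', '10', '/', '$20', ')']
# The second "$" still isn't split off, because prefix/suffix handling
# isn't re-applied to the pieces produced by infix splitting.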
