Parser doesn't respect preset sentence boundaries in some cases #7716

Open
polm opened this issue Apr 9, 2021 · 7 comments
Labels
bug (Bugs and behaviour differing from documentation), feat / parser (Feature: Dependency Parser)

Comments

@polm
Contributor

polm commented Apr 9, 2021

This is for the issue found in #7564.

How to reproduce the behaviour

Given a sentence, set is_sent_start to False on some, but not all, of the tokens before the doc reaches the parser.

Example sentence:

According to the sections (1) and (2) of this page.

Example settings:

According True
to None
the None
sections None
( True
1 None
) None
and False
( False
2 None
) None
of False
this None
page None
. None

In this case the parser contradicts the False settings. However, if you shift them down by one token each, they are respected. This looks like an off-by-one error somewhere in the parser.
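For concreteness, here is a minimal sketch of how the settings above could be applied in a component added before the parser (the component name preset_boundaries and the hard-coded token indices are only illustrative):

import spacy
from spacy.language import Language

# Token indices for the example sentence, with the values from the table above.
PRESETS = {0: True, 4: True, 7: False, 8: False, 11: False}

@Language.component("preset_boundaries")
def preset_boundaries(doc):
    for i, value in PRESETS.items():
        doc[i].is_sent_start = value
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("preset_boundaries", before="parser")
doc = nlp("According to the sections (1) and (2) of this page.")
print([(tok.text, tok.is_sent_start) for tok in doc])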

Your Environment

  • Operating System: Linux
  • Python Version Used:
  • spaCy Version Used: 3.0.5
  • Environment Information:
@polm
Contributor Author

polm commented Aug 17, 2021

Minimal reproduction code for this issue:

import spacy
from spacy.language import Language


@Language.component("special_split")
def special_split(doc):
    # Note this code is not robust with handling indices
    for ii, tok in enumerate(doc):
        if ii == 0:
            tok.is_sent_start = True
        if tok.text == "(":
            tok.is_sent_start = True
        if ii + 1 < len(doc) and doc[ii+1].text == "2":
            tok.is_sent_start = False
        if tok.text == ")":
            doc[ii+1].is_sent_start = False
        print(tok, tok.is_sent_start, sep="\t")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("special_split", before="parser")

text = "According to the sections (1) and (2) of this page."

doc = nlp(text)
for sent in doc.sents:
    print(sent)

Wrong output, because the parser overrules the user-set is_sent_start values:

According to the sections
(1) and
(2) of this page.

Desired output (other valid outputs possible):

According to the sections
(1) and (2) of this page.

However, under v3.1.1 the output is the desired output above, not the wrong one. This could simply be because small changes in the model cause its predictions to align with the user settings by happy accident. I'll see if I can reproduce the bug by modifying the sentence a bit.

@polm
Contributor Author

polm commented Aug 18, 2021

Using the 3.0 model with spaCy 3.1, the wrong sentence splits still show up, so it looks like the "fix" in 3.1 is indeed a happy accident. Also note that these tests are all with the small English model; I have not been able to construct other sentences that show this issue.

There aren't many places where the sentence starts are set, so this isn't very specific information, but I was able to track the problematic change down to set_children_from_heads. It seems that when the parser makes sure heads are in the right sentences, it doesn't pay any attention to existing sentence boundaries. I'm not sure where in the parser it checks that a single sentence doesn't have multiple heads, but I suspect something is missing there.
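To check this concretely, here is a small diagnostic sketch (the component name check_presets and the chosen token index are only illustrative): record a boundary set before the parser, then compare it with what ends up on the finished doc.

import spacy
from spacy.language import Language

@Language.component("check_presets")
def check_presets(doc):
    # Preset one boundary before the parser and remember it; index 7 is "and"
    # in the example sentence.
    doc[7].is_sent_start = False
    doc.user_data["presets"] = {7: False}
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("check_presets", before="parser")
doc = nlp("According to the sections (1) and (2) of this page.")
for i, expected in doc.user_data["presets"].items():
    print(f"token {i} ({doc[i].text!r}): preset={expected}, after parsing={doc[i].is_sent_start}")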

@scott-greenberg-vitalsource

scott-greenberg-vitalsource commented Oct 19, 2021

I have a case where this happens with en_core_web_lg, using spaCy v3.1.3 and model v3.1.0:

from spacy.language import Language
import spacy

nlp = spacy.load('en_core_web_lg')
text = """
Early sociologists include Jim Henson, Sarah Smith, Albert Johnson, Nelson Mandela, Mahatma Ghandi, Samsonite Jones, and W. E. B. Du Bois. Pp. 10–111.
"""

print(nlp.component_names)

doc = nlp(text)

print(list(doc.sents))
print(len(list(doc.sents)))


@Language.component('_custom_sbd')
def _custom_sbd(doc):
    for tok in doc:
        if 'W.' in tok.text:
            print(tok)
            tok.is_sent_start = False
    return doc


nlp.add_pipe('_custom_sbd', before='parser')
print(nlp.component_names)

doc = nlp(text)

print(list(doc.sents))
print(len(list(doc.sents)))

You can follow a debugger/trace into Language.__call__() and watch where _custom_sbd sets the sentence boundary to False, followed by the parser setting it back to True.
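If you'd rather not step through with a debugger, a rough sketch of the same check (assuming the _custom_sbd component above has already been added before the parser) is to run the components one at a time and print the flag after each step:

# Walk the pipeline manually and watch where the boundary flips back to True.
doc = nlp.make_doc(text)
for name, proc in nlp.pipeline:
    doc = proc(doc)
    flags = [(tok.text, tok.is_sent_start) for tok in doc if 'W.' in tok.text]
    print(name, flags)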

@polm
Contributor Author

polm commented Oct 20, 2021

Thanks for the report and debugging info!

@scott-greenberg-vitalsource

@polm no problem! If there's any more info I can provide that would be helpful, please let me know.

@itssimon
Contributor

itssimon commented Nov 8, 2022

I'm also hitting this issue after updating spaCy from v2 to v3. Are there any workarounds until it is fixed in the parser? Unfortunately it's a major issue in our use case.

@polm
Contributor Author

polm commented Nov 9, 2022

Sorry this is a major issue for you. Just to check something, this is happening because you're using partial sentence annotations? If you have some examples we could check, especially of short sentences, it might be helpful, just in case it brings up a pattern we haven't seen before. It would also be helpful to know why you need this - for example, do you also have legal text with unusual formatting, like the original reporter, or is it something else?

As far as workarounds, a few come to mind, though none are great:

  • set all sentence boundaries explicitly instead of only some (see the sketch after this list)
  • rather than setting boundaries before the parser runs, set them afterwards
  • try setting them both before and after the parser runs, which might do a better job of directing the parser
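For the first option, a minimal sketch might look like this (the component name full_boundaries and the naive split-after-period rule are placeholders for whatever logic produces your annotations); with no None values left, the parser has nothing to fill in:

import spacy
from spacy.language import Language

@Language.component("full_boundaries")
def full_boundaries(doc):
    # Give every token an explicit True/False so no boundary is left to the
    # parser; the split-after-period rule here is just a placeholder.
    for tok in doc:
        tok.is_sent_start = tok.i == 0 or doc[tok.i - 1].text == "."
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("full_boundaries", before="parser")
doc = nlp("According to the sections (1) and (2) of this page.")
print([sent.text for sent in doc.sents])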
