Parser doesn't respect preset sentence boundaries in some cases #7716

Open
polm opened this issue Apr 9, 2021 · 7 comments
Labels
bug (Bugs and behaviour differing from documentation), feat / parser (Feature: Dependency Parser)

Comments

@polm
Contributor

polm commented Apr 9, 2021

This is for the issue found in #7564.

How to reproduce the behaviour

Given a sentence, set is_sent_start to False on some, but not all, of the tokens before the doc reaches the parser.

Example sentence:

According to the sections (1) and (2) of this page.

Example settings:

According True
to None
the None
sections None
( True
1 None
) None
and False
( False
2 None
) None
of False
this None
page None
. None

In this case the parser contradicts the False settings. However, if you shift them down by one token each, they are respected. This looks like an off-by-one error somewhere in the parser.
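For concreteness, here is a minimal sketch of how the settings above could be applied in a component added before the parser (the component name preset_boundaries and the hard-coded token indices are only illustrative):

import spacy
from spacy.language import Language

# Token indices for the example sentence, with the values from the table above.
PRESETS = {0: True, 4: True, 7: False, 8: False, 11: False}

@Language.component("preset_boundaries")
def preset_boundaries(doc):
    for i, value in PRESETS.items():
        doc[i].is_sent_start = value
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("preset_boundaries", before="parser")
doc = nlp("According to the sections (1) and (2) of this page.")
print([(tok.text, tok.is_sent_start) for tok in doc])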

Your Environment

  • Operating System: Linux
  • Python Version Used:
  • spaCy Version Used: 3.0.5
  • Environment Information:
@polm
Contributor Author

polm commented Aug 17, 2021

Minimal reproduction code for this issue:

import spacy
from spacy.language import Language


@Language.component("special_split")
def special_split(doc):
    # Note this code is not robust with handling indices
    for ii, tok in enumerate(doc):
        if ii == 0:
            tok.is_sent_start = True
        if tok.text == "(":
            tok.is_sent_start = True
        if ii + 1 < len(doc) and doc[ii+1].text == "2":
            tok.is_sent_start = False
        if tok.text == ")":
            doc[ii+1].is_sent_start = False
        print(tok, tok.is_sent_start, sep="\t")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("special_split", before="parser")

text = "According to the sections (1) and (2) of this page."

doc = nlp(text)
for sent in doc.sents:
    print(sent)

Wrong output, because the parser overrules the user-set is_sent_start values:

According to the sections
(1) and
(2) of this page.

Desired output (other valid outputs possible):

According to the sections
(1) and (2) of this page.

However, under v3.1.1 the output is the desired output above, not the wrong one. This could simply be because small changes in the model cause its predictions to align with the user settings by happy accident. I'll see if I can reproduce the bug by modifying the sentence a bit.

@polm
Contributor Author

polm commented Aug 18, 2021

Using the 3.0 model with spaCy 3.1, the wrong sentence splits still show up, so it looks like the "fix" in 3.1 is indeed a happy accident. Also note that these tests are all with the small English model; I have not been able to construct other sentences that show this issue.

There aren't many places where the sentence starts are set, so this isn't very specific information, but I was able to track the problematic change down to set_children_from_heads. It seems that when the parser makes sure heads are in the right sentences, it doesn't pay any attention to existing sentence boundaries. I'm not sure where in the parser it checks that a single sentence doesn't have multiple heads, but I suspect something is missing there.
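To check this concretely, here is a small diagnostic sketch (the component name check_presets and the chosen token index are only illustrative): record a boundary set before the parser, then compare it with what ends up on the finished doc.

import spacy
from spacy.language import Language

@Language.component("check_presets")
def check_presets(doc):
    # Preset one boundary before the parser and remember it; index 7 is "and"
    # in the example sentence.
    doc[7].is_sent_start = False
    doc.user_data["presets"] = {7: False}
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("check_presets", before="parser")
doc = nlp("According to the sections (1) and (2) of this page.")
for i, expected in doc.user_data["presets"].items():
    print(f"token {i} ({doc[i].text!r}): preset={expected}, after parsing={doc[i].is_sent_start}")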

@scott-greenberg-vitalsource

scott-greenberg-vitalsource commented Oct 19, 2021

I have a case where this happens with en_core_web_lg, using spaCy v3.1.3 and model v3.1.0:

from spacy.language import Language
import spacy

nlp = spacy.load('en_core_web_lg')
text = """
Early sociologists include Jim Henson, Sarah Smith, Albert Johnson, Nelson Mandela, Mahatma Ghandi, Samsonite Jones, and W. E. B. Du Bois. Pp. 10–111.
"""

print(nlp.component_names)

doc = nlp(text)

print(list(doc.sents))
print(len(list(doc.sents)))


@Language.component('_custom_sbd')
def _custom_sbd(doc):
    for tok in doc:
        if 'W.' in tok.text:
            print(tok)
            tok.is_sent_start = False
    return doc


nlp.add_pipe('_custom_sbd', before='parser')
print(nlp.component_names)

doc = nlp(text)

print(list(doc.sents))
print(len(list(doc.sents)))

You can follow a debugger/trace into Language.__call__() and watch where _custom_sbd sets the sentence boundary to False, followed by the parser setting it back to True.
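If you'd rather not step through with a debugger, a rough sketch of the same check (assuming the _custom_sbd component above has already been added before the parser) is to run the components one at a time and print the flag after each step:

# Walk the pipeline manually and watch where the boundary flips back to True.
doc = nlp.make_doc(text)
for name, proc in nlp.pipeline:
    doc = proc(doc)
    flags = [(tok.text, tok.is_sent_start) for tok in doc if 'W.' in tok.text]
    print(name, flags)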

@polm
Contributor Author

polm commented Oct 20, 2021

Thanks for the report and debugging info!

@scott-greenberg-vitalsource

@polm no problem! If there's any more info I can provide that would be helpful, please let me know.

@itssimon
Contributor

itssimon commented Nov 8, 2022

I'm also hitting this issue after updating spaCy from v2 to v3. Are there any workarounds until it is fixed in the parser? Unfortunately it's a major issue in our use case.

@polm
Contributor Author

polm commented Nov 9, 2022

Sorry this is a major issue for you. Just to check something, this is happening because you're using partial sentence annotations? If you have some examples we could check, especially of short sentences, it might be helpful, just in case it brings up a pattern we haven't seen before. It would also be helpful to know why you need this - for example, do you also have legal text with unusual formatting, like the original reporter, or is it something else?

As far as workarounds, a few come to mind, though none are great:

  • set all sentence boundaries explicitly instead of only some (see the sketch after this list)
  • rather than setting boundaries before the parser runs, set them afterwards
  • try setting them both before and after the parser runs, which might do a better job of directing the parser
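For the first option, a minimal sketch might look like this (the component name full_boundaries and the naive split-after-period rule are placeholders for whatever logic produces your annotations); with no None values left, the parser has nothing to fill in:

import spacy
from spacy.language import Language

@Language.component("full_boundaries")
def full_boundaries(doc):
    # Give every token an explicit True/False so no boundary is left to the
    # parser; the split-after-period rule here is just a placeholder.
    for tok in doc:
        tok.is_sent_start = tok.i == 0 or doc[tok.i - 1].text == "."
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("full_boundaries", before="parser")
doc = nlp("According to the sections (1) and (2) of this page.")
print([sent.text for sent in doc.sents])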
