Parser doesn't respect preset sentence boundaries in some cases #7716
Minimal reproduction code for this issue:
Wrong output, because the parser overrules the boundaries set by the user:
Desired output (other valid outputs possible):
However, under v3.1.1 the output is the desired output above, not the bad output. This could be because small changes in the model happen to make its predictions align with the user settings. I'll see if I can reproduce the bug by modifying the sentence a bit.
Using the 3.0 model with 3.1, the wrong sentence splits still show up, so it looks like the "fix" in 3.1 is a happy accident. Also note that these tests are all with the small English model. I have not been able to construct other sentences that show this issue. There aren't many places where the sentence starts are set, so this isn't very specific information, but I was able to track down the problematic change to
I have a case where this happens for
you can follow a debugger/trace into Language.__call__() and watch where the
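As an alternative to stepping through with a debugger, the rough sketch below snapshots is_sent_start after each pipeline component and reports the first component that changed a value the user had already set. The helper names trace_sent_starts and first_override are made up for illustration and are not part of spaCy. The sketch assumes components can be called directly as component(doc) and that they return the doc, which skips whatever else Language.__call__ does around each step.

```python
def trace_sent_starts(doc, pipeline):
    """Apply each (name, component) pair to doc in order.

    Returns a list of (name, values) snapshots, where values is the list of
    is_sent_start values (True, False or None) after that step. The first
    snapshot, named "<input>", holds the values before any component ran.
    """
    snapshots = [("<input>", [tok.is_sent_start for tok in doc])]
    for name, component in pipeline:
        doc = component(doc)
        snapshots.append((name, [tok.is_sent_start for tok in doc]))
    return snapshots


def first_override(snapshots):
    """Return the name of the first component that changed a preset value.

    Only positions that were True or False in the input count as preset.
    Positions that were None are free for any component to fill in.
    Returns None if every preset value survived the whole pipeline.
    """
    preset = snapshots[0][1]
    for name, values in snapshots[1:]:
        for before, after in zip(preset, values):
            if before is not None and after != before:
                return name
    return None
```

With spaCy this would presumably be called as trace_sent_starts(doc, nlp.pipeline), where doc comes from nlp.make_doc(text) and has had its is_sent_start values set by hand first.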
Thanks for the report and debugging info!
@polm no problem! If there's any more info I can provide that would be helpful, please let me know.
I'm also hitting this issue after updating spaCy from v2 to v3. Are there any workarounds until it is fixed in the parser? Unfortunately it's a major issue in our use case.
Sorry this is a major issue for you. Just to check something, this is happening because you're using partial sentence annotations? If you have some examples we could check, especially of short sentences, it might be helpful, just in case it brings up a pattern we haven't seen before. It would also be helpful to know why you need this - for example, do you also have legal text with unusual formatting, like the original reporter, or is it something else? As far as workarounds, a couple come to mind, though none are great.
This is for the issue found in #7564.
How to reproduce the behaviour
Given a sentence, set is_sent_start to False on some but not all of the tokens before it gets to the parser.
Example sentence:
Example settings:
In this case the parser will contradict the False settings. However, if you move them one down to the next token, they will be respected. This seems like an off-by-one error in the parser somewhere.
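To make "the parser contradicts the False settings" checkable, here is a minimal sketch that compares the values set before parsing with the values read back afterwards. It is not the original reproduction code. The function name overridden_presets and the example values are invented for illustration, and it assumes both lists were collected with something like [t.is_sent_start for t in doc], where None means "not set".

```python
def overridden_presets(preset, final):
    """List the positions where a user-set boundary value was changed.

    preset: is_sent_start values before the parser ran (True, False or None).
    final:  is_sent_start values after the full pipeline ran.
    Returns (index, preset_value, final_value) tuples. Positions that were
    None beforehand are ignored, since the parser is free to decide those.
    """
    if len(preset) != len(final):
        raise ValueError("preset and final must cover the same tokens")
    return [
        (i, before, after)
        for i, (before, after) in enumerate(zip(preset, final))
        if before is not None and after != before
    ]


# Made-up values: token 2 was preset to False but ended up as a sentence start.
example_preset = [True, False, False, None, None]
example_final = [True, False, True, False, False]
print(overridden_presets(example_preset, example_final))  # [(2, False, True)]
```

An empty list means every preset value was respected, which is what the desired output should give.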
Your Environment