en_core_web_trf (3.8.0) ORG predictions seem inaccurate compared to en_core_web_trf (3.6.1) #13734

Open
vitaly-d opened this issue Jan 25, 2025 · 1 comment

@vitaly-d

en_core_web_trf (3.8.0) labels CARDINAL tokens as ORG. This happens in the affiliation sections of many scientific manuscripts I tried. Interestingly, only the transformer pipeline shows this new, unexpected behavior; NER from en_core_web_lg (3.8.0) works as expected.

How to reproduce the behaviour

from pprint import pprint

import spacy

text = "Kelly E. Williams 1,2,3* , Kathryn P. Huyvaert 2 , Kurt C. Vercauteren 1 , Amy J. Davis 1 , Antoinette J. Piaggio 1\n1 USDA, Wildlife Services, National Wildlife Research Center, Wildlife Genetics Lab, 4101 Laporte Avenue, Fort Collins, CO, USA\n2 Department of Fish, Wildlife, and Conservation Biology, Colorado State University, Fort Collins, CO, 80523, USA\n3 School of Environmental and Forest Sciences, University of Washington, Seattle, WA, USA"
nlp = spacy.load("en_core_web_trf")
doc = nlp(text)
pprint([ent for ent in doc.ents if ent.label_ == "ORG"])

spaCy version 3.6.1 (en_core_web_trf (3.6.1)) returns:

[USDA,
 Wildlife Services,
 National Wildlife Research Center,
 Wildlife Genetics Lab,
 Department of Fish, Wildlife, and Conservation Biology,
 Colorado State University,
 School of Environmental and Forest Sciences,
 University of Washington]

spaCy version 3.8.4 (en_core_web_trf (3.8.0)) returns:

[USDA,
 Wildlife Services,
 National Wildlife Research Center,
 Wildlife Genetics Lab,
 USA
,
 2 Department of Fish, Wildlife,,       <=== ORG instead of CARDINAL for "2"
 Colorado State University,
 USA
,
 3 School of Environmental and Forest Sciences,       <=== ORG instead of CARDINAL for "3"
 University of Washington]

Your Environment

Info about spaCy

  • spaCy version: 3.8.4
  • Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
  • Platform: macOS-15.2-arm64-arm-64bit
  • Python version: 3.11.11
  • Pipelines: en_core_web_trf (3.8.0)
@vitaly-d

Update: the same issue occurs with en_core_web_trf (3.7.3):

Info about spaCy

  • spaCy version: 3.7.6
  • Platform: macOS-15.2-arm64-arm-64bit
  • Python version: 3.11.11
  • Pipelines: en_core_web_trf (3.7.3)
["'1 USDA'",                                            <-----
 "'Wildlife Services'",
 "'National Wildlife Research Center'",
 "'Wildlife Genetics Lab'",
 "'USA\n'",
 "'2 Department of Fish, Wildlife,'",                   <-----
 "'Colorado State University'",
 "'USA\n'",
 "'3 School of Environmental and Forest Sciences'",     <-----
 "'University of Washington'"]
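
Until the model behavior is resolved, one possible stopgap is to post-process the predicted spans. The sketch below is not part of spaCy and is not a fix for the model itself: `trim_org_spans` is a hypothetical helper that assumes the entities have been exported as `(start_char, end_char, label)` tuples, and that a leading number followed by whitespace inside an ORG span is an affiliation marker rather than part of the name.

```python
import re

# Hypothetical helper, not part of spaCy. Assumption: an ORG span that
# starts with digits followed by whitespace has absorbed an affiliation
# marker such as the "2 " in "2 Department of Fish, Wildlife,".
_LEADING_MARKER = re.compile(r"^\s*\d+\s+")


def trim_org_spans(text, ents):
    """Trim leading numeric markers and trailing whitespace from ORG spans.

    `ents` is an iterable of (start_char, end_char, label) tuples that
    index into `text`. Non-ORG entities are returned unchanged.
    """
    cleaned = []
    for start, end, label in ents:
        if label == "ORG":
            # Drop a leading number plus the whitespace after it.
            match = _LEADING_MARKER.match(text[start:end])
            if match:
                start += match.end()
            # Drop trailing whitespace such as the "\n" in "USA\n".
            end = start + len(text[start:end].rstrip())
        # Skip spans that became empty after trimming.
        if start < end:
            cleaned.append((start, end, label))
    return cleaned
```

With the reproduction script above, the input tuples could be built as `[(e.start_char, e.end_char, e.label_) for e in doc.ents]`, and the trimmed offsets could be turned back into spans with `doc.char_span`. This only removes the absorbed markers and the trailing newline; it does not address "USA" being tagged as ORG or the "Department of Fish, Wildlife," span ending early.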
