en_core_web_trf (3.8.0) labels CARDINAL tokens as ORG. This happens in the affiliation sections of many scientific manuscripts I tried. Interestingly, only the transformer pipeline shows this new, unexpected behavior; NER from en_core_web_lg (3.8.0) works as expected.
How to reproduce the behaviour
from pprint import pprint

import spacy

text = "Kelly E. Williams 1,2,3* , Kathryn P. Huyvaert 2 , Kurt C. Vercauteren 1 , Amy J. Davis 1 , Antoinette J. Piaggio 1\n1 USDA, Wildlife Services, National Wildlife Research Center, Wildlife Genetics Lab, 4101 Laporte Avenue, Fort Collins, CO, USA\n2 Department of Fish, Wildlife, and Conservation Biology, Colorado State University, Fort Collins, CO, 80523, USA\n3 School of Environmental and Forest Sciences, University of Washington, Seattle, WA, USA"

nlp = spacy.load("en_core_web_trf")
doc = nlp(text)
pprint([ent for ent in doc.ents if ent.label_ == "ORG"])
spaCy version 3.6.1 (en_core_web_trf (3.6.1)) returns:
[USDA,
Wildlife Services,
National Wildlife Research Center,
Wildlife Genetics Lab,
Department of Fish, Wildlife, and Conservation Biology,
Colorado State University,
School of Environmental and Forest Sciences,
University of Washington]
spaCy version 3.8.4 (en_core_web_trf (3.8.0)) returns:
[USDA,
Wildlife Services,
National Wildlife Research Center,
Wildlife Genetics Lab,
USA
,
2 Department of Fish, Wildlife,, <=== ORG instead of CARDINAL for "2"
Colorado State University,
USA
,
3 School of Environmental and Forest Sciences, <=== ORG instead of CARDINAL for "3"
University of Washington]
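Because pprint shows each span by its raw text, a trailing newline inside an entity breaks the listing across lines, and a leading digit is easy to miss. Printing the `repr` of each entity's text makes the exact span boundaries visible. This is a minimal sketch, not necessarily what was run for this report: the helper name `describe_entities` is made up for illustration, and it assumes only that each entity exposes `.text` and `.label_`, as spaCy `Span` objects do.

```python
def describe_entities(ents, label="ORG"):
    """Return repr() of the text of every entity carrying the given label.

    Works on any objects with `.text` and `.label_` attributes,
    for example the spans in `doc.ents`. Using repr() keeps leading
    digits, trailing commas and embedded newlines visible.
    """
    return [repr(ent.text) for ent in ents if ent.label_ == label]
```

With the `doc` from the reproduction script, `pprint(describe_entities(doc.ents))` should give a quoted listing in the style of the one that follows.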
["'1 USDA'", <-----
"'Wildlife Services'",
"'National Wildlife Research Center'",
"'Wildlife Genetics Lab'",
"'USA\n'",
"'2 Department of Fish, Wildlife,'", <-----
"'Colorado State University'",
"'USA\n'",
"'3 School of Environmental and Forest Sciences'", <-----
"'University of Washington'"]
Your Environment
Info about spaCy