Arabic model: wrong sentence splitting (PADT) #1393

Open
rahonalab opened this issue May 13, 2024 · 2 comments

@rahonalab

I am trying to parse Arabic texts using the pretrained model (PADT), but some long portions of text are recognized as a single sentence.

For example, this Arabic passage results in a single sentence:

ﻮﺒﺳﺮﻋﺓ ﺖﺒﻌﺘﻫ ﺄﻠﻴﺳ ﻮﺴﻘﻄﺗ ﻑﻯ ﻦﻔﻗ ﻁﻮﻴﻟ ﺎﻨﺘﻫﻯ ﺐﻫﺍ ﺈﻟﻯ ﺏﻻﺩ ﺎﻠﻌﺟﺎﺌﺑ, ﻭﺈﻟﻯ ﻉﺎﻠﻣ ﻢﺜﻳﺭ ﻢﻧ ﺎﻠﻤﻏﺎﻣﺭﺎﺗ. . . ﻒﻬﻳﺍ ﻦﻠﺤﻗ ﺐﻫﺍ ﻞﻨﺧﻮﺿ ﻢﻌﻫﺍ ﺖﻠﻛ ﺎﻟﺮﺤﻟﺓ ﺎﻠﻣﺪﻬﺷﺓ . ﺎﻠﻔﺼﻟ ﺍﻷﻮﻟ ﺎﻠﺴﻗﻮﻃ ﻑﻯ ﺞﺣﺭ ﺍﻷﺮﻨﺑ ﺏﺩﺃ ﺎﻠﻤﻠﻟ ﻲﺴﻴﻃﺭ ﻊﻟﻯ ﺄﻠﻴﺳ ﻮﻬﻳ ﺖﺠﻠﺳ ﺏﺎﻠﻗﺮﺑ ﻢﻧ ﺄﺨﺘﻫﺍ ﻊﻟﻯ ﺾﻓﺓ ﺎﻠﻨﻫﺭ، ﻻ ﺖﻔﻌﻟ ﺶﻴﺋًﺍ ﺱﻭﻯ ﺈﻠﻗﺍﺀ ﻦﻇﺭﺓ ﺥﺎﻄﻓﺓ ﺐﻴﻧ ﺎﻠﺤﻴﻧ ﻭﺍﻶﺧﺭ ﻊﻟﻯ ﺎﻠﻜﺗﺎﺑ ﺎﻟﺫﻯ ﺖﻃﺎﻠﻌﻫ ﺄﺨﺘﻫﺍ، ﻞﻜﻨﻫ ﻙﺎﻧ ﻚﺗﺎﺑﺍ ﺏﻻ ﺹﻭﺭ ﻮﻳﻻ ﺡﻭﺍﺭ؛ ﻒﺣﺪﺜﺗ ﻦﻔﺴﻫﺍ ﻕﺎﺌﻟﺓٌ ﻮﻣﺍ ﻑﺎﺋﺩﺓ ﻚﺗﺎﺑ ﺥﺎﻟ ﻢﻧ ﺎﻠﺻﻭﺭ ﻮﻤﻧ ﺎﻠﺣﻭﺍﺭ؟ ﻭﺄﺧﺬﺗ ﺖﻔﻛﺭ (ﻕﺩﺭ ﻡﺍ ﺎﺴﺘﻃﺎﻌﺗ؛ ﻒﺷﺩﺓ ﺎﻠﺣﺭﺍﺭﺓ ﺞﻌﻠﺘﻫﺍ ﺖﺸﻋﺭ ﺐﻨﻋﺎﺳ ﺵﺪﻳﺩ ﻮﺘﺒﻟﺩ)... ﻪﻟ ﺺﻨﻋ ﻊﻗﺩ ﻢﻧ ﺰﻫﺭﺓ ﺎﻟﺮﺒﻴﻋ ﻲﺴﺘﺤﻗ ﺎﻠﻨﻫﻮﺿ ﻮﻘﻄﻓ ﺍﻷﺰﻫﺍﺭ؟ ﻮﻔﺟﺃﺓ! ﻞﻤﺤﺗ ﺃﺮﻨﺑًﺍ ﺄﺒﻴﺿ ﻞﻫ ﻊﻴﻧﺎﻧ ﻭﺭﺪﻴﺗﺎﻧ ﻲﻣﺭ ﺏﺎﻠﻗﺮﺑ ﻢﻨﻫﺍ ﻞﻣ ﺖﺴﺘﻏﺮﺑ ﺄﻠﻴﺳ ﻝﺬﻠﻛ ﻭﻻ ﻞﺴﻣﺎﻋ ﺍﻷﺮﻨﺑ ﻮﻫﻯ ﻲﺣﺪﺛ ﻦﻔﺴﻫ ﻕﺎﺋﻻ ﻱﺍ ﺈﻠﻫﻯ! ﻱﺍ ﺈﻠﻫﻯ! ﺱﻮﻓ ﺄﺗﺄﺧﺭ (ﻮﺤﻴﻧ ﻒﻛﺮﺗ ﻑﻯ ﺬﻠﻛ ﻒﻴﻣﺍ ﺐﻋﺩ ﺦﻃﺭ ﻞﻫﺍ ﺄﻨﻫ ﻙﺎﻧ ﻊﻠﻴﻫﺍ ﺄﻧ ﺖﺴﺘﻏﺮﺑ ﺍﻸﻣﺭ، ﻞﻜﻧ ﻚﻟ ﺬﻠﻛ ﺏﺩﺍ ﻂﺒﻴﻌﻳﺍ ﺝﺩﺍ ﺂﻧﺫﺎﻛ) ﻮﻠﻜﻧ ﻊﻧﺪﻣﺍ ﺄﺧﺮﺟ ﺍﻷﺮﻨﺑ ﺱﺎﻋﺓ ﻢﻧ ﺞﻴﺑ ﺹﺩﺍﺮﻫ ﻮﻨﻇﺭ ﻒﻴﻫﺍ ﺚﻣ ﻢﺿﻯ ﻢﺳﺮﻋﺍ ﻮﻘﻔﺗ ﺄﻠﻴﺳ ﻑﻯ ﺎﻧﺪﻫﺎﺷ؛ ﺇﺫ ﺦﻃﺭ ﻞﻫﺍ ﺄﻨﻫﺍ ﻞﻣ ﺖﺷﺎﻫﺩ ﻖﻃ ﺃﺮﻨﺑﺍ ﻝﺪﻴﻫ ﺞﻴﺑ ﺹﺩﺍﺭ ﻭﻻ ﺱﺎﻋﺓ ﻲﺧﺮﺠﻫﺍ ﻢﻧ ﺬﻠﻛ ﺎﻠﺠﻴﺑ ﻮﻤﻧ ﺵﺩﺓ ﻒﺿﻮﻠﻫﺍ ﺝﺮﺗ ﻊﺑﺭ ﺎﻠﺤﻘﻟ ﻢﺘﺘﺒﻋﺓ ﺍﻷﺮﻨﺑ ﻮﻠﺤﺴﻧ ﺢﻈﻫﺍ ﻞﺤﻘﺗ ﺐﻫ ﻮﻫﻭ ﻲﺨﺘﻓﻯ ﺐﺳﺮﻋﺓ ﻑﻯ ﺞﺣﺭ ﻚﺒﻳﺭ ﺖﺤﺗ ﺎﻠﺳﻭﺭ. ﺎﻧﺰﻠﻘﺗ ﺄﻠﻴﺳ ﻭﺭﺍﺀﻩ ﺩﻮﻧ ﺄﻧ ﺖﺗﻮﻘﻓ ﻞﺤﻇﺓ ﻞﺘﻔﻛﺭ ﻚﻴﻓ ﺲﺘﺘﻤﻜﻧ ﻢﻧ ﺎﻠﺧﺭﻮﺟ ﺐﻋﺩ ﺬﻠﻛ. ﺎﻤﺗﺩ ﺞﺣﺭ ﺍﻷﺮﻨﺑ ﻢﺜﻟ ﺎﻠﻨﻔﻗ ﻞﻤﺳﺎﻓﺓ ﻖﺼﻳﺭﺓ ﺚﻣ ﺎﻨﺣﺩﺭ ﻒﺟﺃﺓ, ﻮﻠﻣ ﻲﻜﻧ ﻝﺩﻯ ﺄﻠﻴﺳ ﺄﻳﺓ ﻑﺮﺻﺓ ﻞﺘﻤﻨﻋ ﻦﻔﺴﻫﺍ ﻢﻧ ﺎﻠﺴﻗﻮﻃ ﻑﻯ ﺐﺋﺭ ﻊﻤﻴﻗﺓ ﺝﺩﺍ. ﻭﺎﻠﺒﺋﺭ ﻙﺎﻨﺗ ﺈﻣﺍ ﻊﻤﻴﻗﺓ ﺝﺩﺍ، ﺃﻯ ﺄﻧ ﺄﻠﻴﺳ ﺲﻘﻄﺗ ﺐﺒﻃﺀ ﺵﺪﻳﺩ، ﻒﻗﺩ ﻙﺎﻧ ﻝﺪﻴﻫﺍ ﻢﺘﺴﻋ ﻢﻧ ﺎﻟﻮﻘﺗ ﻞﺘﻨﻇﺭ ﻢﻧ ﺡﻮﻠﻫﺍ ﻮﻫﻯ ﺖﺴﻘﻃ، ﻮﻠﺘﺘﺳﺍﺀﻝ ﻊﻣﺍ ﺲﻴﺣﺪﺛ ﻒﻴﻣﺍ ﺐﻋﺩ. ﻑﻯ ﺎﻠﺑﺩﺎﻳﺓ ﺡﺍﻮﻠﺗ ﺄﻧ ﺖﻨﻇﺭ ﺈﻟﻯ ﺍﻸﺴﻔﻟ ﻞﺘﺘﺒﻴﻧ ﻡﺍ ﻲﻨﺘﻇﺮﻫﺍ، ﻮﻠﻜﻧ ﺎﻠﻇﻼﻣ ﻙﺎﻧ ﺡﺎﻠﻛﺍ ﻮﻠﻣ ﺖﺴﺘﻄﻋ ﺄﻧ ﺕﺭﻯ ﺶﻴﺋﺍ، ﺚﻣ ﻦﻇﺮﺗ ﺈﻟﻯ ﺝﻭﺎﻨﺑ ﺎﻠﺒﺋﺭ، ﻭﻼﺤﻈﺗ ﺄﻨﻫﺍ ﺕﺯﺪﺤﻣ ﺏﺎﻟﺩﻭﺎﻠﻴﺑ ﻭﺮﻓﻮﻓ ﺎﻠﻜﺘﺑ ﻒﺷﺎﻫﺪﺗ ﺥﺭﺎﺌﻃ ﻮﺻﻭﺭ ﻢﻌﻠﻗﺓ ﺐﻣﻼﻘﻃ ﻎﺴﻴﻟ ﻪﻧﺍ ﻮﻬﻧﺎﻛ. ﺝﺬﺒﺗ ﺄﻠﻴﺳ ﺏﺮﻄﻣﺎﻧًﺍ ﻢﻧ ﺄﺣﺩ ﺎﻟﺮﻓﻮﻓ ﻮﻫﻯ ﺖﻣﺭ ﺐﻫﺍ ﻮﻗﺩ ﺄُﻠﺼﻘﺗ ﻊﻠﻴﻫ ﺐﻃﺎﻗﺓ ﻚُﺘﺑ ﻊﻠﻴﻫﺍ ﻡﺮﺑﻯ ﺎﻠﺑﺮﺘﻗﺎﻟ ﻞﻜﻨﻫ ﻞﺳﻭﺀ.

I am not familiar with Arabic script (we are investigating the issue with a native speaker), so there may be something in the text that triggers the error. It is strange, though: I tried the same passage with another parser (UDPipe 2) and the same model, and it is split into 16 sentences.

Many thanks!

@rahonalab rahonalab added the bug label May 13, 2024
@AngledLuffa
Collaborator

Unfortunately this general issue has come up before with Arabic. The dataset we use has a conversion process in which "sentences" are not actually distinguished from each other in any meaningful way. In general, it looks like 900+ of the 6000 training "sentences" are sentences merged together like yours.
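As a rough way to look for such merged "sentences" in a treebank file, something like the following could be used. This is only a sketch under assumptions: it relies on the standard CoNLL-U `# text = ...` comment lines, and it treats any sentence-final mark followed by more text as a sign of merging, which will also flag ellipses and abbreviations. The function names and the regular expression are made up for illustration and are not the method behind the figure quoted above.

```python
import re

# Heuristic (an assumption for this sketch): a sentence-final mark
# (".", "!", "?", or the Arabic question mark U+061F) followed by
# whitespace and further text suggests several sentences were merged.
_INTERNAL_STOP = re.compile(r"[.!?\u061F]\s+\S")

_TEXT_PREFIX = "# text = "


def iter_sentence_texts(conllu_lines):
    """Yield the raw text of each sentence from CoNLL-U comment lines."""
    for line in conllu_lines:
        if line.startswith(_TEXT_PREFIX):
            yield line[len(_TEXT_PREFIX):].rstrip("\n")


def count_suspect_sentences(conllu_lines):
    """Return (suspect, total) counts over the '# text = ' lines."""
    total = 0
    suspect = 0
    for text in iter_sentence_texts(conllu_lines):
        total += 1
        if _INTERNAL_STOP.search(text):
            suspect += 1
    return suspect, total
```

It can be run over an open file handle, e.g. `count_suspect_sentences(open(path, encoding="utf-8"))`, where `path` points at a local copy of a `.conllu` training file.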

UniversalDependencies/UD_Arabic-PADT#3

I haven't really considered what we could do about this, aside from possibly finding a new data source or resplitting the text ourselves. Neither option has much momentum behind it.

It's possible that a post-processing step which splits Arabic text in particular on "." would be a general improvement.
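A minimal sketch of what such a post-processing step might look like, working on plain strings rather than on any particular pipeline's objects. Everything here is an assumption for illustration: the function name `resplit_on_terminators` is hypothetical, the set of terminators (".", "!", "?", and the Arabic question mark U+061F) is a guess, and spaced-out ellipses like ". . ." are deliberately kept together. It would also split after abbreviations, so it is not a drop-in fix.

```python
import re

# Split at whitespace that follows a sentence-final mark, unless the next
# character is another such mark (this keeps ". . ." in one piece).
# The terminator set is an assumption made for this sketch.
_BOUNDARY = re.compile(r"(?<=[.!?\u061F])\s+(?![.!?\u061F])")


def resplit_on_terminators(sentence_text):
    """Split one over-long 'sentence' string into smaller pieces."""
    pieces = _BOUNDARY.split(sentence_text.strip())
    return [piece for piece in pieces if piece]


def resplit_all(sentence_texts):
    """Apply the splitter to every sentence and flatten the result."""
    result = []
    for text in sentence_texts:
        result.extend(resplit_on_terminators(text))
    return result
```

In use, the text of each sentence produced by the tokenizer would be passed through `resplit_all`, and the resulting pieces sent on for tagging and parsing as separate sentences.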

There's also the NYUAD treebank, but we haven't tried using it because it's inconvenient to merge the raw text. I suppose we could try, though, since we do have the LDC corpora needed.

https://github.com/UniversalDependencies/UD_Arabic-NYUAD

UniversalDependencies/UD_Arabic-NYUAD#3

Update: the sentence tokenizer isn't doing much better on the NYUAD treebank either. It seems to have basically the same problem: a lot of "sentences" have "." in the middle of them, so the tokenizer gets confused and doesn't learn to do anything useful.

Contrast this with the most common failure mode of the English tokenizer, where the training data sometimes lacks sentence-final punctuation, so the tokenizer learns to occasionally split in the middle of a sentence, especially when it sees a capitalized name. The English problem we can fix with an upgraded tokenizer model, but I don't see how to fix the Arabic problem unless we get a better data source.

@lancioni

lancioni commented May 16, 2024 via email
