[BUG] Article titles truncated at - #632

Open
everonegraham opened this issue Apr 6, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@everonegraham

Describe the bug
I'm seeing instances where, if - is in the title of the article, one of two things happens:

  1. it only returns the characters before -
  2. it only returns the characters after -

To Reproduce
Steps to reproduce the behavior, please post any code you used and the website you tried to parse/process:

  1. Use an article with - in its title.
  2. Call articles.title.
  3. See result.

Expected behavior
Expected the full title of the article.

Screenshots

example for point 1:
url: https://www.jamstockex.com/transjamaican-highway-limited-tjh-trade-disclosure-3/
screenshot: image

example for point 2:
url: https://www.treblezine.com/plus-minus-announce-first-new-album-in-a-decade-further-afield/
screenshot: image

Extra example:
url: https://www.post-gazette.com/business/healthcare-business/2024/02/28/university-of-pittsburgh-drug-development-lansing-taylor-animal-testing/stories/202402280069
screenshot: image

System information

  • OS: [Windows]
  • Python version [3.12.1]
  • Library version [0.9.3.1]


everonegraham added the bug label Apr 6, 2024
@AndyTheFactory
Owner

Hi,

This behavior is "as expected", but I agree it's not optimal.

The problem it addresses is that some sites prepend (or append) the website name to the title.
In your case, for instance, the <title> tag for the Jamaica Stock Exchange page is:

<title>TransJamaican Highway Limited (TJH)- Trade Disclosure - Jamaica Stock Exchange</title>
You can see the site name at the end, delimited with -.
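
To illustrate why a delimiter-based cleanup can cut a title short, here is a minimal sketch of one such heuristic: split the <title> text on - and keep the longest piece. This is an assumption for illustration only, not the library's actual code; `longest_segment` is a hypothetical helper name.

```python
def longest_segment(raw_title: str, delimiter: str = "-") -> str:
    """Hypothetical heuristic: split on the delimiter and keep the longest piece."""
    # Split on every occurrence of the delimiter and trim whitespace.
    pieces = [piece.strip() for piece in raw_title.split(delimiter)]
    # Drop empty pieces (e.g. from a leading or trailing delimiter).
    pieces = [piece for piece in pieces if piece]
    # Fall back to the untouched title if nothing usable remains.
    return max(pieces, key=len) if pieces else raw_title.strip()


raw = "TransJamaican Highway Limited (TJH)- Trade Disclosure - Jamaica Stock Exchange"
print(longest_segment(raw))  # prints "TransJamaican Highway Limited (TJH)"
```

With this heuristic the site name is dropped, but so is the "Trade Disclosure" part of the real title, which matches the symptom in point 1 of the report.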

Having said this, I agree that this can be improved, and I will have a look at some possible options.

One would be to prioritize the og:title tag, which should be, in my opinion, more accurate.

But I will need to test on a set of websites to see whether this breaks some other behavior.
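
A rough sketch of what "prioritize og:title" could look like, using only the Python standard library. The class and function names (`TitleCandidates`, `pick_title`) and the sample HTML are made up for illustration; this is not the library's implementation, and real pages may need more careful handling.

```python
from html.parser import HTMLParser


class TitleCandidates(HTMLParser):
    """Collects the og:title meta content and the <title> text from a page."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.title_text = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "og:title":
            self.og_title = attr.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title_text += data


def pick_title(html: str) -> str:
    """Prefer og:title when present and non-empty, else fall back to <title>."""
    parser = TitleCandidates()
    parser.feed(html)
    parser.close()
    og_title = (parser.og_title or "").strip()
    return og_title or parser.title_text.strip()


# Made-up sample page, not taken from any of the sites in this issue.
sample = (
    "<html><head>"
    "<title>Some Article - With A Dash - Example Site</title>"
    '<meta property="og:title" content="Some Article - With A Dash">'
    "</head><body></body></html>"
)
print(pick_title(sample))  # prints "Some Article - With A Dash"
```

The fallback to <title> matters because not every site sets og:title, which is part of why this would need testing across a set of websites.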

I will keep you posted.

@everonegraham
Author

No problem at all, thanks for the reply and I look forward to the possible fix. :D
