Failed to get the article :( #19

Open
luna533 opened this issue Aug 5, 2024 · 16 comments

@luna533

luna533 commented Aug 5, 2024

Attempting to open any article results in this error:
"Failed to get the article :( Sorry - working on it! Invalid or incomplete HTML."

@codeminkey

Additionally, the "Original source (on modern site)" link is broken. The "Original source" URL has the same wrapper as the front page link, which sends it back to article.php (which tries to send it through "readability" again), and, of course, it fails again. As a workaround, you can edit the URL in your browser's address bar to remove the wrapper (which looks like this: "http://68k.news/article.php?loc=US&a=") and reload. I've been doing this for a few weeks now (it's getting a little old). Now that I've found the code, I'm gonna try to create a fix.
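
For anyone who wants to script the manual workaround, here is a minimal Python sketch of the wrapper removal. It is not taken from the 68k.news source. The function name `strip_article_wrapper` is made up for illustration, and it assumes the wrapped URL is appended as-is as the value of the `a` parameter, which is the last parameter in the query string.

```python
WRAPPER_MARKER = "article.php?"


def strip_article_wrapper(url: str) -> str:
    """Return the wrapped URL if `url` goes through article.php, else `url` unchanged."""
    _, found, query = url.partition(WRAPPER_MARKER)
    if not found:
        return url  # not a wrapped link
    # Assumption: `a` is the last query parameter and its value is the
    # wrapped URL, appended without percent-encoding.
    if query.startswith("a="):
        return query[len("a="):]
    _, found, rest = query.partition("&a=")
    return rest if found else url


wrapped = "http://68k.news/article.php?loc=US&a=https://news.google.com/rss/articles/abc?oc=5"
print(strip_article_wrapper(wrapped))
```

Links that do not contain the wrapper are returned unchanged, so the function is safe to apply to every link.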

@Hodapp87

I have been running into this for a while too.

@Pythone

Pythone commented Sep 22, 2024

Makes me wonder if news sites are against this because they love their freakin' DRM/anti-adblock. I have a feeling they are blocking 68k.news and FrogFind on purpose.

@luvcie

luvcie commented Sep 30, 2024

I'm having the same issue :(

@Pythone

Pythone commented Oct 1, 2024

Found the issue.
Google News forbids Googlebot from accessing the 'rss' folder in the site's robots.txt.
If 68k.news uses (or imitates) Googlebot, its requests are not allowed.
Robots.txt is located here: https://news.google.com/robots.txt
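
One way to test a theory like this is Python's standard `urllib.robotparser`. The sketch below runs against made-up rules, not a copy of the real Google News file, and the helper name `is_allowed` is hypothetical. It only shows how a per-user-agent `Disallow` rule is evaluated.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /rss/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://news.google.com/rss/articles/abc"
print(is_allowed(SAMPLE_ROBOTS_TXT, "Googlebot", url))
print(is_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", url))
```

To check the live file instead, `RobotFileParser.set_url()` followed by `read()` fetches and parses it. Keep in mind that robots.txt is advisory, so a matching rule alone would not explain a server-side failure.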

@Pythone

Pythone commented Oct 1, 2024

The best route is to bypass Google News by manually decoding the URL.

@jibsaramnim

I've tried looking into this a few times. It seems like Google has made some changes over time that break the current implementation, with no real clean way to implement a fix that I can see.

In short:

  • The current implementation assumes there is a hyperlink (<a href...) on the Google News article page that points to the real article page. This is no longer the case.
  • In a previous iteration of Google News, the long hash-looking part of the article link was a basic base64-encoded version of the actual article URL, so a simple base64_decode() would have given us the URL. However:
  • More recently, Google seems to have obfuscated things further, requiring a (rate-limited) POST request to a decidedly internal-use-only API, with some obfuscated-looking data, to get the real URL based on the hash found in the article URL.
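
To illustrate the second point, here is a rough Python sketch of what decoding the older link format could look like. It assumes the ID was URL-safe base64 whose decoded bytes contained the article URL in plain text, possibly surrounded by a few framing bytes. The function name and the sample token are made up for illustration, and this does not handle the newer format that needs the internal API.

```python
import base64
import re
from typing import Optional


def decode_legacy_article_id(article_id: str) -> Optional[str]:
    """Try to pull a plain-text URL out of a base64-encoded article ID."""
    # Restore any padding that was stripped from the ID in the link.
    padded = article_id + "=" * (-len(article_id) % 4)
    try:
        raw = base64.urlsafe_b64decode(padded)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    # Assumption: the URL sits in the payload as printable ASCII.
    match = re.search(rb"https?://[\x21-\x7e]+", raw)
    return match.group(0).decode("ascii") if match else None


# Synthetic token built here for the demo; the framing bytes are arbitrary.
payload = b"\x08\x13" + b"https://example.com/story-1" + b"\xd2\x01"
token = base64.urlsafe_b64encode(payload).decode("ascii").rstrip("=")
print(decode_legacy_article_id(token))
```

If the decoded bytes contain no URL, the function returns `None`, which is where a fallback such as showing the Google News link would go.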

There are some existing code snippets out there (e.g. this one) that are successfully able to get the decoded URL, but as this is dependent on an internal-only "API" and seems to be rather actively rate limited too, I fear that attempting to implement this is at best just going to give us a very short-lived success story.


Is there perhaps any other source of news that can be considered that has an API or at least a URL structure that isn't changing as actively as Google News? It just seems like Google is very actively attempting to prevent outside parties from scraping or otherwise using them as a source.

@codeminkey

codeminkey commented Oct 4, 2024

Not to contradict what anyone has said, and admittedly I don't know the code very well, but I want to point out that 68k IS getting the correct URL from GN. When we get the error page (article.php) it has an embedded link to the "Original page", and both that link and the one currently in the URL bar at that time contain the correct URL, but it is prefixed with "http://68k.news/article.php?loc=US&a=". I have been systematically editing the URL to remove that prefix and then the original page loads perfectly (it's a pain but it works and it's the only way 68k is currently usable to me).

I looked at article.php and it is sending that same URL (i.e. the same variable) both to the "reader mode" module and to the "Original page" link. I have flirted with the idea of writing a hack that removes the prefix (if present), and I think that would make things work again. However, I want to research the source of the prefix+URL before doing anything and just haven't found the time.

@codeminkey

I should add that what I just said suggests the issue is a lot simpler than some are making it out to be (i.e. it's not Google; the added prefix clearly comes from 68k).
I will hopefully find some time to locate the source of the prefix+URL and look at the recent commits to whatever module it's in. I'll bet there was a bad commit that caused this.

@jibsaramnim

> but I want to point out that 68k IS getting the correct URL from GN. When we get the error page (article.php) it has an embedded link to the "Original page" (...)

What you're referring to here is the Google News link, not the original URL. This, as long as you open it in a browser with JavaScript enabled, redirects to the real article URL. It's that last link that 68k news needs as it tries to parse the original article and show it in a text-only way for old machines/browsers, just like its main list view.

> (...) but it is prefixed with "http://68k.news/article.php?loc=US&a=". I have been systematically editing the URL to remove that prefix and then the original page loads perfectly (it's a pain but it works and it's the only way 68k is currently usable to me).

What you're describing here is simply visiting that original page directly, which on a modern (enough) browser will work fine of course. But for those that visit 68k news on vintage hardware or simply want a text-only experience, this won't do what they're looking for.

> I will hopefully find some time to locate the source of the prefix+URL and look at the recent commits to whatever module it's in. I'll bet there was a bad commit that caused this.

The prefix, as you call it, is a necessary part of how 68k news works, as it (originally, before Google News changed things up) would render a text-only version of the original article contents. Without this people on vintage hardware/browsers can only reliably view the 68k homepage that lists out links to articles, but not be able to actually read each article — unless their browser does support whatever tech stack the particular website in question uses, of course.

Now that Google has changed things in a way that there is no obvious solution (yet) on how to make it work as it used to again, it sadly just shows a failed message. I think a useful addition here would be to have the link to the Google News article URL right there alongside the error message, but the bigger issue of course is that we'd like 68k news to go back to being able to actually fetch, parse, and render these articles again.

@Pythone

Pythone commented Oct 5, 2024

In reply to @jibsaramnim's summary above:

I think it is possible to reverse engineer the "internal" API. Either that or we have to move away from Google News.

@Pythone

Pythone commented Oct 5, 2024

I've figured out a possible alternative: FreshRSS

@Pythone

Pythone commented Oct 5, 2024

Or Tiny Tiny RSS.
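
The appeal of a plain RSS source is that each item's `<link>` normally points straight at the article, so there is nothing to decode. A minimal sketch of reading such a feed with Python's standard library follows. The feed content and the function name `list_articles` are hypothetical, and a real aggregator's output may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used only for this demo.
SAMPLE_FEED = """\
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first-story</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second-story</link>
    </item>
  </channel>
</rss>
"""


def list_articles(feed_xml: str) -> list[tuple[str, str]]:
    """Return (title, link) pairs for every <item> that has a link."""
    root = ET.fromstring(feed_xml)
    articles = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        if link:
            articles.append((title, link))
    return articles


for title, link in list_articles(SAMPLE_FEED):
    print(title, "->", link)
```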

@codeminkey

@jibsaramnim:

> What you're referring to here is the Google News link, not the original URL.

You're right. Sorry for the confusion. I read the source too hastily.

It would be helpful to remove the article.php wrapper from the "Original source" link, at least in the interim. It won't help folks on limited devices, but it will help those of us who use 68k just to avoid the bloat that Google adds.

Incidentally, FWIW, I tried loading a Google News link in dev tools and noticed a couple of messages referring to ad insertion. I hadn't noticed inserted ads before, but I run uBlock Origin. I retested with it disabled and a giant banner (possibly video, I don't recall) appeared at the top of the page. From this, I can speculate that one of the reasons (maybe the main one?) behind their recent changes is to protect their ad empire (like their recent actions with YouTube).

@Pythone

Pythone commented Oct 13, 2024

In reply to @codeminkey's comment above:

Also, news publishers might see this site as "one giant ad-blocker", even though it is made for old technology, and may have pushed Google to block it.

If FrogFind uses DuckDuckGo, 68k.news could possibly use DuckDuckGo News (if DDG News existed).
Best solution: use a self-hosted alternative, or one that promises to keep allowing this kind of use.

@Pythone

Pythone commented Oct 28, 2024

I have a temporary solution: move from Google News to an aggregator hosted on GH Pages. Paid hosting or self-hosting would be a more permanent solution.
