congress.gov returning 403 for rvest::read_html() #1

Open
judgelord opened this issue Mar 24, 2024 · 8 comments
Labels: bug, help wanted

Comments

@judgelord
Owner

It looks like congress.gov is blocking whatever protocol rvest uses. I'm not sure what to do about this and don't have time to dig in right now, but I will try to figure it out.

> rvest::read_html("https://www.un.org/en/")
{html_document}
<html dir="ltr" lang="en">
[1] <head profile="http://www.w3.org/1999/xhtml/vocab">\n<meta charset="utf-8">\n<meta http- ...
[2] <body class="html front not-logged-in one-sidebar sidebar-first page-node i18n-en">\r\n\ ...
> rvest::read_html("https://www.congress.gov/")
Error in open.connection(x, "rb") : HTTP error 403.
@ReneRejonP

Hi @judgelord,
Thanks for developing this package! It's great!
I'm trying to use it to scrape the Congressional Record and do some text mining for an academic article. Unfortunately, I ran into this same issue and have no idea how to fix it. I would appreciate any updates! Thanks again for developing this!

@judgelord
Owner Author

If you want to help, you could test out alternative web scraping packages in R. I can replace the rvest method if another method works.
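
One way to test an alternative is to fetch the page with httr and hand the returned HTML to rvest for parsing. This is only a rough sketch: the helper name fetch_html and the user-agent string are made up for illustration, and it is an assumption (not confirmed anywhere in this thread) that congress.gov treats requests differently depending on the client or user agent.

```r
# Hypothetical helper (not part of this package): download a page with httr,
# then parse the returned HTML with rvest.
# Assumption: sending an explicit user agent changes how the server responds.
fetch_html <- function(url, agent = "congressional-record-test (R; httr)") {
  resp <- httr::GET(url, httr::user_agent(agent))
  # Raise an R error for 4xx/5xx responses such as the 403 reported above.
  httr::stop_for_status(resp)
  html_text <- httr::content(resp, as = "text", encoding = "UTF-8")
  rvest::read_html(html_text)
}

# Example call (needs network access and the httr and rvest packages):
# page <- fetch_html("https://www.congress.gov")
```

If this succeeds where rvest::read_html() on the bare URL fails, that would suggest the block depends on how the request is made rather than on the URL itself.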

@ReneRejonP

For sure! I’ll spend a few more hours on this next week. If I find another method, I’ll let you know!

@judgelord added the bug and help wanted labels on Apr 24, 2024
@judgelord
Owner Author

Update: it seems that congress.gov is no longer blocking us.

@Nuohai-muxi

I wrote a Python script to replace the scraper.

@judgelord
Owner Author

> I wrote a Python script to replace the scraper.

@Nuohai-muxi could you post a link to a repo?

@judgelord
Owner Author

FWIW, rvest::read_html("https://www.congress.gov") works. If this package's functions are returning 403 errors, it may be due to trailing slashes at the end of URLs, which seem to make congress.gov return a 403. I will investigate.
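
A quick way to check this would be to strip the final slash from a URL before requesting it and compare the status codes. The sketch below assumes the slashes in question are the trailing "/" seen in the failing call earlier in this thread. Both helper names are hypothetical and are not functions from this package.

```r
# Hypothetical helper: drop any trailing slashes from one or more URLs.
strip_trailing_slash <- function(url) {
  sub("/+$", "", url)
}

# Hypothetical helper: return the HTTP status code for each URL.
# Needs network access and the httr package, so it is not called here.
status_for <- function(urls) {
  vapply(urls, function(u) httr::status_code(httr::GET(u)), integer(1))
}

urls <- c("https://www.congress.gov/", "https://www.congress.gov")
cleaned <- strip_trailing_slash(urls)
# Both elements of `cleaned` are now "https://www.congress.gov"

# To compare the two forms directly:
# status_for(urls)
```

If the slash-terminated form returns 403 and the cleaned form returns 200, normalizing URLs before each request may be enough to fix the package's functions.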

@Nuohai-muxi

@judgelord https://github.com/Nuohai-muxi/scraper-for-US-congress
