congress.gov returning 403 for `rvest::read_html()` #1

judgelord · 2024-03-24T19:06:24Z

It looks like congress.gov is blocking whatever protocol rvest uses. I'm not sure what to do about this and don't have time to dig in right now, but I will try to figure it out.

> rvest::read_html("https://www.un.org/en/")
{html_document}
<html dir="ltr" lang="en">
[1] <head profile="http://www.w3.org/1999/xhtml/vocab">\n<meta charset="utf-8">\n<meta http- ...
[2] <body class="html front not-logged-in one-sidebar sidebar-first page-node i18n-en">\r\n\ ...
> rvest::read_html("https://www.congress.gov/")
Error in open.connection(x, "rb") : HTTP error 403.

ReneRejonP · 2024-04-23T01:06:57Z

Hi @judgelord,
Thanks for developing this package! It's great!
I'm trying to use it to scrape the Congressional Record and do some text mining for an academic article. Unfortunately, I ran into the same issue and have no idea how to fix it. I would appreciate any updates! Thanks again for developing this!

judgelord · 2024-04-23T13:44:02Z

If you want to help, you could test out alternative web scraping packages in R. I can replace the rvest method if another method works.

ReneRejonP · 2024-04-23T21:11:12Z

For sure! I’ll spend a few more hours on this next week. If I find another method, I’ll let you know!

judgelord · 2024-05-31T18:05:56Z

Update: it seems that congress.gov is no longer blocking us.

Nuohai-muxi · 2024-10-07T06:36:53Z

I wrote a Python script to replace the scraper.

judgelord · 2024-10-07T15:14:10Z

> I wrote a python code to substitute the scraper.

@Nuohai-muxi could you post a link to a repo?

judgelord · 2024-10-07T15:15:40Z

FWIW, `rvest::read_html("https://www.congress.gov")` works. If this package's functions are returning 403 errors, it may be due to backslashes at the end of URLs, which seem to make congress.gov return a 403. I will investigate.
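
A minimal sketch of a possible workaround, assuming the trailing separator really is what triggers the 403 (this is not yet confirmed). `strip_trailing_slashes()` is a hypothetical helper, not a function in this package; it removes any trailing `/` or `\` characters before the URL is requested.

```r
# Hypothetical helper (not part of this package): drop trailing "/" or "\"
# characters from a URL. Assumes the trailing separator causes the 403.
strip_trailing_slashes <- function(url) {
  # perl = TRUE so that [/\\] unambiguously means "slash or backslash"
  sub("[/\\\\]+$", "", url, perl = TRUE)
}

urls <- c(
  "https://www.congress.gov/",
  "https://www.congress.gov\\",
  "https://www.congress.gov"
)
clean_urls <- strip_trailing_slashes(urls)
print(clean_urls)

# The cleaned URL can then be passed to rvest (needs network access):
# page <- rvest::read_html(clean_urls[1])
```

If the 403s persist after cleaning the URLs, the trailing separator is probably not the cause.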

Nuohai-muxi · 2024-10-08T00:32:48Z

@judgelord https://github.com/Nuohai-muxi/scraper-for-US-congress

judgelord mentioned this issue Mar 24, 2024

HTTP error 403 judgelord/cr#2

Open

judgelord mentioned this issue Apr 24, 2024

Rvest issues #2

Open

judgelord added the bug and help wanted labels Apr 24, 2024
