RMP Professor Scraping #49

gursheyss · 2024-04-14T20:03:29Z

To start the scraping work, we're going to implement the RateMyProfessors (RMP) scraper for professors, using RMP's GraphQL API.

First, we need to find out how their API works.

If we go to our school's page, https://www.ratemyprofessors.com/school/881, and open the Network tab in devtools, we can see which requests are sent to fetch the professor data.

After scrolling down and clicking 'show more', the Network tab shows a request being sent to https://www.ratemyprofessors.com/graphql. Clicking on the request and opening the Request tab in the panel that appears shows what data we're sending to this endpoint.

Here we can see that we're sending a query along with a variables object, so we'll need to replicate this for our scraper.

Also notice that, in the Headers tab, there is an Authorization header with the value Basic dGVzdDp0ZXN0. If we decode dGVzdDp0ZXN0 with a [base64 decoder](https://www.base64decode.org/), we get test:test, which is the username:password pair we need to send with the request.
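
The same decoding can be done locally instead of on a website. This is a small sketch using only Python's standard library; the variable names are just for illustration.

```python
import base64

# Token copied from the Authorization header seen in devtools ("Basic <token>").
token = "dGVzdDp0ZXN0"

# HTTP Basic auth encodes "username:password" as base64.
decoded = base64.b64decode(token).decode("utf-8")
username, password = decoded.split(":", 1)

print(decoded)  # test:test
```

When a (username, password) tuple is passed to the auth argument of requests, the library builds this Basic header for us, so the scraper does not need to encode it by hand.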

First, in the root directory of the repo, we'll create two files: scraper/scrapers/rmp.py and scraper/main.py.

In rmp.py, we'll start by importing requests and defining our constants at the top:

import requests

RMP_URL = "https://www.ratemyprofessors.com/graphql"
AUTH_USERNAME = "test"
AUTH_PASSWORD = "test"

Then we'll define a function called get_professors and, inside it, store the query we copied from devtools in a variable.
Since we want all of the professors in one request, we can update this line in the query's first fragment to ask for the first 10000 professors:

teachers(query: $query, first: 10000, after: "") {

We'll also need a variables dictionary with the same structure as the variables object we saw in devtools.

We then send the POST request, passing in all of our data:

response = requests.post(
    RMP_URL,
    json={"query": query, "variables": variables},
    auth=(AUTH_USERNAME, AUTH_PASSWORD),
)
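
Before using the response, it helps to check that the request really succeeded, since a GraphQL server can report problems in an errors field of the body. Below is a minimal sketch of pulling the professor records out of the decoded JSON. The helper name extract_professors and the path data → search → teachers → edges → node are assumptions, not something confirmed in this issue; compare them with the Response tab in devtools before relying on them.

```python
from typing import Any


def extract_professors(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the professor records from a decoded GraphQL response body.

    ASSUMPTION: records live at data.search.teachers.edges[*].node.
    Adjust the path if the real response is shaped differently.
    """
    # GraphQL reports failures in a top-level "errors" list.
    if payload.get("errors"):
        raise RuntimeError(f"GraphQL errors: {payload['errors']}")

    edges = payload["data"]["search"]["teachers"]["edges"]
    return [edge["node"] for edge in edges]
```

Inside get_professors, this could be used by calling response.raise_for_status() to catch HTTP-level failures and then returning extract_professors(response.json()).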

If the request succeeds, we'll just print the results to the console for now.
Then we'll call it from scraper/main.py:

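from scrapers import rmp


def main():
    # get_professors() should return the professor records (or nothing on failure)
    rmp_professors = rmp.get_professors()
    if rmp_professors:
        print("RMP Professors:")
        for professor in rmp_professors:
            print(professor)


# Only run the scraper when this file is executed directly, not when imported
if __name__ == "__main__":
    main()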

If everything went well, something like this should be printed to the console:

{'avgDifficulty': 3.8, 'avgRating': 3.1, 'department': 'Psychology', 'firstName': 'Michael', 'lastName': 'Dillinger', 'numRatings': 48, 'wouldTakeAgainPercent': 0}
{'avgDifficulty': 5, 'avgRating': 1.8, 'department': 'Business', 'firstName': 'Ellen', 'lastName': 'Zheng', 'numRatings': 3, 'wouldTakeAgainPercent': -1}
{'avgDifficulty': 0, 'avgRating': 0, 'department': 'Art', 'firstName': 'Ryan', 'lastName': 'Eways', 'numRatings': 0, 'wouldTakeAgainPercent': -1}
{'avgDifficulty': 0, 'avgRating': 0, 'department': 'Mathematics', 'firstName': 'Antonio', 'lastName': 'Punzo', 'numRatings': 0, 'wouldTakeAgainPercent': -1}
{'avgDifficulty': 2.5, 'avgRating': 4.5, 'department': 'Chicano Studies', 'firstName': 'Estevan', 'lastName': 'Azcona', 'numRatings': 2, 'wouldTakeAgainPercent': 100}
{'avgDifficulty': 0, 'avgRating': 0, 'department': 'Physics', 'firstName': 'Zak ', 'lastName': 'Espley', 'numRatings': 0, 'wouldTakeAgainPercent': 100}
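
The sample output has a couple of rough edges: a trailing space in 'Zak ' and a wouldTakeAgainPercent of -1 on some records, which looks like a "no data" placeholder (that reading is an assumption, not something RMP documents here). If we want to tidy records before storing them, a small optional sketch could look like this; clean_professor is a hypothetical helper, not part of the scraper described above.

```python
from typing import Any


def clean_professor(record: dict[str, Any]) -> dict[str, Any]:
    """Return a tidied copy of one professor record."""
    cleaned = dict(record)  # copy so the original record is left untouched

    # Remove stray whitespace around text fields such as 'Zak '.
    for key in ("firstName", "lastName", "department"):
        if isinstance(cleaned.get(key), str):
            cleaned[key] = cleaned[key].strip()

    # ASSUMPTION: -1 means "no data", so store it as None instead.
    if cleaned.get("wouldTakeAgainPercent") == -1:
        cleaned["wouldTakeAgainPercent"] = None

    return cleaned
```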

gursheyss added the backend label Apr 14, 2024

jonathanguven self-assigned this Apr 17, 2024

jonathanguven linked a pull request May 19, 2024 that will close this issue

Professor Scraper #53

Open

gursheyss commented Apr 14, 2024 • edited by jonathanguven
