RMP Professor Scraping #49

Open
8 tasks done
gursheyss opened this issue Apr 14, 2024 · 0 comments · May be fixed by #53
gursheyss commented Apr 14, 2024

To kick off the scraping work, we're going to implement the RMP scraper for professors, using their GraphQL API.

First, we need to figure out how their API works.

If we go to our school's page, https://www.ratemyprofessors.com/school/881, and open the Network tab in devtools, we can see which requests the page sends to load the professor data.

After scrolling down and clicking 'show more', the Network tab shows a request being sent to https://www.ratemyprofessors.com/graphql. Clicking on that request and opening its Request tab shows what data we're sending to this endpoint.
[image: request payload in the devtools Request tab]

Here we can see that we're sending a query along with a variables object, so our scraper will need to replicate both.

Also notice that in the Headers tab there is an Authorization header with the value Basic dGVzdDp0ZXN0. If we decode dGVzdDp0ZXN0 using [base64](https://www.base64decode.org/), we get test:test, which is the username:password pair we need to send with the request.
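
The same decoding can be done locally instead of through a website. This is a minimal sketch using only the standard library; the helper names `decode_basic_auth` and `encode_basic_auth` are made up for illustration and are not part of the planned scraper. It also shows the reverse direction, which is roughly what passing a username and password pair as HTTP Basic auth produces on the wire.

```python
import base64

# Value copied from the Authorization header seen in devtools.
HEADER_VALUE = "Basic dGVzdDp0ZXN0"


def decode_basic_auth(header: str) -> tuple[str, str]:
    """Split a 'Basic <base64>' header value into (username, password)."""
    scheme, _, token = header.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError(f"not a Basic auth header: {header!r}")
    decoded = base64.b64decode(token).decode("utf-8")
    # Only the first ':' separates the username from the password.
    username, _, password = decoded.partition(":")
    return username, password


def encode_basic_auth(username: str, password: str) -> str:
    """Build the header value that HTTP Basic auth sends for these credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


print(decode_basic_auth(HEADER_VALUE))    # ('test', 'test')
print(encode_basic_auth("test", "test"))  # Basic dGVzdDp0ZXN0
```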

  • First, in the root directory of the repo, we'll create the following files: scraper/scrapers/rmp.py and scraper/main.py

  • In rmp.py, we'll start by importing requests and defining our constants at the top:
    RMP_URL = "https://www.ratemyprofessors.com/graphql"
    AUTH_USERNAME = "test"
    AUTH_PASSWORD = "test"

  • Then, we'll define a function called get_professors and store the query copied from devtools in a variable.

  • Since we want all of the professors in one request, we can update this text in the first fragment of the query to give us the first 10000 professors: teachers(query: $query, first: 10000, after: "") {

  • We'll also define a variables dictionary with the same structure as the variables object in the request.

  • We then send the POST request, passing in all of our data:

    response = requests.post(
        RMP_URL,
        json={"query": query, "variables": variables},
        auth=(AUTH_USERNAME, AUTH_PASSWORD),
    )
  • If it's successful, we'll return the professors and just print them to the console for now
  • Then we'll call it from our scraper/main.py:
from scrapers import rmp

def main():
    rmp_professors = rmp.get_professors()
    if rmp_professors:
        print("RMP Professors:")
        for professor in rmp_professors:
            print(professor)

# Only run the scraper when this file is executed directly.
if __name__ == "__main__":
    main()
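
For the "if it's successful" step, the professor records have to be pulled out of the GraphQL JSON before main.py can loop over them. Here is a minimal sketch of that step. It assumes the records are nested under data, search, teachers, edges, node; that shape and the helper name `extract_professors` are assumptions and are not confirmed by this issue, so check the Response tab in devtools before relying on them.

```python
from typing import Any


def extract_professors(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the professor records from a decoded GraphQL response body.

    ASSUMPTION: records live at data -> search -> teachers -> edges[i] -> node.
    """
    # GraphQL servers report query problems in a top-level "errors" list.
    if payload.get("errors"):
        raise RuntimeError(f"GraphQL errors: {payload['errors']}")
    data = payload.get("data") or {}
    search = data.get("search") or {}
    teachers = search.get("teachers") or {}
    edges = teachers.get("edges") or []
    return [edge["node"] for edge in edges if edge.get("node")]


# Hand-built sample in the assumed shape, using two records from this issue.
sample = {
    "data": {
        "search": {
            "teachers": {
                "edges": [
                    {"node": {"firstName": "Michael", "lastName": "Dillinger", "numRatings": 48}},
                    {"node": {"firstName": "Ellen", "lastName": "Zheng", "numRatings": 3}},
                ]
            }
        }
    }
}

for professor in extract_professors(sample):
    print(professor)
```

Inside get_professors, this could be used as `response.raise_for_status()` followed by `return extract_professors(response.json())`, so that main.py receives a plain list of dictionaries.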

If everything went well, something like this should be printed to the console:

{'avgDifficulty': 3.8, 'avgRating': 3.1, 'department': 'Psychology', 'firstName': 'Michael', 'lastName': 'Dillinger', 'numRatings': 48, 'wouldTakeAgainPercent': 0}
{'avgDifficulty': 5, 'avgRating': 1.8, 'department': 'Business', 'firstName': 'Ellen', 'lastName': 'Zheng', 'numRatings': 3, 'wouldTakeAgainPercent': -1}
{'avgDifficulty': 0, 'avgRating': 0, 'department': 'Art', 'firstName': 'Ryan', 'lastName': 'Eways', 'numRatings': 0, 'wouldTakeAgainPercent': -1}
{'avgDifficulty': 0, 'avgRating': 0, 'department': 'Mathematics', 'firstName': 'Antonio', 'lastName': 'Punzo', 'numRatings': 0, 'wouldTakeAgainPercent': -1}
{'avgDifficulty': 2.5, 'avgRating': 4.5, 'department': 'Chicano Studies', 'firstName': 'Estevan', 'lastName': 'Azcona', 'numRatings': 2, 'wouldTakeAgainPercent': 100}
{'avgDifficulty': 0, 'avgRating': 0, 'department': 'Physics', 'firstName': 'Zak ', 'lastName': 'Espley', 'numRatings': 0, 'wouldTakeAgainPercent': 100}
@jonathanguven jonathanguven self-assigned this Apr 17, 2024
@jonathanguven jonathanguven linked a pull request May 19, 2024 that will close this issue