stefan-stein/Wikipedia_webcrawling

Wikipedia Web Crawling

This project crawls the hyperlink neighborhood of the Wikipedia entry for "Squirrel". Why squirrel? Because squirrels are awesome and, as we shall see, have many unexpected connections to reveal. The goal is to scrape all the Wikipedia pages in the two-step neighborhood of Squirrel, i.e. all pages that can be reached from the Squirrel page with one or two clicks.

We will be using rvest to do the scraping of the webpages. rvest is a great package for obtaining information from a specific website. However, it does not naturally crawl, i.e. automatically move from one webpage to the next. Hence, a major part of the code revolves around teaching rvest to crawl, so to speak. This mostly involves a while-loop and some regular expressions. Finally, we will make some pretty graphs using the igraph and diagram packages.
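
The while-loop-plus-regular-expression idea can be sketched roughly as follows. This is not the code from wikipedia_webcrawling.R: the names keep_article_links, fetch_links_rvest and crawl, the URL prefix, and the link filter are all assumptions made for illustration. The fetcher is passed in as an argument so the loop can be tried on a toy link table without touching the network.

```r
# Illustrative sketch only; every function name here is hypothetical.

# Keep links of the form "/wiki/Title" and drop namespace pages such as
# "File:..." or "Help:..." as well as links that contain a fragment.
keep_article_links <- function(hrefs) {
  hrefs <- hrefs[!is.na(hrefs)]
  unique(sub("^/wiki/", "", grep("^/wiki/[^:#]+$", hrefs, value = TRUE)))
}

# Assumed default fetcher: read one article with rvest, return its link targets.
fetch_links_rvest <- function(topic) {
  page <- rvest::read_html(paste0("https://en.wikipedia.org/wiki/", topic))
  Sys.sleep(1)  # pause after every request
  keep_article_links(rvest::html_attr(rvest::html_elements(page, "a"), "href"))
}

# Breadth-first crawl: pages at depth < max_depth are fetched and expanded.
crawl <- function(start, max_depth = 2, fetch_links = fetch_links_rvest) {
  queue <- data.frame(topic = start, depth = 0, stringsAsFactors = FALSE)
  visited <- character(0)
  edges <- data.frame(from = character(0), to = character(0),
                      stringsAsFactors = FALSE)
  while (nrow(queue) > 0) {
    current <- queue[1, ]
    queue <- queue[-1, , drop = FALSE]
    if (current$topic %in% visited) next
    visited <- c(visited, current$topic)
    if (current$depth >= max_depth) next  # outermost pages are not expanded
    targets <- fetch_links(current$topic)
    if (length(targets) == 0) next
    edges <- rbind(edges, data.frame(from = current$topic, to = targets,
                                     stringsAsFactors = FALSE))
    queue <- rbind(queue, data.frame(topic = targets,
                                     depth = current$depth + 1,
                                     stringsAsFactors = FALSE))
  }
  list(nodes = visited, edges = edges)
}

# Toy usage with an in-memory link table instead of real requests.
toy <- list(Squirrel = c("Rodent", "Nut"), Rodent = c("Mammal", "Squirrel"),
            Nut = "Tree", Mammal = "Animal", Tree = character(0))
fetch_toy <- function(topic) {
  if (is.null(toy[[topic]])) character(0) else toy[[topic]]
}
result <- crawl("Squirrel", max_depth = 2, fetch_links = fetch_toy)
```

Note that this sketch only records links found on pages it expands. To also capture the links between the outermost pages, those pages would have to be fetched too and their links filtered down to topics already in the neighborhood.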

A general word of caution: rate limits and robots.txt

The code and techniques discussed here are very much general purpose. When you embark on your own web crawling quest, you should keep two things in mind:

  • Too many requests in too little time will most likely get your IP blocked. After each webpage request use something like Sys.sleep(1).

  • In general, before rvesting a website, check its rules for web crawling at website-you-want-to-crawl.{com, org, co.uk,...}/robots.txt. For example,

    User-agent: *
    Disallow: /

    means "all robots, stay away from all directories", and you should refrain from crawling that website.
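
Both precautions can be sketched in a few lines of base R. The helpers blocks_all_robots and polite are hypothetical names, and the robots.txt check is deliberately simplified: it only looks for a blanket "Disallow: /" in a group that applies to all user agents, and ignores Allow rules, wildcards and path-specific rules.

```r
# Simplified check: does a "User-agent: *" group contain "Disallow: /"?
# Not a full robots.txt parser; treat a FALSE result with care.
blocks_all_robots <- function(robots_lines) {
  lines <- trimws(sub("#.*$", "", robots_lines))  # strip comments
  lines <- lines[nzchar(lines)]
  in_star_group <- FALSE
  prev_was_agent <- FALSE
  for (line in lines) {
    field <- tolower(trimws(sub(":.*$", "", line)))
    value <- trimws(sub("^[^:]*:", "", line))
    if (field == "user-agent") {
      # Consecutive User-agent lines belong to the same group.
      in_star_group <- (prev_was_agent && in_star_group) || value == "*"
      prev_was_agent <- TRUE
    } else {
      if (field == "disallow" && in_star_group && value == "/") return(TRUE)
      prev_was_agent <- FALSE
    }
  }
  FALSE
}

# Wrap any fetching function so that it pauses after each call.
polite <- function(fetch, delay = 1) {
  function(...) {
    result <- fetch(...)
    Sys.sleep(delay)
    result
  }
}

# Possible usage (needs network access, so it is left commented out):
# robots <- readLines("https://en.wikipedia.org/robots.txt")
# if (blocks_all_robots(robots)) stop("Crawling is not welcome here.")
```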

The code

The script wikipedia_webcrawling.R contains the code for the actual web crawling. The script post_processing.R contains the code for cleaning up and exploring the scraped data. It has two main functions: find_path finds all the paths from Squirrel to a given target topic and displays the result in data frame format; graph_path is the graphical version of find_path and displays all the paths as a graph.
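
To give a feel for what a path lookup of this kind involves, here is a minimal stand-in. It is not the find_path from post_processing.R, whose arguments and output format may differ. The function name find_path_sketch, the from/to edge-list layout and the step0/step1/step2 column names are assumptions.

```r
# Hypothetical stand-in for find_path, working on an edge list with columns
# "from" and "to". Returns every path of one or two clicks from start to target.
find_path_sketch <- function(edges, target, start = "Squirrel") {
  from <- as.character(edges$from)
  to <- as.character(edges$to)
  empty <- data.frame(step0 = character(0), step1 = character(0),
                      step2 = character(0), stringsAsFactors = FALSE)

  # One click: start -> target
  one <- if (any(from == start & to == target)) {
    data.frame(step0 = start, step1 = target, step2 = NA_character_,
               stringsAsFactors = FALSE)
  } else empty

  # Two clicks: start -> mid -> target
  first_step <- unique(to[from == start])
  mids <- unique(from[to == target & from %in% first_step])
  mids <- setdiff(mids, c(start, target))
  two <- if (length(mids) > 0) {
    data.frame(step0 = start, step1 = mids, step2 = target,
               stringsAsFactors = FALSE)
  } else empty

  rbind(one, two)
}

# Toy edge list, made up for illustration (not the scraped data).
toy_edges <- data.frame(
  from = c("Squirrel", "Squirrel", "Rodent", "Nut", "Rodent"),
  to   = c("Rodent", "Nut", "Mammal", "Tree", "Nut"),
  stringsAsFactors = FALSE
)
paths_to_nut <- find_path_sketch(toy_edges, "Nut")
```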

This directory also comes with two .RData files: one-step.RData contains the one-step neighborhood of Squirrel, i.e. all the direct neighbors of the Squirrel page and all the edges between them. The file two-step.RData contains the full two-step neighborhood, i.e. all topics that can be reached within two clicks from the Squirrel page and all the links between these topics.
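
The names of the objects stored in these files are not listed here, so one cautious way to look inside is to load a file into a separate environment first. The helper inspect_rdata below is a hypothetical convenience function, not part of the repository.

```r
# Load an .RData file into its own environment and report what it contains,
# without overwriting anything in the global workspace.
inspect_rdata <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)  # load() returns the restored names
  data.frame(
    object = object_names,
    class = vapply(object_names,
                   function(n) class(get(n, envir = env))[1],
                   character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Possible usage, assuming the file sits in the working directory:
# inspect_rdata("one-step.RData")
```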
