This project crawls the hyperlink neighborhood of the Wikipedia entry for "Squirrel". Why squirrel? Because squirrels are awesome and, as we shall see, have many unexpected connections to reveal. The goal is to scrape all the Wikipedia pages in the two-step neighborhood of Squirrel, i.e. all pages that can be reached with one or two clicks from the Squirrel page.
We will be using rvest to do the scraping of the webpages. rvest is a great package for obtaining information from a specific website. However, it does not naturally crawl, i.e. automatically move from one webpage to the next. Hence, a major part of the code revolves around teaching rvest to crawl, so to speak. This mostly involves a while-loop and some regular expressions. Finally, we will make some pretty graphs using the igraph and diagram packages.
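The idea of a while-loop plus regular expressions can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in wikipedia_webcrawling.R: the names extract_wiki_links, fetch_links and crawl_neighborhood, the regular expression, and the injectable get_links argument are all hypothetical. It assumes rvest 1.0 or later (for html_elements).

```r
library(rvest)

# Hypothetical helper: pull article titles out of one parsed Wikipedia page.
extract_wiki_links <- function(page) {
  hrefs <- html_attr(html_elements(page, "a"), "href")
  hrefs <- hrefs[!is.na(hrefs)]
  # Keep internal article links only: "/wiki/Title" without a namespace
  # colon (File:, Help:, Category:, ...) and without a "#" fragment.
  hrefs <- grep("^/wiki/[^:#]+$", hrefs, value = TRUE)
  unique(sub("^/wiki/", "", hrefs))
}

# Hypothetical fetcher: download one page, pause, return its article links.
fetch_links <- function(topic) {
  page <- read_html(paste0("https://en.wikipedia.org/wiki/", topic))
  Sys.sleep(1)  # pause between requests so we do not hammer the server
  extract_wiki_links(page)
}

# Breadth-first crawl: pages fewer than max_depth clicks from `start` are
# fetched, and every link leaving them is recorded as an edge.
crawl_neighborhood <- function(start, max_depth = 2, get_links = fetch_links) {
  queue <- start
  depth <- setNames(0, start)  # named vector: topic -> clicks from start
  edges <- data.frame(from = character(), to = character(),
                      stringsAsFactors = FALSE)
  while (length(queue) > 0) {
    topic <- queue[1]
    queue <- queue[-1]
    links <- get_links(topic)
    if (length(links) > 0) {
      edges <- rbind(edges, data.frame(from = topic, to = links,
                                       stringsAsFactors = FALSE))
    }
    new_topics <- setdiff(links, names(depth))
    depth[new_topics] <- depth[[topic]] + 1
    # Only queue pages that are still inside the neighborhood to be fetched.
    if (depth[[topic]] + 1 < max_depth) queue <- c(queue, new_topics)
  }
  edges
}

# Usage (performs real requests, so it is left commented out):
# edges <- crawl_neighborhood("Squirrel", max_depth = 2)
```

Passing the link fetcher as an argument is a design choice of this sketch: it lets the loop be tried on a small in-memory graph before any real request is made. Note that the sketch does not fetch the pages at the outermost depth, so links between those pages are not recorded.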
The code and techniques discussed here are very much general purpose. When you embark on your own web crawling quest, you should keep two things in mind:
- Too many requests in too little time will most likely get your IP blocked. After each webpage request, use something like Sys.sleep(1).
- In general, before rvesting a website, check its rules for web crawling at website-you-want-to-crawl.{com, org, co.uk,...}/robots.txt.
- For example, see https://en.wikipedia.org/robots.txt.
- This file tells us which directories to stay away from.
- Watch out for "User-agent: *"; the rules that follow it apply to all crawlers. For instance,
User-agent: *
Disallow: /
means "all robots, stay away from all directories", and you should refrain from crawling that website.
- Visit http://www.robotstxt.org for more info.
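The robots.txt check can also be done programmatically. The following is a rough sketch in base R, assuming a deliberately simplified reading of the format: it only collects Disallow rules in groups addressed to "User-agent: *" and treats each rule as a plain path prefix, ignoring Allow lines and wildcards. The function names disallowed_for_all and is_path_allowed are hypothetical.

```r
# Hypothetical helper: collect the Disallow rules that apply to all crawlers
# ("User-agent: *") from the lines of a robots.txt file.
disallowed_for_all <- function(robots_lines) {
  lines <- trimws(sub("#.*$", "", robots_lines))  # strip comments
  rules <- character()
  applies <- FALSE       # does the current group address "*"?
  in_agent_run <- FALSE  # are we inside a run of User-agent lines?
  for (line in lines) {
    if (grepl("^user-agent:", line, ignore.case = TRUE)) {
      agent <- trimws(sub("^user-agent:", "", line, ignore.case = TRUE))
      # Consecutive User-agent lines share one group of rules.
      applies <- (in_agent_run && applies) || identical(agent, "*")
      in_agent_run <- TRUE
    } else if (nzchar(line)) {
      in_agent_run <- FALSE
      if (applies && grepl("^disallow:", line, ignore.case = TRUE)) {
        rules <- c(rules, trimws(sub("^disallow:", "", line,
                                     ignore.case = TRUE)))
      }
    }
  }
  rules[nzchar(rules)]  # an empty Disallow forbids nothing
}

# Hypothetical helper: TRUE if no collected rule is a prefix of `path`.
is_path_allowed <- function(path, rules) {
  !any(startsWith(path, rules))
}

# The two-line example from the list above forbids everything:
rules <- disallowed_for_all(c("User-agent: *", "Disallow: /"))
is_path_allowed("/wiki/Squirrel", rules)
```

For a real site, the lines would first have to be downloaded from the site's robots.txt URL; this sketch leaves that step out.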
The script wikipedia_webcrawling.R contains the code for the actual web crawling. The script post_processing.R contains the code for cleaning up and exploring the scraped data. It has two main functions: find_path finds all the paths from Squirrel to a given target topic and displays the result in data frame format, while graph_path is the graphical version of find_path and displays all the paths as a graph.
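To make the idea of path finding concrete, here is a small sketch of how paths of at most two clicks could be read off an edge list. It is not the find_path implementation from post_processing.R: the function name find_paths_sketch, the assumed edge data frame with columns from and to, and the output columns from, via and to are all hypothetical.

```r
# Hypothetical sketch: list every path of one or two clicks from `source`
# to `target`, given a data frame of directed edges (columns `from`, `to`).
find_paths_sketch <- function(edges, target, source = "Squirrel") {
  firsts <- unique(edges$to[edges$from == source])       # one click away
  into_target <- unique(edges$from[edges$to == target])  # pages linking to target
  # Intermediate pages: reachable from source AND linking to target.
  mids <- setdiff(intersect(firsts, into_target), c(source, target))
  # A direct link is reported as a row whose `via` entry is NA.
  via <- c(if (target %in% firsts) NA_character_, mids)
  data.frame(from = rep(source, length(via)),
             via = via,
             to = rep(target, length(via)),
             stringsAsFactors = FALSE)
}

# Toy edge list, invented purely for illustration:
toy_edges <- data.frame(
  from = c("Squirrel", "Squirrel", "Tree", "Rodent", "Tree"),
  to   = c("Tree", "Rodent", "Forest", "Forest", "Rodent"),
  stringsAsFactors = FALSE
)
find_paths_sketch(toy_edges, "Forest")
```

A graphical version in the spirit of graph_path could, for instance, turn the resulting rows into an edge list and hand it to igraph::graph_from_data_frame() for plotting.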
This directory also comes with two .RData files: one-step.RData contains the one-step neighborhood of Squirrel, i.e. all the direct neighbors of the Squirrel webpage and all the edges between them. The file two-step.RData contains the full two-step neighborhood, i.e. all topics that can be reached with two clicks from the Squirrel page and all the links between these topics.
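Since the names of the objects stored in these files are not listed here, one way to find out is to load a file into a scratch environment and look at what it contains. The helper inspect_rdata below is a hypothetical convenience function, not part of the repository's scripts.

```r
# Hypothetical helper: list the objects stored in an .RData file without
# cluttering the global environment.
inspect_rdata <- function(path) {
  env <- new.env()
  objs <- load(path, envir = env)  # load() returns the restored object names
  classes <- vapply(objs, function(n) class(get(n, envir = env))[1],
                    character(1))
  data.frame(object = objs, class = unname(classes),
             stringsAsFactors = FALSE)
}

# Usage, assuming the file sits in the working directory:
# inspect_rdata("one-step.RData")
```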