Wikipedia Web Crawling

This project crawls the hyperlink neighborhood of the Wikipedia entry for "Squirrel". Why squirrel? Because squirrels are awesome and, as we shall see, have many unexpected connections to reveal. The goal is to scrape all the Wikipedia pages in the two-step neighborhood of Squirrel, i.e. all pages that can be reached with one or two clicks from the Squirrel page.

We will be using rvest to do the scraping of the webpages. rvest is a great package for obtaining information from a specific website. However, it does not naturally crawl, i.e. automatically move from one webpage to the next. Hence, a major part of the code revolves around teaching rvest to crawl, so to speak. This mostly involves a while-loop and some regular expressions. Finally, we will make some pretty graphs using the igraph and diagram packages.
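The while-loop-plus-regular-expression idea can be sketched roughly as follows. This is a minimal illustration, not the code from the repository's script: the names `extract_wiki_links`, `fetch_wikipedia` and `crawl` are hypothetical, the regular expression is one plausible way to keep only article links, and `html_elements()` assumes rvest 1.0.0 or later (older versions call it `html_nodes()`).

```r
library(rvest)

# Hypothetical helper: return the article titles linked from a parsed page.
extract_wiki_links <- function(page) {
  hrefs <- html_attr(html_elements(page, "a"), "href")
  hrefs <- hrefs[!is.na(hrefs)]
  # Keep internal article links only: "/wiki/Title" without a namespace
  # colon (e.g. "File:", "Help:") and without a "#fragment".
  hrefs <- grep("^/wiki/[^:#]+$", hrefs, value = TRUE)
  unique(sub("^/wiki/", "", hrefs))
}

# Hypothetical default fetcher: download one Wikipedia article, then pause.
fetch_wikipedia <- function(title) {
  page <- read_html(paste0("https://en.wikipedia.org/wiki/", title))
  Sys.sleep(1)  # be polite: roughly one request per second
  page
}

# Breadth-first crawl: fetch every page fewer than `max_depth` clicks away
# from `start` and record the links found on it as an edge list.
crawl <- function(start, max_depth = 2, fetch = fetch_wikipedia) {
  depth <- c(0)
  names(depth) <- start
  queue <- start
  edges <- data.frame(from = character(), to = character(),
                      stringsAsFactors = FALSE)
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (depth[[current]] >= max_depth) next  # too far away: do not fetch
    links <- extract_wiki_links(fetch(current))
    if (length(links) == 0) next
    edges <- rbind(edges, data.frame(from = current, to = links,
                                     stringsAsFactors = FALSE))
    new <- setdiff(links, names(depth))      # pages not seen before
    depth[new] <- depth[[current]] + 1
    queue <- c(queue, new)
  }
  edges
}

# edges <- crawl("Squirrel", max_depth = 2)  # issues many real requests
```

Passing the fetcher as an argument makes it possible to try the loop on in-memory HTML before sending any real requests. Note that this sketch records only links found on pages at depth 0 and 1; capturing the links between the outermost pages as well would require fetching those pages too.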

A general word of caution: rate limits and robots.txt

The code and techniques discussed here are very much general purpose. When you embark on your own web crawling quest, you should keep two things in mind:

- Too many requests in too little time will most likely get your IP blocked. After each webpage request, pause with something like Sys.sleep(1).
- In general, before scraping a website with rvest, check its rules for web crawling at website-you-want-to-crawl.{com, org, co.uk, ...}/robots.txt.
  - For example, https://en.wikipedia.org/robots.txt
  - This file tells us which directories to stay away from.
  - Watch out for "User-agent: *": the rules listed under it apply to all crawlers. In particular,

        User-agent: *
        Disallow: /

    means "all robots, stay away from all directories", and you should refrain from crawling that website.
  - Visit http://www.robotstxt.org for more info.
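As a rough illustration of how such a file can be checked before crawling, here is a deliberately simplified base R sketch. The helpers `disallowed_for_all` and `is_allowed` are hypothetical names, and the parser only understands `User-agent` and `Disallow` lines (no `Allow` lines, no wildcards), so treat it as a first sanity check rather than a complete robots.txt implementation.

```r
# Hypothetical helper: collect the Disallow prefixes listed under
# "User-agent: *". Simplified: ignores Allow lines and wildcard patterns.
disallowed_for_all <- function(robots_lines) {
  lines <- trimws(sub("#.*$", "", robots_lines))  # strip comments
  lines <- lines[nzchar(lines)]                   # drop empty lines
  in_star_group <- FALSE
  prev_was_agent <- FALSE
  rules <- character()
  for (line in lines) {
    field <- tolower(trimws(sub(":.*$", "", line)))
    value <- trimws(sub("^[^:]*:", "", line))
    if (field == "user-agent") {
      # Consecutive User-agent lines share the rules that follow them.
      in_star_group <- (value == "*") || (prev_was_agent && in_star_group)
      prev_was_agent <- TRUE
    } else {
      if (field == "disallow" && in_star_group && nzchar(value)) {
        rules <- c(rules, value)
      }
      prev_was_agent <- FALSE
    }
  }
  rules
}

# A path is allowed if it starts with none of the disallowed prefixes.
is_allowed <- function(path, rules) !any(startsWith(path, rules))

# The "stay away from everything" example from the text:
blocked <- disallowed_for_all(c("User-agent: *", "", "Disallow: /"))
is_allowed("/wiki/Squirrel", blocked)   # FALSE

# A made-up file that only protects one directory:
partial <- disallowed_for_all(c("User-agent: *", "Disallow: /private/"))
is_allowed("/wiki/Squirrel", partial)   # TRUE

# For a live site, the lines could be read with something like
# readLines("https://en.wikipedia.org/robots.txt")
```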

The code

The script wikipedia_crawling.R contains the code for the actual web crawling. The script post_processing.R contains the code for cleaning up and exploring the scraped data. There are two main functions: find_path finds all the paths from Squirrel to a given target topic and displays the result in data frame format, while graph_path is the graphical version of find_path and displays all the paths as a graph.
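To give a feel for what "all the paths from Squirrel to a target" means, here is a small igraph sketch on made-up data. It is not the repository's find_path or graph_path, whose signatures are not shown here; `paths_to` is a hypothetical helper and the edge list is a toy stand-in for the scraped links.

```r
library(igraph)

# Toy edge list standing in for scraped hyperlinks (made-up data).
toy_edges <- data.frame(
  from = c("Squirrel", "Squirrel", "Rodent", "Tree",   "Rodent"),
  to   = c("Rodent",   "Tree",     "Beaver", "Beaver", "Tree"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(toy_edges, directed = TRUE)

# Hypothetical helper: every simple path from `source` to `target` that
# needs at most `max_clicks` clicks, written as "A -> B -> C".
paths_to <- function(graph, target, source = "Squirrel", max_clicks = 2) {
  paths <- all_simple_paths(graph, from = source, to = target,
                            mode = "out", cutoff = max_clicks)
  vapply(paths, function(p) paste(as_ids(p), collapse = " -> "),
         character(1))
}

paths_to(g, "Beaver")

# A graphical view of the same toy network could be drawn with plot(g).
```

On this toy graph the three-click route via Rodent and Tree is excluded by the `max_clicks = 2` cutoff, which mirrors the two-step neighborhood used in this project.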

This directory also comes with two .RData files. one_step.RData contains the one-step neighborhood of Squirrel, i.e. all the direct neighbors of the Squirrel page and all the edges between them. two-step.RData contains the full two-step neighborhood, i.e. all topics that can be reached with at most two clicks from the Squirrel page and all the links between these topics.
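Since the names of the objects stored inside the .RData files are not listed here, one cautious way to explore them is to load a file into its own environment and print what it contains. The helper `inspect_rdata` below is a hypothetical convenience function written in base R, not part of the repository.

```r
# Hypothetical helper: load an .RData file into a fresh environment and
# report the name and class of every object it restores.
inspect_rdata <- function(path) {
  env <- new.env()
  loaded <- load(path, envir = env)  # load() returns the restored names
  for (name in loaded) {
    cat(name, ":", class(get(name, envir = env))[1], "\n")
  }
  invisible(env)
}

# Example call, assuming the file sits in the working directory:
# nbhd <- inspect_rdata("two-step.RData")
# ls(nbhd)
```

Loading into a separate environment avoids silently overwriting objects of the same name in the global workspace.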

stefan-stein/Wikipedia_webcrawling
