Web scraping is an old way of sharing data between services. According to Wikipedia, web scraping (also called web harvesting or web data extraction) is data scraping used for extracting data from websites. This script is a simple web scraper that extracts basic information about any GitHub user. See https://github.com/varunon9/github-scraper/blob/master/data-beautify.json for an example of the extracted information. GitHub provides APIs for the same data, but I wrote this script for learning purposes.
- request: Helps us make HTTP calls
- cheerio: Implementation of core jQuery specifically for the server (helps us traverse the DOM and extract data)
- fs: Node File System (fs) module to implement file input/output
Scraping takes three steps:
- Load the GitHub profile of the given user by making a GET request (request module)
- Parse the HTML result (thanks to cheerio)
- Extract the needed data
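The three steps above can be sketched as follows. This is a minimal illustration, not the contents of index.js: the function names `extractUser` and `scrapeProfile` and the entries in `SELECTORS` are assumptions made for this example, and the CSS selectors in particular may not match GitHub's current markup.

```javascript
// Hypothetical selectors: verify them against the live page before relying on them.
const SELECTORS = {
  name: '.vcard-fullname',
  username: '.vcard-username',
};

// Step 3: read the text of the first element matching each selector.
// `$` is the function returned by cheerio.load(html).
function extractUser($, selectors) {
  const user = {};
  for (const [field, selector] of Object.entries(selectors)) {
    user[field] = $(selector).first().text().trim();
  }
  return user;
}

// Steps 1 and 2: fetch the profile page with request, then parse it with cheerio.
// The modules are required inside the function so that extractUser
// can be exercised without them.
function scrapeProfile(url, callback) {
  const request = require('request');
  const cheerio = require('cheerio');

  request(url, (error, response, body) => {
    if (error) {
      return callback(error);
    }
    if (response.statusCode !== 200) {
      return callback(new Error('Unexpected status code: ' + response.statusCode));
    }
    const $ = cheerio.load(body);
    callback(null, extractUser($, SELECTORS));
  });
}
```

Keeping the extraction step separate from the network call means that only the `SELECTORS` object needs to change when the page structure does.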
For step 3, we must know the corresponding DOM elements in advance. You can find them with the 'Inspect Element' feature of your browser: visit the GitHub profile of any user and inspect elements by right-clicking or pressing Ctrl + Shift + I. You can also view the page source. See screenshot 1. Read the index.js file for more details. Note that the script will stop working once GitHub changes its DOM elements. However, you will have the idea and can rewrite the script.
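One way to notice that the page structure has changed is to check the extracted object for empty fields, since a selector that no longer matches yields no text. The sketch below assumes the scraper produces a flat object of strings; the helper name `findMissingFields` and the sample field names are hypothetical and not taken from index.js.

```javascript
// Returns the names of fields whose extracted value is missing or blank.
// A blank value usually means the selector matched nothing.
function findMissingFields(user) {
  return Object.keys(user).filter((field) => {
    const value = user[field];
    return value === undefined || value === null || String(value).trim() === '';
  });
}

// Example with placeholder data: `username` came back empty.
const missing = findMissingFields({ name: 'Example User', username: '' });
if (missing.length > 0) {
  console.warn('No data for: ' + missing.join(', ') + '. The selectors may be outdated.');
}
```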
- To execute this script you must have Node.js installed.
- Download the zip file (or run git clone) and extract it to your hard disk
- Open a terminal/cmd
- Move to the script directory (where you extracted the zip file) using
cd /path/to/repository
- Run
npm install
to install all Node.js dependencies
- Once all the dependencies have been installed, type
node index.js <url>
- Replace <url> with the URL of the GitHub user whose information you want to extract, e.g. https://github.com/varunon9
- Depending on your internet speed, this will take some time. The output is printed to the screen once it finishes.
- The script also writes this data to disk. Check the user.json file in this directory.
- You will have to beautify the JSON data to make it readable. You can use https://jsonformatter.curiousconcept.com/
- See data-beautify.json for the extracted data after beautification.
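As an alternative to an online formatter, the JSON can be beautified locally with Node's built-in `JSON.stringify`, whose third argument sets the indentation. This is a small sketch; the helper names `beautifyJson` and `beautifyFile` and the output file name in the usage comment are assumptions for illustration.

```javascript
const fs = require('fs');

// Re-serializes a JSON string with 2-space indentation.
function beautifyJson(rawJson) {
  return JSON.stringify(JSON.parse(rawJson), null, 2);
}

// Reads the JSON file at inputPath and writes a beautified copy to outputPath.
function beautifyFile(inputPath, outputPath) {
  const raw = fs.readFileSync(inputPath, 'utf8');
  fs.writeFileSync(outputPath, beautifyJson(raw) + '\n');
}

// Usage, assuming user.json exists in the current directory:
// beautifyFile('user.json', 'user-beautify.json');
```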