To use this crawler, install Scrapy and clone this repository.
$ pip3 install scrapy
$ git clone https://github.com/vksbhandary/nepali-news-crawler.git
$ cd nepali-news-crawler
$ scrapy crawl news_hamrakura -o hamrakura.csv -t csv
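The crawl command above writes the scraped items to hamrakura.csv. As a quick sanity check, the following is a minimal sketch for inspecting that file with Python's standard csv module. The helper name `summarize_csv` is hypothetical and not part of this repository, and the sketch assumes only that the file starts with a header row; it makes no assumption about which columns the spider emits.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Count data rows; the header row is consumed by DictReader itself.
        row_count = sum(1 for _ in reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        columns = list(reader.fieldnames or [])
    return columns, row_count


if __name__ == "__main__":
    # hamrakura.csv is the output file named in the crawl command.
    output = Path("hamrakura.csv")
    if output.exists():
        columns, rows = summarize_csv(output)
        print(f"columns: {columns}")
        print(f"rows: {rows}")
    else:
        print(f"{output} not found; run the crawl command first")
```

Opening the file with `newline=""` lets the csv module handle line breaks inside quoted fields, which can occur in article text.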
Supported websites:
- hamrakura
- kantipurdaily
- onlinekhabar
- pahilopost
- WordPress websites (see note 1)
  - nepalitribune
  - news24nepal
  - nepalitimes
  - Any other WordPress website (see note 2)
Executing hamrakura crawler
$ scrapy crawl news_hamrakura -o hamrakura.csv -t csv
Executing kantipurdaily crawler
$ scrapy crawl kanti_news -o kantipur.csv -t csv
Executing onlinekhabar crawler
$ scrapy crawl news_onlinekhabar -o onlinekhabar.csv -t csv
Executing pahilopost crawler
$ scrapy crawl news_pahilo -o file.csv -t csv
Executing wordpress crawler
$ scrapy crawl wordpress_news -o news24nepal.csv -t csv
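The per-crawler commands listed so far can also be run in one go. The following is a rough sketch, not part of this repository: it assumes it is started from the repository root with scrapy on the PATH, and the names `CRAWLS`, `build_command`, and `run_all` are hypothetical. The spider names, output files, and flags are copied unchanged from the commands in this README.

```python
import subprocess

# (spider name, output file) pairs copied from the commands in this README.
CRAWLS = [
    ("news_hamrakura", "hamrakura.csv"),
    ("kanti_news", "kantipur.csv"),
    ("news_onlinekhabar", "onlinekhabar.csv"),
    ("news_pahilo", "file.csv"),
    ("wordpress_news", "news24nepal.csv"),
]


def build_command(spider, output):
    """Build the same argument list as the shell commands in this README."""
    return ["scrapy", "crawl", spider, "-o", output, "-t", "csv"]


def run_all(crawls=CRAWLS):
    """Run each crawl in turn and return the names of spiders that failed."""
    failed = []
    for spider, output in crawls:
        # A non-zero exit code is recorded instead of aborting the whole batch.
        result = subprocess.run(build_command(spider, output))
        if result.returncode != 0:
            failed.append(spider)
    return failed
```

Calling `run_all()` runs the crawlers one after another and returns a list of the spiders whose command exited with an error, so a single failing site does not stop the remaining crawls.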
1. To use the WordPress crawler with your own website, follow these steps:
- Open the file spiders/wordpress.py
- Edit line 14 to add your domain
- Open your terminal and execute
$ scrapy crawl wordpress_news -o news24nepal.csv -t csv
2. This crawler uses the WordPress REST API to fetch posts, so a website must have the REST API enabled for this crawler to work. To check whether a WordPress website is supported by this crawler:
- Go to yourdomainname.com/wp-json/wp/v2/posts/
- If you see JSON data, the website is supported.
- If you see a 404 or forbidden error page, the website is not supported.
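The manual check described in note 2 can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than code from this repository: the function names are hypothetical, the `https://` scheme is assumed (the note gives only the domain and path), and "supported" is interpreted as an HTTP 200 response whose body parses as a JSON list, which is what the WordPress posts endpoint returns.

```python
import json
import urllib.request


def posts_endpoint(domain):
    """Build the posts URL from note 2 for a bare domain (https is assumed)."""
    return f"https://{domain.strip('/')}/wp-json/wp/v2/posts/"


def looks_supported(status, body):
    """Return True if the status is 200 and the body parses as a JSON list."""
    if status != 200:
        return False
    try:
        return isinstance(json.loads(body), list)
    except (ValueError, TypeError):
        # The body was not JSON, for example an HTML error page.
        return False


def check_site(domain, timeout=10):
    """Fetch the posts endpoint and report whether the site looks usable."""
    request = urllib.request.Request(
        posts_endpoint(domain), headers={"User-Agent": "wp-rest-check"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return looks_supported(response.status, body)
    except OSError:
        # Covers HTTP errors such as 404 and 403, DNS failures, and timeouts.
        return False
```

For example, `check_site("yourdomainname.com")` returns `True` only when the endpoint answers with a JSON list of posts. Keeping `looks_supported` separate from the network call makes the decision rule easy to test without contacting a real site.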