A collection of scrapers for gathering data from grant funders, intended to be used in the Beehive funding platform.
Written using Python 3 and Scrapy.
- Clone into a new directory:
git clone https://github.com/TechforgoodCAST/beehive-scrapers.git
- Set up a virtual environment:
python3 -m venv env
- Activate the virtual environment:
source env/bin/activate (Linux/macOS)
env\Scripts\activate (Windows)
- Install requirements:
pip install -r requirements.txt
- (Windows only) install pypiwin32:
pip install pypiwin32
Run the command:
scrapy genspider -t fund_spider fundname "fundurl.com/path-to-fund-list"
Where:
- fundname is the name of the funder (all lowercase, no spaces or special characters)
- "fundurl.com/path-to-fund-list" is the URL of the fund list page.
This will generate a skeleton scraper with the capability to:
- go through a fund list page
- generate titles and links for each fund
- go to a particular fund page and get more details
- go to the next page if the fund list is on more than one page
You'll need to adjust the CSS selectors depending on the exact structure of the list page.
To output the funds found to a funds.jl JSON Lines file, run (using the comicrelief spider as an example):
scrapy crawl comicrelief -o funds.jl
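A JSON Lines file holds one JSON object per line, so it can be consumed with the standard library alone. A minimal reader sketch (the field names in the sample are illustrative, not the scrapers' actual schema):

```python
import json


def load_funds(path):
    """Read a JSON Lines file: one JSON object per line."""
    funds = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                funds.append(json.loads(line))
    return funds


# Example: write a two-line sample file and read it back.
sample = (
    '{"title": "Fund A", "url": "https://example.com/a"}\n'
    '{"title": "Fund B", "url": "https://example.com/b"}\n'
)
with open("funds_sample.jl", "w", encoding="utf-8") as f:
    f.write(sample)

funds = load_funds("funds_sample.jl")
print(len(funds))  # -> 2
```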
To run all spiders use the following command:
python funderscrapers/crawl_all.py
You can also use crawl_all.bat on Windows or ./crawl_all.sh in Bash.