A little tool collection to help you collect data from GitHub for research. This tool is based on my blog post: Systematic review of repositories on GitHub with python (Game Dev Style)
Note: This repository is inspired by work from the Department of Information and Computing Sciences, Utrecht University: A Systematic Review of Open Source Clinical Software on GitHub for Improving Software Reuse in Smart Healthcare by Zhengru Shen and Marco Spruit.
```
$ git clone https://github.com/simonrenger/collect-data-from-github.git
$ pip install PyGithub
$ pip install pandas
```
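These two packages are the tool's dependencies: PyGithub talks to the GitHub API and pandas handles the tabular output. As a rough sketch of how that combination works (illustrative only, not the actual code of `collect.py`; the token and query string are placeholders):

```python
from github import Github
import pandas as pd

# Authenticate against the GitHub API (the token is a placeholder).
gh = Github("my_token")

# Search repositories using GitHub's search syntax.
repos = gh.search_repositories(query="game engine language:c++")

# Collect a few attributes per repository and write them out as CSV.
rows = [{"name": r.full_name, "stars": r.stargazers_count, "language": r.language}
        for r in repos[:10]]
pd.DataFrame(rows).to_csv("repos.csv", index=False)
```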
Call the help function:
```
python collect.py --help
```
You need to provide a `config.json` file:
| Field | Type | Optional | Description |
|---|---|---|---|
| `token` | string | Yes | If present, it must contain a valid GitHub token. You can obtain it here: Settings/Token (scope: `repos`). If not provided, `--token {TOKEN}` must be passed on the command line. |
| `readme_dir` | string | Yes | If present, the tool will automatically download the GitHub readme files into this location. |
| `output` | string | Yes | If present, the tool will store the collected data in this location. Default: `./` |
| `format` | string | Yes | If present, it determines the output format. Valid values: `JSON`, `CSV`, `HTML`, `MARKDOWN`. Default: `CSV` |
| `criteria` | object | No | Must contain an entry called `time` with the fields `min` or `max`. |
| `terms` | array | No | List of search terms following the GitHub search syntax (see: Understanding the search syntax). |
| `attrs` | array | No | List of attributes from the GitHub REST API `repo` object. |
Note: There is a sample config in the `samples` folder.
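For orientation, a minimal config could look like the sketch below. All values are illustrative assumptions (in particular the date format under `criteria.time`); consult the sample in the `samples` folder for the authoritative layout.

```json
{
    "token": "ghp_yourTokenHere",
    "readme_dir": "./readmes",
    "output": "./",
    "format": "CSV",
    "criteria": {
        "time": { "min": "2018-01-01", "max": "2020-12-31" }
    },
    "terms": ["game engine language:c++"],
    "attrs": ["full_name", "stargazers_count", "language"]
}
```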
The help output above gives you an idea of how to run the tool, but there is a faster way:
```
python collect.py config.json
```
And if you want to pass a token along:
```
python collect.py --token my_token config.json
```
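Since the default output format is CSV, the collected data can be loaded straight back into pandas for analysis. A small sketch (the file name is hypothetical; check your configured `output` location for the file the tool actually wrote):

```python
import pandas as pd

# The file name below is hypothetical -- look in your configured
# `output` location for the actual file the tool produced.
df = pd.read_csv("output.csv")
print(df.head())
```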
- Add more criteria for filtering repositories, e.g. by language
- Add an option to exclude archived repositories