bread-b4nk/cmoncrawl-fetcher

CmonCrawl-Fetcher

Downloads desired files from Common Crawl data by filetype

Usage

usage: cmoncrawl-fetcher.py [-h] -l <limit> -f FILETYPES [FILETYPES ...] [-p NUM_PROCS] -o OUTPUT [-t TOLERANCE]

Python package that downloads files from common crawler's database. An example usage is `cmoncrawl-fetcher.py -l 5 -f jpg png -o out_dir` This'll make it download 5 jpgs and 5 pngs into out_dir

options:
  -h, --help            show this help message and exit
  -l <limit>, --limit <limit>
                        Number of images per filetype desired
  -f FILETYPES [FILETYPES ...], --filetypes FILETYPES [FILETYPES ...]
                        Desired filetypes to fetch NOTE: check the config file: filetype_config.json, if the desired filetype is in there, make sure the filetype passed in lines up. Put in '*' for all
                        filetypes
  -p NUM_PROCS, --num_procs NUM_PROCS
                        Number of processes to use, default is 1
  -o OUTPUT, --output OUTPUT
                        Output directory to store downloaded files
  -t TOLERANCE, --tolerance TOLERANCE
                        Number of fails for a given hostname before we ignore this host
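The `-t` option can be pictured as a per-host failure counter. The sketch below is only an illustration of that idea: the `HostTolerance` class and its method names are hypothetical and are not taken from the project's code, and it assumes a host is ignored once its failure count reaches the tolerance value.

```python
from collections import Counter
from urllib.parse import urlparse


class HostTolerance:
    """Hypothetical helper: counts failed downloads per hostname."""

    def __init__(self, tolerance: int) -> None:
        self.tolerance = tolerance
        self.failures: Counter = Counter()

    def record_failure(self, url: str) -> None:
        """Increment the failure count for the host that served `url`."""
        host = urlparse(url).hostname
        if host is not None:
            self.failures[host] += 1

    def should_skip(self, url: str) -> bool:
        """Return True once the host has failed `tolerance` times (assumed rule)."""
        host = urlparse(url).hostname
        # A Counter returns 0 for hosts that have never failed.
        return host is not None and self.failures[host] >= self.tolerance
```

A downloader could call `should_skip(url)` before each request and `record_failure(url)` whenever a request fails.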

The required arguments are the desired filetypes, given as extensions (-f), the output directory (-o), and the number of files to download for each extension (-l).
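For readers who want to see how the required and optional flags fit together, here is a minimal argparse sketch that mirrors the help text above. It is a reconstruction from the usage message, not the project's actual parser: the `build_parser` name, the `int` types, and the `None` default for `--tolerance` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the CLI from the help text (a sketch, not the real source)."""
    parser = argparse.ArgumentParser(prog="cmoncrawl-fetcher.py")
    parser.add_argument("-l", "--limit", metavar="<limit>", type=int, required=True,
                        help="Number of files per filetype desired")
    parser.add_argument("-f", "--filetypes", nargs="+", required=True,
                        help="Desired filetypes to fetch")
    parser.add_argument("-p", "--num_procs", type=int, default=1,
                        help="Number of processes to use, default is 1")
    parser.add_argument("-o", "--output", required=True,
                        help="Output directory to store downloaded files")
    # The default for --tolerance is not documented; None is an assumption.
    parser.add_argument("-t", "--tolerance", type=int, default=None,
                        help="Number of fails for a hostname before it is ignored")
    return parser


# Same arguments as the example in the help text.
args = build_parser().parse_args(["-l", "5", "-f", "jpg", "png", "-o", "out_dir"])
print(args.limit, args.filetypes, args.num_procs, args.output)
```

Omitting any of `-l`, `-f`, or `-o` makes argparse print a usage error and exit.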

By default, we prioritize records whose Content-Type signifies the filetype over records that match only by extension. The content types that correspond to each filetype are stored in filetype_config.json, which currently covers 69 filetypes.
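One possible reading of this rule, sketched in Python: check the record's content type against the configured list first, and fall back to the URL's extension only if that fails. The `FILETYPE_CONFIG` layout and the `match_filetype` function are hypothetical. The real schema of filetype_config.json and the project's matching logic may differ.

```python
from pathlib import PurePosixPath
from typing import Optional, Sequence
from urllib.parse import urlparse

# Assumed layout: extension -> list of matching content types.
# The actual filetype_config.json may be structured differently.
FILETYPE_CONFIG = {
    "jpg": ["image/jpeg"],
    "png": ["image/png"],
}


def match_filetype(url: str, content_type: Optional[str],
                   wanted: Sequence[str]) -> Optional[str]:
    """Return the wanted filetype this record matches, or None."""
    # 1. Prefer the content type, ignoring parameters such as "; charset=...".
    if content_type:
        mime = content_type.split(";")[0].strip().lower()
        for filetype in wanted:
            if mime in FILETYPE_CONFIG.get(filetype, []):
                return filetype
    # 2. Fall back to the extension in the URL path.
    ext = PurePosixPath(urlparse(url).path).suffix.lstrip(".").lower()
    return ext if ext in wanted else None
```

Under this sketch, a URL ending in `.png` that is served as `image/jpeg` counts as a `jpg`, because the content type wins over the extension.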

Contributing

  • Package installation uses poetry, but this is subject to change in the future.

  • We use git-flow workflows for our development. Depending on the kind of feature you are contributing, create a hotfix branch or a feature branch. Installation instructions are here.

  • We also require pre-commit hooks. You can follow the instructions here to install pre-commit.

  • You will need to install the hooks in the yaml file in the repository using the following command: pre-commit install.

License

We've released this project under GPLv3. Check the LICENSE file for more details.
