Downloads desired files from Common Crawl data by filetype.

A Python package that downloads files from Common Crawl's database. An example usage is `cmoncrawl-fetcher.py -l 5 -f jpg png -o out_dir`, which downloads 5 JPGs and 5 PNGs into `out_dir`.

```
usage: cmoncrawl-fetcher.py [-h] -l <limit> -f FILETYPES [FILETYPES ...] [-p NUM_PROCS] -o OUTPUT [-t TOLERANCE]

options:
  -h, --help            show this help message and exit
  -l <limit>, --limit <limit>
                        Number of files per filetype desired
  -f FILETYPES [FILETYPES ...], --filetypes FILETYPES [FILETYPES ...]
                        Desired filetypes to fetch. NOTE: check the config file filetype_config.json; if the
                        desired filetype is in there, make sure the filetype passed in lines up. Pass '*' for
                        all filetypes.
  -p NUM_PROCS, --num_procs NUM_PROCS
                        Number of processes to use (default: 1)
  -o OUTPUT, --output OUTPUT
                        Output directory to store downloaded files
  -t TOLERANCE, --tolerance TOLERANCE
                        Number of failures for a given hostname before we ignore that host
```
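For example, an invocation combining the optional flags might look like the following (the filetypes and directory name are illustrative):

```sh
# Fetch 10 files each of jpg and png using 4 worker processes, giving up
# on any hostname after 3 failed downloads:
python cmoncrawl-fetcher.py -l 10 -f jpg png -p 4 -o out_dir -t 3
```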
Required arguments are the desired filetype(s) (given as extensions), the output directory, and the number of desired files for each extension. By default we prioritize files whose Content-Type signifies the filetype over those matched only by extension; the content types corresponding to each filetype are stored in filetype_config.json. We currently support 69 file types.
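The authoritative schema of filetype_config.json is the file itself in the repository; as a minimal sketch, assuming it maps each extension to its associated content types, you could inspect it like this (the `"jpg"` key and the list-of-strings value shape are assumptions):

```python
import json

# Illustrative only: we assume filetype_config.json maps a file extension to
# the Content-Type values that signify it; consult the real file for the
# actual schema.
with open("filetype_config.json") as f:
    filetype_config = json.load(f)

print(filetype_config.get("jpg"))  # e.g. a list like ["image/jpeg"]
```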
- Package installation uses poetry, but this is subject to change in the future (see the sketch after this list).
- We use git-flow workflows for our development. Depending on the kind of feature you are contributing, create a `hotfix` branch or a `feature` branch (example commands below). Installation instructions are here.
- We also require pre-commit hooks. You can follow the instructions here to install them.
- You will need to install the hooks in the `yaml` file in the repository using the following command: `pre-commit install` (see the example after this list).
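As a minimal sketch of the steps above, assuming standard poetry tooling (the output directory is illustrative):

```sh
# Install dependencies with poetry (assumes a pyproject.toml in the repo root):
poetry install

# Run the fetcher inside the poetry-managed environment:
poetry run python cmoncrawl-fetcher.py -l 5 -f jpg png -o out_dir
```

To start a branch with the git-flow extension, assuming it is installed and initialized (the branch names here are hypothetical):

```sh
git flow feature start add-new-filetype   # feature branch
git flow hotfix start fix-host-tolerance  # hotfix branch
```

And to set up the pre-commit hooks:

```sh
pip install pre-commit   # pre-commit itself is a Python package
pre-commit install       # registers the hooks from the repository's yaml config

# Optionally, run every hook against the whole codebase once:
pre-commit run --all-files
```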
We've released this project under GPLv3. Check the LICENSE file for more details.