This repository will contain a demo using Weaviate with data and metadata from the arXiv dataset.
The code is tested with Python version 3.8.5.
- Spin up a default Weaviate instance with docker-compose (see https://weaviate.io/developers/weaviate/current/getting-started/installation.html#docker-compose).
- Run
python start_project.py
, with the following optional arguments. If a config file (-cf CONFIG_FILE, --config_file CONFIG_FILE
) is given, all other parameters are ignored:
short argument | long argument | default value | description |
---|---|---|---|
-cf | --config_file | config file name | |
-i | --metadata_file | data/arxiv-metadata-oai-snapshot.json | location and name of the arXiv metadata json file |
-s | --schema | project/schema.json | location and name of the schema |
-w | --weaviate | http://localhost:8080 | weaviate url |
-np | --n_papers | 1000000000 | maximum number of papers to import |
-snp | --skip_n_papers | 0 | number of papers to skip before starting the import |
-po | --papers_only | false | skips all other data object imports except for papers if set to True, and ignores --skip_journals, --skip_authors and --skip_taxonomy |
-sj | --skip_journals | false | whether you want to skip the import of all the journals |
-sa | --skip_authors | false | whether you want to skip the import of all the authors |
-st | --skip_taxonomy | false | whether you want to skip the import of all the arxiv taxonomy objects |
-to | --timeout | 20 | max time out in seconds for the python client batching operations |
-ows | --overwrite_schema | false | overwrites the schema if one is present and one is given |
-bs | --batch_size | 512 | maximum number of data objects to be sent in one batch |
- If you want to import the whole arXiv dataset of 2.65GB, make sure you have enough memory resources available in your environment (and Docker setup, I allocated 200GB for the Docker image size).
- In addition, set the
--timeout
parameter to at least 50, to avoid batches to fail because of longer read and write times. - Moreover, make sure to allocate enough memory for ES, by setting
ES_JAVA_OPTS: -Xms4g -Xmx4g
indocker-compose.yaml
Branch | Status |
---|---|
Master |