Imports raw JSON into Elasticsearch in a multi-threaded way
There are five modes here:
- Only validating the data
- Import data to Elasticsearch without validation
  - Import using a single thread
  - Import using multiple threads
- Import data to Elasticsearch after validation
  - Import using a single thread
  - Import using multiple threads
Install the elasticsearch package with pip:
pip install elasticsearch
Read more about versions here
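If you want to confirm that the installed client can actually reach your cluster before running the script, a quick connection check is enough. This is only a minimal sketch, assuming a node at http://localhost:9200 (the same endpoint used in the examples below):

```python
# Minimal connection check; the endpoint is an assumption, adjust it to your setup.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
print(es.info())  # prints cluster name and version if the client can connect
```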
--data : The data file
--check : Validate the data file
--bulk : Elasticsearch endpoint (e.g. http://localhost:9200)
--index : Index name
--type : Index type
--import : Import data to Elasticsearch
--thread : Number of threads (default: 1)
--help : Display the help message
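For illustration only, the flags above could be wired up with argparse roughly like this; the real import.py may parse them differently:

```python
# Hypothetical argparse wiring for the documented flags; not the script's actual code.
import argparse

parser = argparse.ArgumentParser(description="Import raw JSON into Elasticsearch")
parser.add_argument("--data", help="The data file")
parser.add_argument("--check", action="store_true", help="Validate the data file")
parser.add_argument("--bulk", help="Elasticsearch endpoint, e.g. http://localhost:9200")
parser.add_argument("--index", help="Index name")
parser.add_argument("--type", help="Index type")
parser.add_argument("--import", dest="do_import", action="store_true",
                    help="Import data to Elasticsearch")
parser.add_argument("--thread", type=int, default=1, help="Number of threads (default: 1)")
args = parser.parse_args()  # --help is generated automatically by argparse
```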
I suggest you check your data before (or during) the import process.
python import.py --data test_data.json --check
python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name
python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --check
python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --thread 16
python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --check --thread 16
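The --check pass mentioned above essentially means validating the file before indexing it. The sketch below validates a JSON-lines file line by line; it is only an illustration of the idea, not the actual --check implementation, and test_data.json is taken from the examples above:

```python
# A rough idea of what validating a JSON-lines data file could look like.
# The real --check implementation in import.py may behave differently.
import json

def check_file(path):
    bad_lines = 0
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                bad_lines += 1
                print(f"Line {number}: invalid JSON ({err})")
    print(f"Done, {bad_lines} invalid line(s) found")

check_file("test_data.json")
```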
The multi-threaded mode is much faster; the exact speed-up depends on your computer/server resources. This script uses linecache
to keep the data in RAM, so you also need enough memory capacity.
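As a rough, hedged illustration of that idea (not the script's actual code), the sketch below caches the file with linecache and lets a thread pool bulk-index line ranges; the endpoint, index name, file name, and thread count are assumptions taken from the examples above:

```python
# Simplified sketch: linecache keeps the file cached in RAM while each worker
# bulk-indexes its own slice of lines. All names below are assumptions.
import json
import linecache
from concurrent.futures import ThreadPoolExecutor

from elasticsearch import Elasticsearch, helpers

DATA_FILE = "test_data.json"
ES = Elasticsearch("http://localhost:9200")  # the client transport is thread-safe
INDEX = "index_name"
THREADS = 16

def index_slice(start, stop):
    """Bulk-index lines [start, stop) of the data file (1-based line numbers)."""
    actions = []
    for lineno in range(start, stop):
        line = linecache.getline(DATA_FILE, lineno).strip()
        if line:
            actions.append({"_index": INDEX, "_source": json.loads(line)})
    if actions:
        helpers.bulk(ES, actions)

# Count the lines, then split them into one chunk per worker.
with open(DATA_FILE, encoding="utf-8") as f:
    total = sum(1 for _ in f)

chunk = max(1, total // THREADS + 1)
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    futures = [
        pool.submit(index_slice, start, min(start + chunk, total + 1))
        for start in range(1, total + 1, chunk)
    ]
    for future in futures:
        future.result()  # re-raise any worker errors
```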
- AMD Ryzen 3800X (8 cores / 16 threads)
- 64GB RAM (3000MHz / CL16)
- Windows 10
- 10GB JSON file with ~24 million objects
- Elasticsearch v7
The whole process took about 30 minutes, and resource usage was efficient.
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request :D
Every project has its share of problems. Help this project develop by reporting any issues you find.