This is an open-source project for preparing data for large language models. Everyone is racing to pre-train and fine-tune models, yet most public projects rarely describe how they clean their data.
I hope this project can take on as much of the data-cleaning work as possible, so that everyone can focus on model training and fine-tuning.
- Document parsing
- Format fixing
- Document deduplication
- Document classification
- Document clustering
- Text cleaning
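
To give a rough idea of what two of the steps listed above (text cleaning and document deduplication) can look like, here is a minimal sketch using only the Python standard library. It is not this project's actual implementation: the function names `clean_text` and `deduplicate` are hypothetical, and exact-match hashing is just one simple deduplication strategy assumed for illustration.

```python
import hashlib
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode, drop control characters, and collapse whitespace.

    Hypothetical helper for illustration; not part of this project's API.
    """
    # NFKC folds compatibility characters (e.g. full-width letters) together.
    text = unicodedata.normalize("NFKC", text)
    # Keep newlines and tabs; drop every other control character (category Cc).
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and drop lines that end up empty.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


def deduplicate(docs):
    """Return cleaned documents with exact duplicates removed.

    Assumption: two documents are duplicates when their cleaned,
    case-folded text is identical. Near-duplicates are not detected.
    """
    seen = set()
    unique = []
    for doc in docs:
        cleaned = clean_text(doc)
        if not cleaned:
            continue  # skip documents that are empty after cleaning
        digest = hashlib.sha256(cleaned.casefold().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique


if __name__ == "__main__":
    sample = ["Hello  world", "hello world ", "", "Other"]
    print(deduplicate(sample))
```

Hashing the cleaned text keeps memory use low on large corpora, since only digests are stored rather than full documents. Catching near-duplicates would need a fuzzier technique, which this sketch does not attempt.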