Lumix

Lumix is an open-source project for preparing data for large language models. Because everyone is busy pre-training and fine-tuning large models, most public projects rarely describe in detail how they process and clean their data.

I hope this project can take on as much of the data cleaning work as possible, so that everyone can focus on model training and fine-tuning.

Project structure

Document parsing
Fixed format
Document deduplication
Document classification
Document clustering
Text cleaning
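
To make two of the stages above concrete, here is a minimal sketch of text cleaning followed by exact document deduplication. It is not taken from the Lumix code: the function names `clean_text`, `deduplicate`, and `run_pipeline` are hypothetical, and the cleaning rules (Unicode NFKC normalization, control-character removal, whitespace collapsing) and the SHA-256 content hash are assumptions chosen only for illustration.

```python
import hashlib
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Remove control characters, keeping newlines and tabs for the next step.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and drop lines that end up empty.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


def deduplicate(docs):
    """Keep the first occurrence of each document, comparing by content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def run_pipeline(raw_docs):
    """Clean every document, discard empty ones, then remove exact duplicates."""
    cleaned = [clean_text(doc) for doc in raw_docs]
    return deduplicate([doc for doc in cleaned if doc])
```

Cleaning runs before deduplication here so that documents differing only in whitespace or stray control characters hash to the same value. Near-duplicate detection (for example MinHash) would need a different comparison step and is not covered by this sketch.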

Folders and files

Classification
Clean
Cluster
Deduplication
Filter
Origin
Parser
Store
docs
README.md
README_zh.md
clean_main.py
fast.yml
requirements.txt
utils.py

xcxhy/Lumix
