This project is part of the "Big Data" module taught in the fifth year of our engineering program.
The goal of the project is to build a document classifier with the k-means++ clustering algorithm, starting from a data lake of labeled documents.
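
As a rough illustration of the seeding step that distinguishes k-means++ from plain k-means, the sketch below picks each new initial center with probability proportional to its squared distance from the nearest center already chosen. It is plain Python written for this README, not the project's Spark code, and the names `squared_distance` and `kmeans_pp_init` are hypothetical. In Spark ML, the `KMeans` estimator offers a comparable, parallelized initialization through `initMode="k-means||"`.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers with k-means++ seeding (squared-distance sampling)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a center: fall back to a uniform pick.
            centers.append(rng.choice(points))
            continue
        # Sample an index with probability proportional to d2[i].
        threshold = rng.random() * total
        cumulative = 0.0
        chosen = len(points) - 1  # guard against floating-point rounding
        for i, d in enumerate(d2):
            cumulative += d
            if cumulative > threshold:
                chosen = i
                break
        centers.append(points[chosen])
        # Refresh nearest-center distances with the newly added center.
        d2 = [min(old, squared_distance(p, points[chosen]))
              for old, p in zip(d2, points)]
    return centers


# Toy data: two well-separated groups of 2-D points (illustrative only).
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
initial_centers = kmeans_pp_init(toy_points, k=2, seed=1)
```

The returned centers would then feed the usual k-means loop of assigning points to the nearest center and recomputing the centers until they stop moving.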
The project covers the following steps:
- Preprocessing / cleaning of the corpora
- Feature engineering
- Feature selection
- Vectorization
- Clustering using k-means++ in Spark
- Cluster evaluation metrics: purity and silhouette score
- Optimization of the metric results
- Association of metadata with each document
- Preparation of CSV exports for data visualization in the TensorFlow Embedding Projector: https://projector.tensorflow.org/
- Convergence of clusters
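
The export step for the Embedding Projector can be sketched as follows. The projector's load dialog expects tab-separated files (one file of vectors without a header, and one metadata file), so this sketch writes tab-delimited output with the standard `csv` module. The function name `export_for_projector`, the file names, and the metadata columns are assumptions made for illustration, not the project's actual export code.

```python
import csv


def export_for_projector(vectors, metadata, columns, vectors_path, metadata_path):
    """Write document vectors and per-document metadata as tab-separated files."""
    if len(vectors) != len(metadata):
        raise ValueError("vectors and metadata must have the same number of rows")
    with open(vectors_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        # One vector per line, no header row.
        writer.writerows(vectors)
    with open(metadata_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        # The projector expects a header row only when the metadata
        # has more than one column.
        if len(columns) > 1:
            writer.writerow(columns)
        writer.writerows(metadata)
```

A call could look like `export_for_projector(doc_vectors, doc_metadata, ["title", "cluster"], "vectors.tsv", "metadata.tsv")`, where row `i` of the metadata describes row `i` of the vectors.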
Each trained cluster represents one class of the data lake. The quality of these classes is evaluated with the purity metric.
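
Purity assigns each cluster its most frequent true label and reports the fraction of documents that carry the majority label of their cluster: purity = (1/N) × Σ over clusters of the size of the largest label group in that cluster. The following is a minimal sketch in plain Python, assuming cluster assignments and true labels are available as two parallel lists; the function name `purity` and the toy labels are illustrative only.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of documents whose label is the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")
    # Count how often each true label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts[cluster][label] += 1
    # Sum the size of the largest label group in every cluster.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(cluster_ids)


# Toy example: 5 of the 6 documents carry their cluster's majority label.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["sport", "sport", "politics", "tech", "tech", "tech"]
toy_purity = purity(toy_clusters, toy_labels)
```

A purity of 1 means every cluster contains a single class. Purity tends to rise as the number of clusters grows, which is one reason to read it alongside the silhouette score.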
- Purity achieved: 0.9 (on a scale from 0 to 1)
- Scientific paper used: https://www.researchgate.net/publication/236124137_Information-theoretic_Term_Weighting_Schemes_for_Document_Clustering
- Theoretical project report: https://drive.google.com/file/d/1OIW3TIMq9uRz1FFarJ_ukKfHDp8RKipa/view?usp=sharing