landrydipanda/Big-Data-Project

Big-Data-Project

This project is part of the "Big Data" module taught in the fifth year of our engineering training.

The goal of the project is to build a document classifier with the k-means++ clustering algorithm, starting from a data lake of labeled documents.

Big Data / machine learning

- Preprocessing / cleaning of the corpora
- Feature engineering
- Feature selection
- Vectorization
- Clustering using k-means++ in Spark
- Cluster evaluation metrics: purity, silhouette score
- Optimization of the metric results
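To make the clustering step more concrete, here is a minimal sketch of the k-means++ seeding rule in plain Python. It is an illustration only, not the Spark code used in the project: the function names `squared_distance` and `kmeans_pp_init` and the toy 2-D points are assumptions. The idea is that each new center is drawn with probability proportional to its squared distance from the nearest center already chosen.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from `points` with the k-means++ rule.

    Hypothetical helper for illustration; `points` is a list of tuples.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of a point = squared distance to its nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # All remaining points coincide with a center: fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Far-away points are more likely to become the next center.
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Toy data (made up): three loose groups of 2-D points.
toy_points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.8, 5.0),
    (9.0, 0.2), (9.1, 0.0), (8.9, 0.1),
]
initial_centers = kmeans_pp_init(toy_points, k=3, seed=42)
```

The returned centers would then be used as the starting point for the usual k-means iterations. In a real pipeline the points would be the document vectors produced by the vectorization step.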

Data Visualization

- Association of metadata with each document
- Preparation of CSV exports for visualization in the TensorFlow Embedding Projector: https://projector.tensorflow.org/
- Convergence of clusters
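The export step can be sketched as follows. This is a hedged example, not the project's actual export code: the helper name `write_projector_files`, the column names, and the sample rows are all made up. It assumes the projector's load dialog expects tab-separated files, one holding the vectors (no header) and one holding the metadata (with a header row when there are several columns), so the sketch writes TSV with the standard `csv` module.

```python
import csv
import io


def write_projector_files(vectors, metadata_rows, metadata_header, vec_out, meta_out):
    """Write document vectors and their metadata as tab-separated text.

    Hypothetical helper: `vec_out` and `meta_out` are writable text streams.
    Row i of the metadata must describe row i of the vectors.
    """
    if len(vectors) != len(metadata_rows):
        raise ValueError("vectors and metadata must have the same number of rows")
    # Vectors file: one document per line, one dimension per column, no header.
    vec_writer = csv.writer(vec_out, delimiter="\t", lineterminator="\n")
    for vector in vectors:
        vec_writer.writerow(vector)
    # Metadata file: header row first, then one line per document.
    meta_writer = csv.writer(meta_out, delimiter="\t", lineterminator="\n")
    meta_writer.writerow(metadata_header)
    for row in metadata_rows:
        meta_writer.writerow(row)


# Made-up sample: two documents with 2-D vectors and a cluster id.
vec_buffer = io.StringIO()
meta_buffer = io.StringIO()
write_projector_files(
    vectors=[[0.1, 0.2], [0.3, 0.4]],
    metadata_rows=[["Doc A", 0], ["Doc B", 1]],
    metadata_header=["title", "cluster"],
    vec_out=vec_buffer,
    meta_out=meta_buffer,
)
```

Replacing the `StringIO` buffers with files opened via `open(path, "w", newline="")` would produce files that can be loaded by hand in the projector.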

Remarks

Each trained cluster represents one class of the data lake. How well the clusters match the true classes is evaluated with the purity metric.
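As an illustration of how purity is usually computed, here is a small sketch in Python. Purity is the fraction of documents that carry the majority label of the cluster they were assigned to, so it ranges from 0 to 1. The function name `purity` and the sample assignments are assumptions made for this example and are not taken from the project data.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of items whose label is the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one item is required")
    # Count how often each true label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts[cluster][label] += 1
    # For every cluster, keep only the size of its majority label.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(cluster_ids)


# Made-up example: 10 documents, 3 clusters, 3 true classes.
example_clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
example_labels = ["sport", "sport", "tech", "tech", "tech", "tech",
                  "law", "law", "law", "sport"]
example_purity = purity(example_clusters, example_labels)  # (2 + 3 + 3) / 10 = 0.8
```

Note that purity rises toward 1 as the number of clusters grows, which is one reason to read it alongside a second metric such as the silhouette score.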

Results

- Purity: 0.9 (on a scale from 0 to 1)

Documents

- Scientific paper used: https://www.researchgate.net/publication/236124137_Information-theoretic_Term_Weighting_Schemes_for_Document_Clustering
- Theoretical project report: https://drive.google.com/file/d/1OIW3TIMq9uRz1FFarJ_ukKfHDp8RKipa/view?usp=sharing
