Big-Data-Project

This project is part of the "Big Data" module taught in the fifth year of our engineering training.

The goal of the project is to build a document classifier with the k-means++ clustering algorithm, starting from a data lake of labeled documents.

Dataset: https://www.kaggle.com/nzalake52/new-york-times-articles

Big Data / machine learning

- Preprocessing / cleaning of the corpora
- Feature engineering
- Feature selection
- Vectorization
- Clustering using k-means++ in Spark
- Cluster evaluation metrics: purity, silhouette score
- Optimization of the metric results
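To make the "++" in k-means++ concrete, here is a minimal sketch of the seeding step in plain Python. It is not the Spark code used in this project; the function names `squared_distance` and `kmeans_pp_seeds` are hypothetical and chosen for illustration only. The idea is that each new initial center is sampled with probability proportional to its squared distance from the nearest center already chosen, which tends to spread the starting centers out.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, seed=0):
    """Pick k initial centers with k-means++ seeding (hypothetical helper).

    Assumes `points` contains at least k distinct vectors.
    """
    rng = random.Random(seed)
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        # Next center: sampled proportionally to that squared distance,
        # so points far from every existing center are favored.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


if __name__ == "__main__":
    docs = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
    print(kmeans_pp_seeds(docs, k=2, seed=1))
```

After seeding, the usual k-means iterations (assign each point to its nearest center, then recompute the centers) run unchanged. In a Spark setting the seeding is normally handled by the library rather than written by hand.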

Data Visualization

- Association of metadata with each document
- Preparation of CSV exports for visualization in the TensorFlow Embedding Projector: https://projector.tensorflow.org/
- Convergence of clusters
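The export step can be sketched as follows. The Embedding Projector's load dialog takes tab-separated files (one file of vectors, one of metadata), so this sketch writes TSV rather than comma-separated values. The function name `export_for_projector` and the metadata columns `doc_id` and `cluster` are assumptions for illustration, not names taken from this repository.

```python
import csv


def export_for_projector(vectors, metadata, vectors_path, metadata_path):
    """Write document vectors and their metadata as two TSV files.

    vectors:  list of equal-length numeric sequences, one per document.
    metadata: list of (doc_id, cluster) pairs, in the same order as vectors.
    """
    if len(vectors) != len(metadata):
        raise ValueError("need exactly one metadata row per vector")
    with open(vectors_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(vectors)  # the vectors file has no header row
    with open(metadata_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        # A header row is written because the metadata has several columns.
        writer.writerow(["doc_id", "cluster"])
        writer.writerows(metadata)


if __name__ == "__main__":
    export_for_projector(
        vectors=[[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]],
        metadata=[("doc-0", 0), ("doc-1", 1)],
        vectors_path="vectors.tsv",
        metadata_path="metadata.tsv",
    )
```

Both files can then be loaded through the projector's "Load" panel. Coloring the points by the `cluster` column gives a quick visual check of how well the clusters separate.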

Remarks

Each trained cluster represents one class of the data lake. The accuracy of the classes is evaluated using the purity metric.
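As an illustration of how purity is computed, here is a minimal plain-Python sketch. The function name `purity` and its arguments are hypothetical and do not come from this repository. Each cluster is credited with the count of its most frequent true label, and the sum of those counts is divided by the total number of documents.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of documents that carry the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")
    # Count how often each true label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts[cluster][label] += 1
    # Each cluster contributes the size of its largest label group.
    majority_total = sum(max(c.values()) for c in label_counts.values())
    return majority_total / len(cluster_ids)


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    labels = ["sports", "sports", "politics", "politics", "politics", "politics"]
    print(purity(clusters, labels))  # (2 + 3) / 6
```

Purity ranges from 0 to 1, and it rises trivially as the number of clusters grows (one cluster per document gives 1.0), which is one reason to read it alongside the silhouette score.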

Results

- Purity: 0.9 (on a scale from 0 to 1)

Documents

- Scientific paper used: https://www.researchgate.net/publication/236124137_Information-theoretic_Term_Weighting_Schemes_for_Document_Clustering
- Theoretical project report: https://drive.google.com/file/d/1OIW3TIMq9uRz1FFarJ_ukKfHDp8RKipa/view?usp=sharing
