Repository for the 4th homework of the course ADM @ Sapienza University of Rome, by group #23: Francisca Alliende, Giuseppe Calabrese and Francesco Russo.
Below is a summary of the files in this repository. To open a file, click the link on its name.
- Homework_4: Jupyter Notebook with the code and comments for the entire homework.
- scraper.py: functions related to the scraping process.
- preprocessing.py: functions related to the preprocessing process.
- matrixbuilder.py: functions that build the information and the description matrices.
- clustering.py: functions for k-means++, the Elbow Method and Jaccard similarity.
- wordcloudgenerator.py: wordcloud generator function.
- mykmeans.py: homemade k-means algorithm.
- datasetindex.csv: dataset with all the announcements collected by the scraping process.
- datasetIndex_preprocessed.csv: dataset that contains the data from "datasetindex.csv" after preprocessing.
- datastIndex_infmatrix.csv: dataset with the information matrix. Input for clustering.
- datastIndex_tfidf.csv: dataset with the description matrix. Input for clustering. Not available in this repository because of its file size.
All the code and comments for these parts are contained in the main file, Homework_4.
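
For readers unfamiliar with the techniques named in the file list (k-means++, the Elbow Method and Jaccard similarity), here is a minimal pure-Python sketch of how they fit together. It is not the code from clustering.py or mykmeans.py: the function names (`kmeans_pp_init`, `kmeans`, `jaccard`), the parameters and the toy points are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: spread the initial centroids across the data."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # All points coincide with a centroid: fall back to a uniform pick.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's iterations after k-means++ seeding; returns (labels, inertia)."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    labels = assign(points, centroids)
    # Inertia: sum of squared distances to the assigned centroid.
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Elbow Method: compute the inertia for several k and look for the bend.
inertias = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
labels, _ = kmeans(points, 2)

# Jaccard similarity between two sets of row indices.
similarity = jaccard({0, 1, 2}, {1, 2, 3})  # 2 shared / 4 total = 0.5
```

In this sketch the Elbow Method amounts to plotting `inertias` against k and choosing the value where the drop flattens out, while `jaccard` can compare the members of a cluster obtained from one matrix with those of a cluster obtained from the other.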