This project aims to compare different clustering algorithms and evaluate their performance using various metrics.
Introduction Algorithms Evaluation Metrics Usage Contributing License
Clustering is a technique used in machine learning and data analysis to group similar data points together. There are several clustering algorithms available, and each has its own strengths and weaknesses. In this project, we compare the performance of the following clustering algorithms:
- K-Means
- Hierarchical Clustering
- DBSCAN
K-means is a centroid-based clustering algorithm that aims to partition data points into K clusters. The algorithm works by first selecting K random points as centroids and then iteratively assigning data points to the closest centroid and updating the centroid position. The algorithm terminates when the centroids no longer move significantly.
Hierarchical clustering is a bottom-up clustering algorithm that creates a dendrogram to represent the relationships between data points. The algorithm starts by treating each data point as a separate cluster and then repeatedly merges the closest pairs of clusters until only one cluster remains.
DBSCAN is a density-based clustering algorithm that groups together data points that are closely packed together. The algorithm works by selecting a random data point and finding all the neighboring data points within a specified radius. The algorithm continues to expand the cluster by adding neighboring points until there are no more neighboring points within the specified radius.
To evaluate the performance of each clustering algorithm, we use the following metrics:
- Silhouette Score
- Davies-Bouldin Index
- Calinski-Harabasz Index
The Silhouette Score measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to 1, where a score closer to 1 indicates that the object is well-matched to its own cluster and poorly-matched to neighboring clusters.
The Davies-Bouldin Index measures the similarity between clusters and the distance between clusters. The index ranges from 0 to infinity, where a lower score indicates better clustering.
The Calinski-Harabasz Index measures the ratio of between-cluster variance to within-cluster variance. The index ranges from 0 to infinity, where a higher score indicates better clustering.
To run the comparison of clustering algorithms, follow these steps:
Clone this repository to your local machine. Install the required dependencies by running pip install -r requirements.txt. Open clustering_comparison.ipynb in Jupyter Notebook or JupyterLab. Follow the instructions in the notebook to load the data and compare the clustering algorithms.
Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
- Add clutering algorithm
- Add weighted importance to the metrics for the computation of the final score of the model
- Compute the difference between groups to see if some groups are closer to each other than others
This project is licensed under the MIT License - see the LICENSE file for details.