Clustering Algorithm Comparison

This project aims to compare different clustering algorithms and evaluate their performance using various metrics.

Introduction

Clustering is a technique used in machine learning and data analysis to group similar data points together. There are several clustering algorithms available, and each has its own strengths and weaknesses. In this project, we compare the performance of the following clustering algorithms:

K-Means
Hierarchical Clustering
DBSCAN

Algorithms

K-Means

K-means is a centroid-based clustering algorithm that aims to partition data points into K clusters. The algorithm works by first selecting K random points as centroids and then iteratively assigning data points to the closest centroid and updating the centroid position. The algorithm terminates when the centroids no longer move significantly.

Hierarchical Clustering

Hierarchical clustering is a bottom-up clustering algorithm that creates a dendrogram to represent the relationships between data points. The algorithm starts by treating each data point as a separate cluster and then repeatedly merges the closest pairs of clusters until only one cluster remains.

DBSCAN

DBSCAN is a density-based clustering algorithm that groups together data points that are closely packed together. The algorithm works by selecting a random data point and finding all the neighboring data points within a specified radius. The algorithm continues to expand the cluster by adding neighboring points until there are no more neighboring points within the specified radius.

Evaluation Metrics

To evaluate the performance of each clustering algorithm, we use the following metrics:

Silhouette Score
Davies-Bouldin Index
Calinski-Harabasz Index

Silhouette Score

The Silhouette Score measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to 1, where a score closer to 1 indicates that the object is well-matched to its own cluster and poorly-matched to neighboring clusters.

Davies-Bouldin Index

The Davies-Bouldin Index measures the similarity between clusters and the distance between clusters. The index ranges from 0 to infinity, where a lower score indicates better clustering.

Calinski-Harabasz Index

The Calinski-Harabasz Index measures the ratio of between-cluster variance to within-cluster variance. The index ranges from 0 to infinity, where a higher score indicates better clustering.

Usage

To run the comparison of clustering algorithms, follow these steps:

Clone this repository to your local machine. Install the required dependencies by running pip install -r requirements.txt. Open clustering_comparison.ipynb in Jupyter Notebook or JupyterLab. Follow the instructions in the notebook to load the data and compare the clustering algorithms.

Contributing

Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.

Improvements

Add clutering algorithm
Add weighted importance to the metrics for the computation of the final score of the model
Compute the difference between groups to see if some groups are closer to each other than others

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
modules		modules
.gitignore		.gitignore
Examples.ipynb		Examples.ipynb
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Clustering Algorithm Comparison

Table of Contents

Introduction

Algorithms

K-Means

Hierarchical Clustering

DBSCAN

Evaluation Metrics

Silhouette Score

Davies-Bouldin Index

Calinski-Harabasz Index

Usage

Contributing

Improvements

License

About

Releases

Packages

Languages

License

FrankyDBravo/ClusteringComparator

Folders and files

Latest commit

History

Repository files navigation

Clustering Algorithm Comparison

Table of Contents

Introduction

Algorithms

K-Means

Hierarchical Clustering

DBSCAN

Evaluation Metrics

Silhouette Score

Davies-Bouldin Index

Calinski-Harabasz Index

Usage

Contributing

Improvements

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages