Skip to content

Algrithmo to test and compare different unsupervised clustering algorithm

License

Notifications You must be signed in to change notification settings

FrankyDBravo/ClusteringComparator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Clustering Algorithm Comparison

This project aims to compare different clustering algorithms and evaluate their performance using various metrics.

Table of Contents

Introduction Algorithms Evaluation Metrics Usage Contributing License

Introduction

Clustering is a technique used in machine learning and data analysis to group similar data points together. There are several clustering algorithms available, and each has its own strengths and weaknesses. In this project, we compare the performance of the following clustering algorithms:

  • K-Means
  • Hierarchical Clustering
  • DBSCAN

Algorithms

K-Means

K-means is a centroid-based clustering algorithm that aims to partition data points into K clusters. The algorithm works by first selecting K random points as centroids and then iteratively assigning data points to the closest centroid and updating the centroid position. The algorithm terminates when the centroids no longer move significantly.

Hierarchical Clustering

Hierarchical clustering is a bottom-up clustering algorithm that creates a dendrogram to represent the relationships between data points. The algorithm starts by treating each data point as a separate cluster and then repeatedly merges the closest pairs of clusters until only one cluster remains.

DBSCAN

DBSCAN is a density-based clustering algorithm that groups together data points that are closely packed together. The algorithm works by selecting a random data point and finding all the neighboring data points within a specified radius. The algorithm continues to expand the cluster by adding neighboring points until there are no more neighboring points within the specified radius.

Evaluation Metrics

To evaluate the performance of each clustering algorithm, we use the following metrics:

  • Silhouette Score
  • Davies-Bouldin Index
  • Calinski-Harabasz Index

Silhouette Score

The Silhouette Score measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to 1, where a score closer to 1 indicates that the object is well-matched to its own cluster and poorly-matched to neighboring clusters.

Davies-Bouldin Index

The Davies-Bouldin Index measures the similarity between clusters and the distance between clusters. The index ranges from 0 to infinity, where a lower score indicates better clustering.

Calinski-Harabasz Index

The Calinski-Harabasz Index measures the ratio of between-cluster variance to within-cluster variance. The index ranges from 0 to infinity, where a higher score indicates better clustering.

Usage

To run the comparison of clustering algorithms, follow these steps:

Clone this repository to your local machine. Install the required dependencies by running pip install -r requirements.txt. Open clustering_comparison.ipynb in Jupyter Notebook or JupyterLab. Follow the instructions in the notebook to load the data and compare the clustering algorithms.

Contributing

Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.

Improvements

  • Add clutering algorithm
  • Add weighted importance to the metrics for the computation of the final score of the model
  • Compute the difference between groups to see if some groups are closer to each other than others

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

Algrithmo to test and compare different unsupervised clustering algorithm

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published