Text Similarity and Clustering

This project uses natural language processing (NLP) techniques to measure how similar captions are to one another, based on TF-IDF vectors and cosine similarity. It also clusters captions with DBSCAN and reports evaluation metrics such as the silhouette score and the Davies-Bouldin index.
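
To make the TF-IDF and cosine similarity idea concrete, here is a small dependency-free sketch. It is not the project's implementation: the function names `tfidf_vectors` and `cosine_similarity`, the whitespace tokenizer, and the sample captions are assumptions made for illustration. The IDF term uses the common smoothed form `ln((1 + n) / (1 + df)) + 1`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (plain TF-IDF, smoothed IDF)."""
    # Assumption: naive lowercase + whitespace tokenization, no stopword removal.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical captions, chosen so that two overlap and one does not.
captions = [
    "a dog runs on the beach",
    "a dog runs along the beach",
    "stock prices fell sharply today",
]
vectors = tfidf_vectors(captions)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(f"{sim_01:.2%} vs {sim_02:.2%}")
```

The first two captions share most of their words and score high, while the third shares none with the first and scores zero.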

Recommended Threshold

⚠️ Important: For optimal results, set the similarity threshold to 40%. This threshold balances precision and recall, so that similar captions are identified accurately without admitting too many false positives.
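
As a rough illustration of how such a threshold might be applied, the sketch below filters a pairwise similarity matrix. The name `pairs_above_threshold`, the matrix values, and the choice of an inclusive comparison (`>=`) are assumptions, not taken from the project's code.

```python
from itertools import combinations

THRESHOLD = 0.40  # the 40% recommended above, expressed as a fraction


def pairs_above_threshold(similarity, threshold=THRESHOLD):
    """Yield (i, j, score) for each unordered pair scoring at least `threshold`."""
    for i, j in combinations(range(len(similarity)), 2):
        if similarity[i][j] >= threshold:
            yield i, j, similarity[i][j]


# Hypothetical similarity matrix for three captions (symmetric, 1.0 on the diagonal).
similarity = [
    [1.00, 0.725, 0.10],
    [0.725, 1.00, 0.35],
    [0.10, 0.35, 1.00],
]
matches = list(pairs_above_threshold(similarity))
print(matches)  # [(0, 1, 0.725)]
```

Lowering the threshold to 0.30 would also admit the pair (1, 2), which shows how the setting trades recall against false positives.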

Features

Text Preprocessing: Removes special characters and stopwords, and performs lemmatization.
TF-IDF Vectorization: Converts text to vectors for comparison.
Cosine Similarity: Calculates similarity between pairs of captions.
DBSCAN Clustering: Groups similar captions.
Similarity Thresholds: Saves pairs of captions that meet similarity thresholds.
Memory Profiling: Tracks memory usage of the program.
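
The DBSCAN clustering feature can be sketched without any dependencies as follows. This is a minimal illustration and not the project's code: the `dbscan` function, the similarity values, and the `eps=0.6` and `min_samples=2` settings are all assumptions chosen for the example. Distances are taken as 1 minus cosine similarity, so `eps=0.6` roughly corresponds to a 40% similarity cutoff.

```python
def dbscan(distances, eps, min_samples):
    """Label each point with a cluster id; -1 marks noise.

    `distances` is a square matrix of pairwise distances,
    for example 1 - cosine similarity.
    """
    n = len(distances)
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * n

    def neighbors(i):
        # A point counts as its own neighbor (distance 0).
        return [j for j in range(n) if distances[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Hypothetical cosine similarities between five captions.
similarity = [
    [1.0, 0.8, 0.1, 0.2, 0.0],
    [0.8, 1.0, 0.2, 0.1, 0.1],
    [0.1, 0.2, 1.0, 0.7, 0.0],
    [0.2, 0.1, 0.7, 1.0, 0.1],
    [0.0, 0.1, 0.0, 0.1, 1.0],
]
distances = [[1.0 - s for s in row] for row in similarity]
labels = dbscan(distances, eps=0.6, min_samples=2)
print(labels)  # [0, 0, 1, 1, -1]
```

Captions 0 and 1 form one cluster, captions 2 and 3 form another, and caption 4 has no close neighbor and is marked as noise.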

Setup

Install the required Python packages:
```
pip install -r requirements.txt
```

Download the necessary NLTK resources (these are also downloaded automatically on every run):

```
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
```

How to Use

Input: Place your text data in a file named captions.txt.
Run the Script: Preprocess captions, calculate TF-IDF vectors, and find similar pairs of captions:
```
python main.py
```
Output: Similar caption pairs for each threshold (10% to 100%) are saved to result text files such as similar_pairs_10-100%.txt.

Sample Output file (similar_pairs_10-100%.txt):

Caption 0 and Caption 1 are 72.50% similar
Caption 0: This is a sample caption.
Caption 1: Another sample text.
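
For reference, one possible way to produce entries in this layout is sketched below. The helper name `format_pair` and its signature are hypothetical; only the output format is taken from the sample above.

```python
def format_pair(i, j, score, captions):
    """Render one similar pair in the layout of the sample output above."""
    return (
        f"Caption {i} and Caption {j} are {score:.2%} similar\n"
        f"Caption {i}: {captions[i]}\n"
        f"Caption {j}: {captions[j]}\n"
    )


captions = ["This is a sample caption.", "Another sample text."]
report = format_pair(0, 1, 0.725, captions)
print(report)
```

The `:.2%` format specifier multiplies the score by 100 and keeps two decimal places, which yields the `72.50%` shown in the sample.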

TODO

Instead of working with pairs, use group text clustering ✔️
Implement a method to support lists and dicts of texts ✔️
OOP optimization
Support multiple text files and texts ✔️
Run everything with a single method call ✔️

AmirUsefian/text-similarity-clustering