GitHub - tanvir-ishraq/TextThema-Multi-Classifier: A multi-label text content classifier that can classify 103 different themes concurrently, given any Creative, poetry, Literary, Descriptive, synopsis, quotes, songs or Dialogues text content

TextThema - Themes Classifier & Recommender NLP System

A text content themes classifier and smart recommender system. This README documents the project from data collection through model training to deployment.
The model can classify 103 different themes from input text of any nature: creative, poetry, literary, descriptive, synopsis, quotes, songs, or dialogue.

The keys of deployment_hf\tag_types_encoded.json list all 103 theme types.
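
As a rough illustration, the theme names could be read from that file as sketched below. This assumes the file is a JSON object whose keys are the theme names (what the values hold is not stated here), and `load_theme_names` is a hypothetical helper, not a function from the repository.

```python
import json
from pathlib import Path


def load_theme_names(json_path):
    """Return the theme names stored as keys of a JSON object, sorted."""
    with Path(json_path).open(encoding="utf-8") as f:
        encoded = json.load(f)  # assumed shape: {"theme name": <encoding>, ...}
    return sorted(encoded.keys())
```

Calling `load_theme_names` on the repository's tag_types_encoded.json should then return the 103 theme names, if the assumed file shape holds.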

The project set out to cover a diverse range of objectives:

Detect nuances like humor and simile/metaphor.
Detect author intents and content themes.
Possibly detect and recommend related books, famous people/authors, fictional characters, and franchises from the input content, so you can explore new horizons according to your interests.
Automate or recommend hashtags.

Data Collection

Data was collected from the Goodreads quotes listing, which contains literary snippets, quotes, and poetry: https://www.goodreads.com/quotes
The dynamic website's unpredictable, time-based login prompts were handled during scraping. The data collection process was as follows:

The data was scraped with scraper/nlp_dataset_scraper.py and stored in scraper/quote-nlp-dataset-scraped.csv. I used an efficient progress-tracking system while scraping this large dataset to prevent data loss.
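
The tracking idea can be sketched roughly as follows. This is not the code in scraper/nlp_dataset_scraper.py; it is a minimal illustration that assumes page-by-page scraping, and `fetch_page`, `scrape_with_checkpoints`, and the state file are all hypothetical names.

```python
import csv
import json
from pathlib import Path


def scrape_with_checkpoints(fetch_page, page_numbers, csv_path, state_path):
    """Append rows page by page and record progress so a rerun can resume.

    fetch_page is a hypothetical callable: page number -> list of row tuples.
    """
    csv_path, state_path = Path(csv_path), Path(state_path)
    done = set()
    if state_path.exists():
        done = set(json.loads(state_path.read_text(encoding="utf-8")))

    for page in page_numbers:
        if page in done:
            continue  # already saved by an earlier run
        rows = fetch_page(page)
        # Write the rows first, then mark the page as done.
        with csv_path.open("a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        done.add(page)
        state_path.write_text(json.dumps(sorted(done)), encoding="utf-8")
    return done
```

If the run is interrupted (for example by a login prompt), restarting with the same two file paths skips every page already recorded in the state file.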

In total, I scraped ~43,000 samples in different content styles (quotes, poems, synopses, dialogue, stories) along with each sample's themes and tags. After cleaning, ~38,500 samples remained.

Data Preprocessing

Initially the dataset contained a large number of themes and tags. Analysis showed that many of them were rare (probably custom tags added by users), so I removed those and kept 103 themes and tags for the initial test. I then removed noisy samples introduced by scraping and dropped duplicates, resulting in ~38,500 samples.
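
The cleaning steps above could look roughly like the sketch below. The exact rule used to pick the 103 tags is not stated, so keeping the most frequent tags is an assumption, and `clean_samples` is a hypothetical helper written in plain Python rather than the notebook's actual code.

```python
from collections import Counter


def clean_samples(samples, keep_top=103):
    """samples: list of (text, [tags]). Keep frequent tags, drop noise and duplicates."""
    counts = Counter(tag for _, tags in samples for tag in tags)
    kept = {tag for tag, _ in counts.most_common(keep_top)}  # assumed criterion

    seen, cleaned = set(), []
    for text, tags in samples:
        text = " ".join(text.split())  # normalize whitespace left by scraping
        tags = [t for t in tags if t in kept]
        if not text or not tags or text in seen:
            continue  # empty text, no surviving tags, or duplicate
        seen.add(text)
        cleaned.append((text, tags))
    return cleaned
```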

Model Training

Fine-tuned roberta-base (RoBERTa), a transformer model from HuggingFace Transformers, with hyperparameter tuning using Fastai and Blurr. The model reaches ~93% accuracy. The training notebook can be viewed in the notebooks folder and here
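
The README does not say how accuracy is computed for a multi-label model. One common choice, similar in spirit to fastai's `accuracy_multi`, is to apply a sigmoid to each output, threshold it, and count matching (sample, label) slots. The sketch below assumes that definition and a 0.5 threshold; the helper names are illustrative.

```python
import math


def multi_hot(tags, label_to_index):
    """Encode a list of tags as a 0/1 vector over all known labels."""
    vec = [0] * len(label_to_index)
    for t in tags:
        vec[label_to_index[t]] = 1
    return vec


def accuracy_multi(logits, targets, threshold=0.5):
    """Fraction of (sample, label) slots where the thresholded sigmoid matches."""
    hits = total = 0
    for row_logits, row_targets in zip(logits, targets):
        for z, y in zip(row_logits, row_targets):
            pred = 1 if 1.0 / (1.0 + math.exp(-z)) > threshold else 0
            hits += int(pred == y)
            total += 1
    return hits / total
```

Under this definition the score counts every correctly predicted absent label as a hit, so with 103 mostly absent labels it tends to read higher than per-sample exact-match accuracy.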

Model Compression and ONNX Inference

The trained model takes up 900+ MB. I compressed it with ONNX quantization down to 125 MB, an 86.11% reduction in size.
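
The reduction figure follows from (1 - 125/900) × 100. The sketch below checks that arithmetic and shows one way the quantization step could be done with ONNX Runtime. Dynamic int8 quantization is an assumption (the README only says "ONNX quantization"), and the function names and file paths are hypothetical.

```python
def size_reduction_percent(original_mb, compressed_mb):
    """Percentage by which the file size shrank."""
    return (1 - compressed_mb / original_mb) * 100


def quantize_to_int8(fp32_onnx_path, int8_onnx_path):
    """Dynamically quantize an exported ONNX model's weights to int8."""
    # Imported lazily so the arithmetic helper works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(fp32_onnx_path, int8_onnx_path, weight_type=QuantType.QInt8)


print(f"{size_reduction_percent(900, 125):.2f}%")  # 86.11%
```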

Model Deployment

A Gradio app was built to deploy the compressed model on HuggingFace Spaces. The implementation can be found in the deployment folder or here
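
An app like this has to turn the model's raw outputs into readable theme names. The sketch below shows one plausible post-processing step, assuming one logit per theme and a 0.5 cut-off; `decode_themes` is a hypothetical helper, not the deployed code.

```python
import math


def decode_themes(logits, index_to_label, threshold=0.5):
    """Map raw model outputs to {label: probability}, highest probability first."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    picked = [(index_to_label[i], p) for i, p in enumerate(probs) if p >= threshold]
    picked.sort(key=lambda item: item[1], reverse=True)
    return dict(picked)
```

A dictionary of label to confidence is the shape that Gradio's `Label` output component accepts, so a result like this could be returned directly from the app's prediction function.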

Live Website Deployment

Deployed a Flask app that takes text content and shows the themes and recommendations as output. See the flask-deployment branch. The website is live here

