Arxiv34k4l - Multi-label Text Classification Project

Overview

Arxiv34k4l is a project aimed at building a multi-label text classification model using natural language processing (NLP) techniques. The project uses data sourced from arXiv, which hosts a large collection of academic papers spanning many disciplines.

Objective

The project's main objective is to develop a model that can classify academic papers into multiple categories simultaneously based on their abstracts. Automating this step reduces the workload of the human reviewers who often perform it.

Dataset

The dataset used in this project consists of academic papers sourced from arXiv. It covers a diverse range of topics such as computer science, AI, mathematics, and more. I preprocessed the dataset and annotated it for multi-label classification, with each paper associated with one or more subject categories. The data collection process is also shown here. The dataset, Arxiv34k4l, contains abstracts and their categories (4 label types, 34,068 rows). It has been limited to this many labels for simplicity and to avoid the highly imbalanced classes present in arXiv's data. However, readers can download and preprocess the data according to their own needs, as shown in the collection step.
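For readers who want to collect their own data, the following is a minimal sketch of querying the arXiv API and extracting abstracts with their category labels. It is not the project's collection code: the helper names `build_query_url` and `parse_feed` and the sample feed are made up for illustration, and only the standard library is used. The parsing step runs on an in-memory sample so that the example works without network access.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# The arXiv API returns Atom feeds, so elements live in the Atom namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(category, start=0, max_results=100):
    """Build an arXiv API URL for one subject category (hypothetical helper)."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_text):
    """Turn an Atom feed into rows of {abstract, labels} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", ATOM_NS):
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM_NS)
        labels = [c.get("term") for c in entry.findall("atom:category", ATOM_NS)]
        # Collapse line breaks and repeated spaces inside the abstract.
        rows.append({"abstract": " ".join(summary.split()), "labels": labels})
    return rows


# A made-up feed with the same shape as an API response, for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example paper</title>
    <summary>  A made-up abstract
    used only for illustration. </summary>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(build_query_url("cs.AI", max_results=5))
    print(parse_feed(SAMPLE_FEED))
```

To fetch real data, the URL from `build_query_url` could be passed to `urllib.request.urlopen` and the response body handed to `parse_feed`. Paging through results with `start` and pausing between requests is advisable for larger downloads.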

Methodology

The project employs various NLP techniques and machine learning algorithms to build an effective multi-label classification model. The methodology includes the following steps:

Data Collection: The data used in this project is collected using the arXiv API.
Data Preprocessing: Cleaning and preprocessing the text data for modeling.
Model Selection: Evaluating and selecting appropriate machine learning models for multi-label classification, considering factors such as performance metrics and computational efficiency.
Model Training: Training the selected model(s) on the preprocessed data to learn patterns and associations between input features and target labels.
Evaluation: Assessing the performance of the trained model(s) using metrics such as accuracy, precision, recall, and F1-score.
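
The evaluation step can be illustrated with a small sketch, assuming each paper's labels are encoded as a multi-hot vector. It is not the project's evaluation code: the class names, the example predictions, and the helper functions are hypothetical, and the choice of micro-averaged precision, recall, and F1 plus subset (exact-match) accuracy is one common convention for multi-label problems rather than something the project states.

```python
def multi_hot(label_sets, classes):
    """Encode each list of labels as a 0/1 vector over `classes`."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows


def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all (paper, label) pairs."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def subset_accuracy(y_true, y_pred):
    """Fraction of papers whose predicted label set matches exactly."""
    exact = sum(t == p for t, p in zip(y_true, y_pred))
    return exact / len(y_true)


# Hypothetical classes and predictions, for illustration only.
CLASSES = ["cs.AI", "cs.LG", "math.ST", "stat.ML"]
true_labels = [["cs.AI", "cs.LG"], ["math.ST"], ["stat.ML", "cs.LG"]]
pred_labels = [["cs.AI"], ["math.ST"], ["stat.ML", "cs.LG", "cs.AI"]]

y_true = multi_hot(true_labels, CLASSES)
y_pred = multi_hot(pred_labels, CLASSES)

if __name__ == "__main__":
    print(micro_scores(y_true, y_pred))
    print(subset_accuracy(y_true, y_pred))
```

Subset accuracy is strict, since a paper counts as correct only when every label matches, so it is usually read alongside the micro-averaged scores rather than on its own.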

Details related to the entire project can be found here. Check out the notebook for the results.

Contributors

Amritesh Kumar - Project Lead & Developer

License

This project is licensed under the MIT License.

Acknowledgments

Thank you to arXiv for the use of its open-access interoperability.
Special thanks to Sayak Paul and Soumik Rakshit, whose Large-scale multi-label text classification example on Keras inspired this project.

If you notice any mistake or wish to suggest something, do let me know.

kelixirr/Arxiv34k4l
