Arxiv34k4l is a project that builds a multi-label text classification model using natural language processing (NLP) techniques. It uses data sourced from arXiv, an open-access repository of academic papers spanning many disciplines.
The project's main objective is to develop a model that assigns academic papers to multiple categories simultaneously, based on their abstracts. Automating this step reduces the workload of the human reviewers who are often involved in categorization.
The dataset used in this project consists of academic papers sourced from arXiv. It covers a diverse range of topics such as computer science, AI, mathematics, and more. I preprocessed the dataset and annotated it for multi-label classification, with each paper associated with one or more subject categories. The data collection process is also shown here. The dataset Arxiv34k6L contains abstracts and their categories (4 label types, 34,068 rows). The number of labels has been limited for simplicity and to avoid the highly imbalanced classes present in arXiv's data. However, readers can download and preprocess the data according to their own needs, as shown in the collection step.
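As a rough illustration of the collection and annotation steps, the sketch below builds an arXiv API query URL, parses abstracts and category terms out of an Atom response, and turns the categories into a multi-hot vector. It is a minimal sketch and not the project's actual code: the label set `LABELS`, the helper names, and the embedded sample feed are assumptions made for illustration, and the network request itself is left out.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the arXiv API
API_URL = "http://export.arxiv.org/api/query"

# Hypothetical label set; the project's real four labels may differ.
LABELS = ["cs.AI", "cs.CV", "cs.LG", "stat.ML"]


def build_query_url(category, start=0, max_results=100):
    """Build an arXiv API URL that lists papers in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml):
    """Extract (abstract, categories) pairs from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        summary = entry.findtext(f"{ATOM}summary", default="")
        abstract = " ".join(summary.split())  # collapse line breaks and extra spaces
        categories = [c.get("term") for c in entry.findall(f"{ATOM}category")]
        papers.append((abstract, categories))
    return papers


def multi_hot(categories, labels=LABELS):
    """Encode a paper's categories as a 0/1 vector over the kept labels."""
    return [1 if label in categories else 0 for label in labels]


# Made-up sample feed standing in for a real API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An example paper</title>
    <summary>  We study an example
    problem.  </summary>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
    <category term="stat.ML" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(build_query_url("cs.LG", max_results=5))
    for abstract, categories in parse_feed(SAMPLE_FEED):
        print(abstract, multi_hot(categories))
```

Categories outside the kept label set are simply ignored by `multi_hot`, which is one way to restrict the data to a small number of labels; papers whose vector ends up all zeros would then need to be dropped in a separate filtering step.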
The project employs various NLP techniques and machine learning algorithms to build an effective multi-label classification model. The methodology includes the following steps:
- Data Collection: The data used in this project is collected using the arXiv API.
- Data Preprocessing: Cleaning and preprocessing the text data for modeling.
- Model Selection: Evaluating and selecting appropriate machine learning models for multi-label classification, considering factors such as performance metrics and computational efficiency.
- Model Training: Training the selected model(s) on the preprocessed data to learn patterns and associations between input features and target labels.
- Evaluation: Assessing the performance of the trained model(s) using metrics such as accuracy, precision, recall, and F1-score.
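The evaluation step above can be sketched in plain Python. The list names the metrics but not how they are averaged across labels, so this example assumes micro-averaging (pooling true positives, false positives, and false negatives over all labels) and reports subset accuracy (a prediction counts only if every label matches). The function name `multilabel_scores` and the toy vectors are hypothetical.

```python
def multilabel_scores(y_true, y_pred):
    """Compute subset accuracy and micro-averaged precision, recall, and F1.

    y_true and y_pred are equal-length lists of 0/1 label vectors.
    """
    tp = fp = fn = 0
    exact = 0
    for true_row, pred_row in zip(y_true, y_pred):
        exact += int(true_row == pred_row)  # all labels must match
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "subset_accuracy": exact / len(y_true) if y_true else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy example with four labels per paper (made-up values).
    y_true = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 1]]
    y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 1, 1]]
    print(multilabel_scores(y_true, y_pred))
```

Subset accuracy is strict for multi-label problems, so it is usually read alongside the micro-averaged scores rather than on its own.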
Details related to the entire project can be found here. Check out the notebook for the results.
- Amritesh Kumar - Project Lead & Developer
This project is licensed under the MIT License.
- Thank you to arXiv for the use of its open-access interoperability.
- Special thanks to Sayak Paul and Soumik Rakshit, whose Large-scale multi-label text classification example on Keras inspired this project.
If you notice any mistakes or have suggestions, do let me know.