Multi-label text classification project on Arxiv data using NLP

kelixirr/Arxiv34k4l

Arxiv34k4l - Multi-label Text Classification Project

Overview

Arxiv34k4l is a project that builds a multi-label text classification model using natural language processing (NLP) techniques. It uses data sourced from arXiv, a large collection of academic papers spanning many disciplines.

Objective

The project's main objective is to develop a model that classifies academic papers into multiple categories simultaneously, based on their abstracts. Automating this step reduces the workload of the human reviewers who are often involved in categorizing papers.

Dataset

The dataset used in this project consists of academic papers sourced from arXiv, covering topics such as computer science, AI, mathematics, and more. I preprocessed and annotated it for multi-label classification, so each paper is associated with one or more subject categories. The data collection process is also shown here. The dataset, Arxiv34k6L, contains abstracts and their categories (4 label types, 34,068 rows). The label set is limited to this size for simplicity and to avoid the highly imbalanced classes present in arXiv's data. Readers can, however, download and preprocess the data according to their own needs, as shown in the collection step.
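To make the "abstracts and their categories" format concrete, here is a minimal sketch of how one arXiv API response could be turned into a multi-label row. It is an illustration, not the project's actual collection code: the `LABELS` list, the helper names, and the sample feed are all assumptions made for this example, and the project's four labels may differ.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM = "{http://www.w3.org/2005/Atom}"  # namespace used by arXiv's Atom feed
API_URL = "http://export.arxiv.org/api/query"

# Hypothetical label set, for illustration only.
LABELS = ["cs.AI", "cs.CL", "cs.LG", "math.ST"]


def build_query_url(category, start=0, max_results=100):
    """Build an arXiv API query URL for one subject category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_feed(atom_xml):
    """Extract (abstract, categories) pairs from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    rows = []
    for entry in root.findall(f"{ATOM}entry"):
        # Collapse the line breaks and indentation inside the abstract.
        abstract = " ".join(entry.findtext(f"{ATOM}summary", default="").split())
        categories = [c.get("term") for c in entry.findall(f"{ATOM}category")]
        rows.append((abstract, categories))
    return rows


def multi_hot(categories, labels=LABELS):
    """Encode a paper's categories as a 0/1 vector over the kept labels."""
    return [1 if label in categories else 0 for label in labels]


# A made-up feed with the same shape as an API response (no network call).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>A made-up paper</title>
    <summary>
      We study a toy problem.
    </summary>
    <category term="cs.LG"/>
    <category term="cs.AI"/>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(build_query_url("cs.AI", max_results=5))
    for abstract, categories in parse_feed(SAMPLE_FEED):
        print(abstract, categories, multi_hot(categories))
```

Categories outside the kept label set are simply ignored by `multi_hot`, which is one simple way to restrict the data to a small number of labels.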

Methodology

The project employs various NLP techniques and machine learning algorithms to build an effective multi-label classification model. The methodology includes the following steps:

  1. Data Collection: The data used in this project is collected using the arXiv API.
  2. Data Preprocessing: Cleaning and preprocessing the text data for modeling.
  3. Model Selection: Evaluating and selecting appropriate machine learning models for multi-label classification, considering factors such as performance metrics and computational efficiency.
  4. Model Training: Training the selected model(s) on the preprocessed data to learn patterns and associations between input features and target labels.
  5. Evaluation: Assessing the performance of the trained model(s) using metrics such as accuracy, precision, recall, and F1-score.
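For the evaluation step, the metrics listed above need a multi-label definition, because each paper can have several correct labels. The sketch below shows one common choice, micro-averaged precision, recall, and F1 plus subset (exact-match) accuracy, in plain Python. It is an assumption about how the metrics could be computed, not the notebook's actual code; the function name and the toy vectors are made up for illustration.

```python
def multilabel_scores(y_true, y_pred):
    """Micro-averaged precision/recall/F1 and subset accuracy.

    y_true and y_pred are equal-length lists of 0/1 label vectors.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp = fp = fn = exact = 0
    for true_row, pred_row in zip(y_true, y_pred):
        # Subset accuracy counts a paper only if every label matches.
        exact += int(list(true_row) == list(pred_row))
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)

    # Micro-averaging pools the counts over all labels before dividing.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "subset_accuracy": exact / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy data: three papers, four labels each.
    y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
    y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
    for name, value in multilabel_scores(y_true, y_pred).items():
        print(f"{name}: {value:.3f}")
```

On the toy data, one label is missed and one is predicted wrongly, so micro precision, recall, and F1 all come to 0.8, while subset accuracy is only 1/3 because just one paper has every label right. That gap is why subset accuracy alone can understate a multi-label model's quality.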

Details related to the entire project can be found here. Check out the notebook for the results.

Contributors

License

This project is licensed under the MIT License.

Acknowledgments

If you notice a mistake or have a suggestion, please let me know.
