This repository contains a Python-based tool for text preprocessing and categorization, specifically designed to process and clean text files from a sentiment analysis dataset. The program uses NLTK and regular expressions to clean, tokenize, and remove stopwords from the text. The processed data is then saved into categorized folders for further analysis.
- Reads and processes text files from input directories.
- Cleans text by removing punctuation, converting to lowercase, and eliminating stopwords.
- Categorizes processed text into positive and negative sentiment folders.
- Supports batch processing of text files within specified directories.
A:.
├───NEGATIVO
├───POSITIVO
└───review_polarity
└───txt_sentoken
├───neg
└───pos
- Python 3.12 or higher
- Libraries: nltk, re (re is part of the Python standard library)
- Clone this repository:
git clone https://github.com/KPlanisphere/text-processing-tool.git
- Install the required dependencies:
pip install nltk
- Download the NLTK stopwords corpus:
import nltk
nltk.download('stopwords')
- Define the input and output directories for positive and negative text files:
input_dir1 = r'path_to_positive_files'
input_dir2 = r'path_to_negative_files'
output_dir1 = r'path_to_output_positive_files'
output_dir2 = r'path_to_output_negative_files'
- Run the script:
python lab9.py
- Input Directory: review_polarity/txt_sentoken/
  - Positive: pos
  - Negative: neg
- Output Directory:
  - Processed positive files: POSITIVO/
  - Processed negative files: NEGATIVO/
- File Loading: The script reads text files from the specified directories.
- Text Processing (see the sketch after this list):
- Removes punctuation.
- Converts all text to lowercase.
- Filters out English stopwords.
- Output: The cleaned text is saved into the corresponding output directory (POSITIVO/ for positive reviews, NEGATIVO/ for negative reviews).
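A minimal sketch of the cleaning step, assuming the NLTK stopwords corpus has already been downloaded; the helper name clean_text is illustrative and not necessarily the name used in lab9.py:

```python
import re
from nltk.corpus import stopwords

# English stopword set from NLTK (requires nltk.download('stopwords') beforehand).
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """Strip punctuation, lowercase the text, and drop English stopwords."""
    text = re.sub(r'[^\w\s]', '', text)   # remove punctuation
    words = text.lower().split()          # lowercase and split on whitespace
    return ' '.join(w for w in words if w not in STOPWORDS)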
The main script, lab9.py, does the following (a rough sketch of its batch loop follows the list):
- Processes files from the sentiment dataset.
- Categorizes and saves the cleaned text.
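The batch loop could look roughly like the sketch below. It reuses the clean_text helper from the earlier sketch and the input_dir*/output_dir* variables defined in the configuration step; the function name process_directory and the .txt filter are assumptions for illustration.

```python
import os

def process_directory(input_dir, output_dir):
    """Clean every .txt file in input_dir and write the result to output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    for filename in os.listdir(input_dir):
        if not filename.endswith('.txt'):
            continue
        with open(os.path.join(input_dir, filename), encoding='utf-8') as src:
            cleaned = clean_text(src.read())
        with open(os.path.join(output_dir, filename), 'w', encoding='utf-8') as dst:
            dst.write(cleaned)

# Positive and negative reviews go to their respective output folders.
process_directory(input_dir1, output_dir1)  # e.g., pos -> POSITIVO/
process_directory(input_dir2, output_dir2)  # e.g., neg -> NEGATIVO/
```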