Text-Processing-Tool

Overview

This repository contains a Python tool that preprocesses text files from a sentiment analysis dataset. It uses NLTK and regular expressions to clean and tokenize the text and to remove stopwords. The processed files are then saved into separate folders by sentiment for further analysis.

Features

  • Reads and processes text files from input directories.
  • Cleans text by removing punctuation, converting to lowercase, and eliminating stopwords.
  • Saves processed text into separate positive and negative sentiment folders, matching the input directory each file came from.
  • Supports batch processing of text files within specified directories.

File Structure

A:.
├───NEGATIVO
├───POSITIVO
└───review_polarity
    └───txt_sentoken
        ├───neg
        └───pos

Requirements

  • Python 3.12 or higher
  • Libraries: nltk (third-party), re (part of the Python standard library, no installation needed)

Installation

  1. Clone this repository:
    git clone https://github.com/KPlanisphere/text-processing-tool.git
  2. Install the required dependencies:
    pip install nltk
  3. Download the NLTK stopwords corpus (run in a Python session):
    import nltk
    nltk.download('stopwords')
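
One way to confirm that steps 2 and 3 succeeded is to try loading the English stopword list. The sketch below is only an illustration: `load_english_stopwords` is a hypothetical helper name, not a function from this repository.

```python
def load_english_stopwords():
    """Return NLTK's English stopwords as a set, or None if unavailable.

    Hypothetical helper for checking the installation; not part of lab9.py.
    """
    try:
        # Fails with ImportError if nltk is not installed.
        from nltk.corpus import stopwords
        # Fails with LookupError if the stopwords corpus was not downloaded.
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        return None


words = load_english_stopwords()
if words is None:
    print("NLTK or its stopwords corpus is missing; repeat steps 2 and 3.")
else:
    print(f"Loaded {len(words)} English stopwords.")
```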

Usage

  1. Define the input and output directories for positive and negative text files:
    input_dir1 = r'path_to_positive_files'
    input_dir2 = r'path_to_negative_files'
    
    output_dir1 = r'path_to_output_positive_files'
    output_dir2 = r'path_to_output_negative_files'
  2. Run the script:
    python lab9.py

Example

  • Input Directory: review_polarity/txt_sentoken/
    • Positive: pos
    • Negative: neg
  • Output Directory:
    • Processed positive files: POSITIVO/
    • Processed negative files: NEGATIVO/
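
The mapping from input to output folders shown above could be implemented roughly as follows. This is a minimal sketch, not the code in `lab9.py`: the function name `process_directory`, the `transform` parameter, the `*.txt` file pattern, and the UTF-8 encoding are all assumptions made for illustration.

```python
from pathlib import Path


def process_directory(input_dir, output_dir, transform):
    """Apply `transform` to every .txt file in input_dir and write the
    result to a file of the same name in output_dir.

    Hypothetical helper; assumes .txt inputs that can be read as UTF-8.
    Returns the number of files written.
    """
    in_path = Path(input_dir)
    out_path = Path(output_dir)
    # Create the output folder (e.g. POSITIVO/) if it does not exist yet.
    out_path.mkdir(parents=True, exist_ok=True)

    count = 0
    for src in sorted(in_path.glob("*.txt")):
        # Undecodable bytes are replaced rather than aborting the batch.
        text = src.read_text(encoding="utf-8", errors="replace")
        (out_path / src.name).write_text(transform(text), encoding="utf-8")
        count += 1
    return count
```

With the layout above, a call such as `process_directory("review_polarity/txt_sentoken/pos", "POSITIVO", transform)` would write one processed file per input file, where `transform` is whatever cleaning function is applied to the text.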

How It Works

  1. File Loading: The script reads text files from the specified directories.
  2. Text Processing:
    • Removes punctuation.
    • Converts all text to lowercase.
    • Filters out English stopwords.
  3. Output: The cleaned text is saved into the matching output directory (e.g., POSITIVO/ for positive files and NEGATIVO/ for negative files).
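
The text processing step can be sketched as below. This is an approximation under stated assumptions, not the exact logic of `lab9.py`: the function name `clean_text` and the regular expression are illustrative, and `SAMPLE_STOPWORDS` is a tiny stand-in so the example runs without NLTK. In practice the stopword set would come from `nltk.corpus.stopwords.words('english')`.

```python
import re

# Tiny stand-in list for illustration only (assumption). The real tool uses
# NLTK's English stopword corpus, which is much larger.
SAMPLE_STOPWORDS = {"a", "and", "is", "the", "this", "was"}


def clean_text(text, stopwords):
    """Lowercase the text, strip punctuation, and drop stopwords.

    Returns the remaining tokens as a list of strings.
    """
    lowered = text.lower()
    # Replace every character that is not a word character (letter, digit,
    # underscore) or whitespace with a space.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Splitting on whitespace serves as a simple tokenizer here.
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in stopwords]


tokens = clean_text("This movie was GREAT, and the plot is a delight!",
                    SAMPLE_STOPWORDS)
print(" ".join(tokens))  # prints: movie great plot delight
```

Joining the returned tokens with spaces, as in the last line, gives a string that can be written to the output file.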

Scripts

lab9.py

The main script that:

  • Processes files from the sentiment dataset.
  • Categorizes and saves cleaned text.