Python Text and File Processing Utilities

A comprehensive Python library of utilities for text processing, file manipulation, and data extraction, originally built to help coworkers work with large data files. The toolkit covers string operations, file splitting, log processing, and handling of several specialized file formats.

Features

Text and File Processing

  • String search and position finding with support for regular expressions
  • File splitting with customizable size and format options
  • Log file processing with timestamp and message extraction
  • Indentation conversion between 2 and 4 spaces
  • String replacement across multiple files with backup creation
  • Excel column letter/header conversion with support for complex spreadsheets
  • HDF5 file analysis and variable extraction
  • Empty column filtering for spreadsheet optimization

Installation

  1. Clone this repository:

     git clone [repository-url]

  2. Install required dependencies:

     pip install numpy pandas h5py xlsxwriter

Components

Text Processing (string_manipulator.py)

The Text class provides core text manipulation and file operations with support for various file formats and encoding types.

from string_manipulator import Text

# Initialize with text content (input_text is the str contents of the file to process)
text = Text(input_text)

# Find occurrences and positions of a string
count, positions = text.find_string("search_term")
# Returns: (3, [0, 45, 72]) # Example: 3 occurrences at positions 0, 45, and 72

# Split files with custom configuration
text.divide_by_lines(
    No_lines=1000,  # Total number of lines in input file
    divfiles=10,    # Number of output files desired
    folder='output',# Output directory
    ext='dat'       # Output file extension
)

# Split by maximum lines per file
text.split_by_lines(
    divlines=100,   # Maximum lines per output file
    ext='txt',      # Output file extension
    folder='output' # Output directory
)

# Split by file size with custom naming
text.split_by_size(
    size=1024,      # Size in bytes
    ext='txt',      # Output file extension
    folder='output',# Output directory
    fname='split'   # Base filename for output files
)
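
The feature list mentions regular-expression search. If find_string accepts a pattern, the equivalent using only the standard re module would look like this (an illustrative sketch, not the library's confirmed API):

import re

# Find each match and its starting position using a regular expression
matches = [(m.start(), m.group()) for m in re.finditer(r"search_\w+", input_text)]
# Example result: [(0, 'search_term'), (45, 'search_term'), ...]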

Indentation Converter (indent_converter.py)

Convert between 2-space and 4-space indentation in Python files with backup creation and error handling.

from indent_converter import convert_indentation

# Convert to 2 spaces
convert_indentation(
    "path/to/file.py",
    to_two_spaces=True  # False for 4-space conversion
)

# Error handling example
try:
    convert_indentation("path/to/file.py", to_two_spaces=True)
except Exception as err:
    print(f"Error processing file: {str(err)}")

Configuration Options:

  • to_two_spaces: Boolean flag for conversion direction
  • Automatic backup creation with .bak extension (restore example below)
  • Preserves empty lines and comments
  • Handles mixed indentation gracefully
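
Because the converter writes a backup before modifying a file, reverting is straightforward (paths are placeholders; this assumes the .bak file sits alongside the original):

import shutil

# Restore the original file from the automatic backup
shutil.copy("path/to/file.py.bak", "path/to/file.py")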

Excel Column Utilities (letters_and_headers.py)

Comprehensive utilities for working with Excel column letters and headers, supporting complex spreadsheet operations.

from letters_and_headers import (
    getLetters, 
    getHeaders, 
    colLetter_to_Num,
    colHeader_to_Letter,
    stringOfLettersList
)
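# `dataframe` below is assumed to be a pandas DataFrame whose first two
# columns are 'Name' and 'Age'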

# Convert headers to Excel column letters
letters = getLetters(['Name', 'Age'], dataframe)
# Returns: {'Name': 'A', 'Age': 'B'}

# Convert letters to headers
headers = getHeaders(['A', 'B'], dataframe)
# Returns: {'A': 'Name', 'B': 'Age'}

# Convert column letter to number (zero-based)
column_num = colLetter_to_Num('AA')
# Returns: 26 (zero-based index)

# Get formatted string of column letters
letters_str = stringOfLettersList(['Name', 'Age'], dataframe)
# Returns: "A, B"

# Convert multiple headers to letter format
column_letters = colHeader_to_Letter(['Name', 'Age'], dataframe)
# Returns: "A, B"
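
For reference, colLetter_to_Num follows standard base-26 column arithmetic. A minimal stand-alone sketch of that conversion (illustrative only; the library's own implementation may differ):

# Convert an Excel column letter to a zero-based index (hypothetical helper)
def letter_to_index(letters: str) -> int:
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord('A') + 1)
    return index - 1  # zero-based, so 'A' -> 0

assert letter_to_index('A') == 0
assert letter_to_index('Z') == 25
assert letter_to_index('AA') == 26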

HDF5 File Processing (hdf5.py)

Advanced tools for analyzing and processing HDF5 files with support for complex data structures and variable types.

from hdf5 import HDF5

# Initialize HDF5 processor
h5_file = HDF5("path/to/file.h5")

# Find variables in HDF5 file
var_locations = h5_file.findVar('variable_name')
# Returns: {'dataset_name': [0, 2, 5]} # Indices where variable appears

# Generate complete data sets
h5_dict, h5_varNames, varNames = h5_file.genSets()

# Access specific variables
if 'Time' in varNames:
    time_locations = h5_file.findVar('Time')
    print(f"Time variable found in: {time_locations}")

Key Features:

  • Support for complex HDF5 data structures
  • Efficient variable search and extraction
  • Comprehensive data set generation
  • Error handling for invalid HDF5 files
  • Memory-efficient processing of large files
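
To illustrate the kind of traversal findVar builds on, here is a stand-alone sketch using h5py directly (an assumption about the approach, not the class's actual implementation):

import h5py

def list_datasets(path):
    """Collect the full path of every dataset in an HDF5 file."""
    names = []
    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            names.append(name)
    with h5py.File(path, "r") as f:
        f.visititems(visitor)  # Walks groups recursively
    return names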

Empty Column Filter (empty_column_filter.py)

Identify and filter empty or single-value columns in spreadsheets with detailed analysis capabilities.

import pandas as pd

# Configuration
ms = "Master_1_052024_GOLDEN_RUNS.xlsx"
df = pd.read_excel(ms, sheet_name="Main Sheet", skiprows=0)

# Customization options
df.fillna(value="NULL_VALUE", inplace=True)  # Custom null value marker
headers = df.columns

# Results processing
selected_headers = []  # Headers with 0 or 1 unique values
lengths = []           # Number of unique values

# Flag columns that are empty or constant (ignoring the null marker)
for header in headers:
    unique_values = df[header][df[header] != "NULL_VALUE"].unique()
    if len(unique_values) <= 1:
        selected_headers.append(header)
        lengths.append(len(unique_values))

# Write the analysis to results.txt, e.g.:
# Column_Name    Unique_Values_Count
# Date           1
# Status         0
with open("results.txt", "w") as out:
    out.write("Column_Name    Unique_Values_Count\n")
    for header, length in zip(selected_headers, lengths):
        out.write(f"{header}    {length}\n")

Features:

  • Custom null value handling
  • Support for multi-sheet Excel files
  • Handles newline characters in headers
  • Detailed analysis output
  • Memory-efficient processing

String Replacement

The library provides both Python and Bash implementations for replacing strings across multiple files in a directory.

Bash Implementation

Create a file called replace_string.sh:

#!/bin/bash

# Usage: ./replace_string.sh "old_string" "new_string" /path/to/directory

OLD_STRING=$1
NEW_STRING=$2
DIRECTORY=$3

# Check if all arguments are provided
if [ -z "$OLD_STRING" ] || [ -z "$NEW_STRING" ] || [ -z "$DIRECTORY" ]; then
  echo "Usage: $0 OLD_STRING NEW_STRING DIRECTORY"
  exit 1
fi

# Find and replace in every regular file
# Note: the empty '' after -i is BSD/macOS sed syntax; on GNU/Linux drop it (sed -i)
find "$DIRECTORY" -type f -exec sed -i '' "s/$OLD_STRING/$NEW_STRING/g" {} +

echo "Replacement completed."

Key Features:

  • Argument validation
  • Directory-wide search
  • In-place file modification
  • Progress feedback

Usage:

./replace_string.sh "old_text" "new_text" /path/to/directory

Python Implementation

Create a file called replace_string.py:

import os
import fileinput

def replace_string_in_files(old_string, new_string, directory):
    """
    Replace strings in all files within a directory.
    
    Args:
        old_string (str): String to replace
        new_string (str): Replacement string
        directory (str): Target directory path
        
    Features:
        - Recursive directory traversal
        - Automatic backup creation
        - UTF-8 encoding support
        - Error handling
    """
    for root, dirs, files in os.walk(directory):
        for filename in files:
            if filename.endswith('.bak'):
                continue  # Skip backups created by earlier runs
            try:
                file_path = os.path.join(root, filename)
                with fileinput.FileInput(
                    file_path,
                    inplace=True,
                    backup='.bak',
                    encoding='utf-8'  # Requires Python 3.10+
                ) as file:
                    for line in file:
                        print(line.replace(old_string, new_string), end='')
            except Exception as e:
                print(f"Error processing {filename}: {str(e)}")

if __name__ == "__main__":
    import sys
    if len(sys.argv) != 4:
        print("Usage: python replace_string.py OLD_STRING NEW_STRING DIRECTORY")
        sys.exit(1)

    old_string = sys.argv[1]
    new_string = sys.argv[2]
    directory = sys.argv[3]

    replace_string_in_files(old_string, new_string, directory)

Key Features:

  • Recursive directory traversal
  • Automatic backup creation (.bak files)
  • UTF-8 encoding support
  • Comprehensive error handling
  • Progress feedback

Usage:

python replace_string.py "old_text" "new_text" /path/to/directory

Log Processing (ExSpread class)

Advanced log file processing with support for various formats and data extraction patterns.

from string_manipulator import ExSpread

# Timestamp extraction
log_processor = ExSpread(
    "path/to/log.ascii_out",
    "search_string"
)
log_processor.mk_timesheet(
    folder='output',
    fname='timestamps'
)

# URN message processing
urn_processor = ExSpread(
    "path/to/log.ascii_out",
    "696683"  # URN as string or integer
)
urn_processor.find_URN_messages(
    folder='output',
    fname='urn_messages'
)

File Format Requirements

HDF5 Files

  • Must be valid HDF5 format
  • Should contain variable names matching search patterns
  • Requires appropriate read permissions
  • File size handling depends on available memory

Excel Files

  • Supports .xlsx, .xls, and .csv formats
  • Headers must be in first row unless specified
  • Column names should be unique
  • Special characters in headers are supported
  • Large files may require batch processing

Log Files

  • UTF-8 encoding required
  • Timestamp format: "YYYY-MM-DD HH:MM:SS" (see the parsing example below)
  • URN messages must follow specified format
  • Line endings must be consistent
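
For reference, the expected timestamp format maps directly to a strptime pattern (the values shown are placeholders):

from datetime import datetime

# Parse a "YYYY-MM-DD HH:MM:SS" timestamp with the standard library
ts = datetime.strptime("2024-05-20 13:45:07", "%Y-%m-%d %H:%M:%S")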

Text Files

  • UTF-8 encoding recommended
  • Line endings must be consistent (see the normalization snippet below)
  • File permissions must allow read/write
  • Backup space required for operations
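
If a file arrives with mixed line endings, a quick normalization pass using only the standard library looks like this (the path is a placeholder):

# Read without newline translation, normalize to "\n", and write back
with open("input.txt", "r", encoding="utf-8", newline="") as f:
    content = f.read()
content = content.replace("\r\n", "\n").replace("\r", "\n")
with open("input.txt", "w", encoding="utf-8", newline="\n") as f:
    f.write(content)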

Limitations

Text Processing

  • Memory constraints for very large files
  • Regular expression performance varies with pattern complexity
  • Backup creation requires additional storage space
  • Some operations are not atomic

Excel Processing

  • Column limit of 256 (XLS) or 16,384 (XLSX)
  • Memory usage scales with file size
  • Some formula types may not be preserved
  • Conditional formatting may be lost

HDF5 Processing

  • Limited support for custom data types
  • Memory requirements scale with dataset size
  • Nested group depth may affect performance
  • Some metadata may not be preserved

Log Processing

  • Specific format requirements
  • Time zone handling limitations
  • Performance degrades with file size
  • Limited support for compressed logs

Error Handling

The library implements comprehensive error handling; a typical calling pattern is sketched after the list below:

  1. File Operations:

    • FileNotFoundError for missing files
    • PermissionError for access issues
    • Encoding errors for text files
  2. Data Processing:

    • ValueError for invalid data
    • TypeError for mismatched data types
    • Memory errors for large datasets
  3. Format-Specific:

    • HDF5 format validation
    • Excel worksheet validation
    • Log format verification
  4. Recovery:

    • Automatic backup creation
    • Transaction rollback where applicable
    • Error logging and reporting
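
As noted above, a minimal calling pattern that accounts for these error classes might look like the following sketch (it assumes the input is read from disk into a Text object first; the exact exceptions raised depend on the operation):

from string_manipulator import Text

try:
    with open("path/to/input.txt", encoding="utf-8") as f:
        text = Text(f.read())
    text.split_by_lines(divlines=100, ext='txt', folder='output')
except FileNotFoundError as err:
    print(f"Missing file: {err}")
except PermissionError as err:
    print(f"Access denied: {err}")
except UnicodeDecodeError as err:
    print(f"Encoding problem: {err}")
except (ValueError, TypeError) as err:
    print(f"Invalid data: {err}")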

Contributing

Detailed contribution guidelines:

  1. Code Style:

    • Follow PEP 8
    • Include docstrings
    • Add type hints
    • Write unit tests
  2. Pull Requests:

    • Create feature branch
    • Add tests
    • Update documentation
    • Follow commit message format
  3. Testing:

    • Run existing test suite
    • Add new test cases
    • Include edge cases
    • Test with large datasets

License

This project is licensed under the MIT License. See the LICENSE file for details.

Notes

Development Considerations:

  • Performance optimization prioritizes reliability
  • Memory management focuses on large file handling
  • Error handling emphasizes data preservation
  • Backup creation may impact storage requirements
  • Cross-platform compatibility may vary
  • Some operations require administrative privileges
  • logging module integration recommended
  • Consider batch processing for large datasets
