UltraClean is a fast and efficient Python library for cleaning and preprocessing text data, specifically designed for AI/ML tasks and data processing.
- Remove unwanted characters, links, emails, phone numbers, underscores, unicode characters, emojis, numbers, currencies, punctuation, HTML tags, LaTeX commands, and more.
- Handle multi-dots, extra spaces, and hashtags.
- Batch processing for efficient text cleaning.
- Spam detection and filtering using pre-trained models.
You can install UltraClean using pip:
pip install ultraclean
The `cleanup` function provides a comprehensive set of options for cleaning text data. Below is a detailed description of its arguments and usage.
- `data` (str): The text data to be cleaned.
- `remove_weird_chars` (bool): Remove unwanted characters like newlines, tabs, etc. Default is `True`.
- `remove_links` (bool): Remove URLs from the text. Default is `True`.
- `remove_emails` (bool): Remove email addresses from the text. Default is `True`.
- `remove_phones` (bool): Remove phone numbers from the text. Default is `True`.
- `remove_underscores` (bool): Remove underscores and other special characters. Default is `True`.
- `remove_unicode` (bool): Remove or replace certain unicode characters. Default is `True`.
- `remove_multi_dots` (bool): Replace multiple dots with a single dot. Default is `True`.
- `remove_extra_spaces` (bool): Remove extra spaces from the text. Default is `True`.
- `remove_hashtags` (bool): Remove hashtags from the text. Default is `True`.
- `remove_emojis` (bool): Remove emojis from the text. Default is `True`.
- `remove_numbers` (bool): Remove numbers from the text. Default is `False`.
- `remove_currencies` (bool): Remove currency symbols from the text. Default is `True`.
- `remove_punctuation` (bool): Remove punctuation from the text. Default is `False`.
- `remove_html` (bool): Remove HTML tags from the text. Default is `True`.
- `remove_latex` (bool): Remove LaTeX commands from the text. Default is `True`.
from ultraclean.clean import cleanup
text = "Congratulations! You've won a free trip to Hawaii. Click here to claim your prize. This is not a scam."
cleaned_text = cleanup(text)
print(cleaned_text)
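Because most flags default to `True`, keeping a specific kind of content means switching its flag off. The sketch below builds the keyword arguments for that. It assumes `cleanup` accepts the documented flags as keyword arguments; `cleanup_options` and `clean_keeping` are hypothetical helpers written for illustration, not part of UltraClean.

```python
def cleanup_options(keep=()):
    """Build keyword arguments that disable the `remove_*` flag for each kept item.

    The names in `keep` are the suffixes of the documented flags,
    e.g. "links" maps to `remove_links`. This helper is hypothetical.
    """
    return {f"remove_{name}": False for name in keep}


def clean_keeping(text, keep=()):
    """Run `cleanup` while retaining the listed kinds of content.

    Assumes `cleanup` takes the documented flags as keyword arguments.
    The import is lazy so the helper above works without UltraClean installed.
    """
    from ultraclean.clean import cleanup  # requires `pip install ultraclean`
    return cleanup(text, **cleanup_options(keep))


options = cleanup_options(keep=("links", "emails"))
print(options)  # {'remove_links': False, 'remove_emails': False}
```

With UltraClean installed, `clean_keeping(text, keep=("links", "emails"))` would leave URLs and email addresses in place while the remaining defaults still apply.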
The `Spam` class provides methods for detecting and filtering spam text using a pre-trained model.
- `__init__(self, cache_dir=None, device=None)`: Initialize the spam detector with an optional cache directory and device (CPU or GPU).
- `predict(self, text)`: Predict whether the given text is spam. Returns `True` if spam, otherwise `False`.
- `filter(self, paragraph)`: Filter out spam sentences from a paragraph. Returns the cleaned paragraph.
from ultraclean.predict import Spam
spam_detector = Spam()
text = "Congratulations! You've won a free trip to Hawaii. Click here to claim your prize."
is_spam = spam_detector.predict(text)
print(f"Is the text spam? {'Yes' if is_spam else 'No'}")
paragraph = "Congratulations! You've won a free trip to Hawaii. Click here to claim your prize. This is not a scam."
cleaned_paragraph = spam_detector.filter(paragraph)
print(cleaned_paragraph)
Sample Output:
Is the text spam? Yes
Congratulations! You've won a free trip to Hawaii. This is not a scam.
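To make the sentence-level behaviour of `filter` concrete, here is a minimal sketch of the general idea: split a paragraph into sentences, drop the ones a predicate flags, and rejoin the rest. This is not UltraClean's implementation. The regex sentence splitter, the `filter_sentences` helper, and the keyword-based `looks_like_spam` stand-in are all assumptions made for illustration; the library uses a pre-trained model instead.

```python
import re


def filter_sentences(paragraph, is_spam):
    """Drop every sentence for which `is_spam(sentence)` returns True.

    Sentences are split naively after '.', '!' or '?' followed by whitespace.
    """
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    kept = [s for s in sentences if s and not is_spam(s)]
    return " ".join(kept)


def looks_like_spam(sentence):
    """Hypothetical stand-in for a model prediction: flags one fixed phrase."""
    return "click here" in sentence.lower()


paragraph = (
    "Congratulations! You've won a free trip to Hawaii. "
    "Click here to claim your prize. This is not a scam."
)
print(filter_sentences(paragraph, looks_like_spam))
# Congratulations! You've won a free trip to Hawaii. This is not a scam.
```

Passing the predicate as an argument means a real detector's `predict` method could be supplied in place of the stand-in.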
- Use UltraClean for preprocessing text data before feeding it into AI/ML models.
- Use the `cleanup` function to remove unwanted characters and standardize text data.
- Use the `Spam` class to detect and filter out spam content from text data.
- Do not use UltraClean for tasks that require preserving the original formatting of text data.
- Avoid using the `cleanup` function with all options enabled if you need to retain specific types of information (e.g., links, emails).
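The recommendations above amount to a small preprocessing step in front of a model: discard spam, then clean what remains. A minimal sketch follows, with the cleaning function and spam predicate passed in as callables. `preprocess` is a hypothetical helper, and checking for spam before cleaning is an assumed ordering, not something UltraClean prescribes. The two lambdas are stand-ins so the example runs without the library.

```python
from typing import Callable, Iterable, List


def preprocess(
    texts: Iterable[str],
    clean: Callable[[str], str],
    is_spam: Callable[[str], bool],
) -> List[str]:
    """Drop texts flagged as spam, then clean the remaining ones."""
    kept = []
    for text in texts:
        if is_spam(text):
            continue  # skip spam before spending time on cleaning
        kept.append(clean(text))
    return kept


# Stand-ins for illustration only; with UltraClean installed these could be
# `cleanup` and `Spam().predict` respectively.
collapse_spaces = lambda s: " ".join(s.split())
keyword_spam = lambda s: "click here" in s.lower()

docs = ["Hello   world", "Click here to claim your prize."]
print(preprocess(docs, collapse_spaces, keyword_spam))  # ['Hello world']
```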
This project is licensed under the MIT License with an attribution requirement.
Ranit Bhowmick • bhowmickranitking@duck.com