Welcome to the Natural LangWiz repository! Here, we perform a bit of language wizardry to make text data magically understandable for machines. With our collection of Jupyter notebooks, we delve into various aspects of Natural Language Processing (NLP), offering detailed explanations and hands-on examples.
Think of us as modern-day language wizards, transforming raw text into structured data and insightful information—no magic wand required!
- Data Preprocessing
- Web Scraping
- Word Cloud
- Emojification
- Sentiment Analysis
- Named Entity Recognition
- Similarity Checking
- Spam Detection
- Transformer Models
- Translation
- Vectorization
- API Calling
- Grammar Checking
- N-Grams
- Demojification
- Python Gemini Integration
- Topic Modelling
Data preprocessing is a crucial step in NLP to clean and prepare text data for analysis and modeling. The following preprocessing steps are covered in the Data Preprocessing Notebook:
Using regular expressions (regex), unwanted characters and patterns are removed from the text to make it clean and uniform.
Converts all characters in the text to lowercase to ensure uniformity and avoid case sensitivity issues during analysis.
Removes unnecessary whitespace and non-textual characters to streamline the text.
Digits are removed from the text to focus on the textual content.
Splits the text into individual words or tokens, which are the basic units for further NLP tasks.
Reduces words to their base or root form by removing suffixes. For example, "running" becomes "run".
Similar to stemming, but more sophisticated. It reduces words to their dictionary form. For example, "running" becomes "run" and "better" becomes "good".
Identifies and labels the part of speech (e.g., noun, verb, adjective) for each token in the text.
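The cleaning steps above can be sketched with the standard library alone (the notebook itself uses NLTK for proper stemming, lemmatization, and POS tagging; the toy stemmer here only strips a few common suffixes and is purely illustrative):

```python
import re

def preprocess(text):
    """Minimal preprocessing sketch: regex cleaning, lowercasing,
    digit removal, whitespace normalization, and tokenization."""
    text = re.sub(r"[^\w\s]", " ", text)      # strip punctuation / unwanted characters
    text = text.lower()                        # lowercasing
    text = re.sub(r"\d+", " ", text)           # remove digits
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace
    return text.split()                        # tokenization

def toy_stem(token):
    """Toy suffix-stripping stemmer (illustrative only; the notebook
    uses a real stemmer such as NLTK's PorterStemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = preprocess("The 2 cats, quickly WALKING!")
stems = [toy_stem(t) for t in tokens]
```

Here `preprocess` yields `["the", "cats", "quickly", "walking"]`, and `toy_stem` maps `"walking"` to `"walk"` and `"cats"` to `"cat"`.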
Web scraping is the process of extracting data from websites. The following web scraping tasks are covered in the Web Scraping Notebook:
Extracts data from Wikipedia pages using the Beautiful Soup library.
Extracts product data from Amazon using the Beautiful Soup library.
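The notebooks do this with Beautiful Soup; as a self-contained sketch of the same idea, the standard library's `html.parser` can pull tagged text out of an HTML page (the HTML snippet below is made up for illustration):

```python
from html.parser import HTMLParser

class HeadingExtractor(HTMLParser):
    """Collects the text inside <h1> tags, mimicking what the
    notebooks do with Beautiful Soup's soup.find_all('h1')."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings.append(data.strip())

html = "<html><body><h1>Natural Language Processing</h1><p>intro</p></body></html>"
parser = HeadingExtractor()
parser.feed(html)
```

After `feed`, `parser.headings` holds `["Natural Language Processing"]`; Beautiful Soup offers the same result with far less ceremony, which is why the notebooks use it.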
A word cloud is a visual representation of text data, where the size of each word indicates its frequency or importance. The Word Cloud Notebook demonstrates how to create a word cloud from a given corpus.
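The word sizes in a word cloud come from term frequencies. A minimal sketch of that counting step, with a tiny hand-made stop-word list (the notebook then hands such frequencies to the `wordcloud` library for rendering):

```python
from collections import Counter

def word_frequencies(corpus, stopwords=frozenset({"the", "a", "of", "and"})):
    """Count how often each non-stop-word appears; a word cloud
    scales each word's font size by this count."""
    tokens = corpus.lower().split()
    return Counter(t for t in tokens if t not in stopwords)

freqs = word_frequencies("the wizard of words and the magic of words")
```

Here `freqs` counts `"words"` twice and `"wizard"` and `"magic"` once each, so `"words"` would be drawn largest.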
Emojification involves handling emojis in text data, either by removing them or replacing them with corresponding text. The following tasks are covered in the Emojification Notebook:
Uses the demoji library to identify and remove emojis from the text.
Uses the emoji library to replace emojis with their corresponding text descriptions.
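The same two operations can be sketched with a plain regex over a few common emoji code-point ranges (the `demoji` and `emoji` libraries used in the notebook cover the full emoji set and ship complete name mappings; the range and mapping below are hand-made simplifications):

```python
import re

# A few common emoji blocks only -- the demoji library covers the full set.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u27BF]"
)

def remove_emojis(text):
    """Strip emoji characters from the text."""
    return EMOJI_RE.sub("", text)

def replace_emojis(text, mapping):
    """Swap emojis for text descriptions; the emoji library
    ships a complete mapping, this one is hand-made."""
    for char, name in mapping.items():
        text = text.replace(char, name)
    return text

cleaned = remove_emojis("great job \U0001F600")
described = replace_emojis("great \U0001F600", {"\U0001F600": ":grinning_face:"})
```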
Sentiment analysis determines the sentiment or emotional tone of a piece of text. The following notebooks cover different approaches:
The AFINN Sentiment Analysis Notebook uses the AFINN lexicon to classify sentiment into positive, negative, or neutral.
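The AFINN approach is simple arithmetic: each lexicon word carries an integer valence (roughly -5 to +5) and a text's score is the sum over its words. A hand-rolled miniature of that idea (the real lexicon has thousands of entries and is loaded via the `afinn` package in the notebook; the six-word lexicon here is made up):

```python
# Tiny hand-made valence lexicon, AFINN-style.
MINI_AFINN = {"good": 3, "great": 3, "happy": 3, "bad": -3, "awful": -3, "sad": -2}

def afinn_score(text):
    """Sum the valence of every known word; unknown words score 0."""
    return sum(MINI_AFINN.get(w, 0) for w in text.lower().split())

def classify(text):
    """Map the summed score to a sentiment label."""
    score = afinn_score(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

For example, `classify("what a great and happy day")` is `"positive"`, while a text with no lexicon words scores 0 and comes out `"neutral"`.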
The General Sentiment Analysis Notebook covers broader sentiment analysis techniques and models.
Named Entity Recognition (NER) identifies and classifies key entities in text, such as names of people, organizations, and locations. The Named Entity Recognition Notebook demonstrates how to recognize and classify entities using NER techniques.
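As a flavor of what NER produces, here is a deliberately naive rule-based sketch that flags runs of capitalized words as candidate entities. Real NER, as covered in the notebook, uses trained statistical models (e.g. spaCy) that also assign entity types such as PERSON, ORG, or GPE, and this toy rule would wrongly flag sentence-initial words:

```python
import re

def candidate_entities(text):
    """Flag runs of capitalized words as candidate named entities.
    Purely illustrative -- trained NER models are far more accurate
    and also label the entity type."""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)

entities = candidate_entities("yesterday Ada Lovelace met Charles Babbage in London")
```

This yields `["Ada Lovelace", "Charles Babbage", "London"]` for the example sentence.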
Similarity checking involves determining how similar two pieces of text are. The Similarity Checker Notebook explores various methods to compute textual similarity.
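One of the most common such methods is cosine similarity over word-count vectors. A minimal bag-of-words version (the notebook may instead use TF-IDF vectors or embeddings, but the cosine formula is identical):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity of two texts as bag-of-words vectors:
    1.0 for identical word counts, 0.0 for no shared words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Identical texts score (numerically) 1, disjoint texts score 0, and partial overlap falls in between.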
Spam detection identifies whether a piece of text is spam or not. The Spam Detection Notebook covers techniques for classifying text as spam or non-spam.
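The classic baseline for this task is multinomial Naive Bayes. The sketch below implements it from scratch with Laplace smoothing to show the underlying arithmetic; the notebook likely uses a library classifier (e.g. scikit-learn), and the four training messages here are invented:

```python
import math
from collections import Counter

class NaiveBayesSpamFilter:
    """Minimal multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        scores = {}
        total_docs = sum(self.class_counts.values())
        for c in self.class_counts:
            total_words = sum(self.word_counts[c].values())
            # log prior + sum of add-one-smoothed log likelihoods
            score = math.log(self.class_counts[c] / total_docs)
            for w in text.lower().split():
                score += math.log(
                    (self.word_counts[c][w] + 1) / (total_words + len(self.vocab))
                )
            scores[c] = score
        return max(scores, key=scores.get)

filt = NaiveBayesSpamFilter().fit(
    ["win cash prize now", "free prize claim now",
     "meeting agenda attached", "lunch tomorrow at noon"],
    ["spam", "spam", "ham", "ham"],
)
```

With this toy training set, `filt.predict("claim your free cash")` comes out `"spam"` and `filt.predict("lunch meeting tomorrow")` comes out `"ham"`.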
Transformer models are advanced neural network architectures for NLP tasks. The following notebooks cover different applications:
The Text Summarization Notebook demonstrates how to summarize text using transformer models.
The Text Generation Notebook showcases generating coherent and contextually relevant text with transformer models.
The Emotion Analysis Notebook showcases emotion and sentiment analysis using transformer models.
The Translation Notebook covers techniques for translating text between different languages.
Vectorization converts text into numerical representations. The Vectorization Notebook explains different vectorization techniques, such as Bag of Words and TF-IDF.
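Both techniques can be sketched in a few lines: Bag of Words turns each document into a vector of raw term counts over a shared vocabulary, and TF-IDF rescales those counts to down-weight terms that appear in many documents (the notebook presumably uses library vectorizers; the two example documents and the unsmoothed IDF variant here are illustrative choices):

```python
import math
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of Words: one count per vocabulary word, per document.
vocab = sorted({w for d in docs for w in d.split()})
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

def tf_idf(term, doc, docs):
    """Term frequency times inverse document frequency.
    This is the plain log(N/df) variant; libraries often smooth it."""
    words = doc.split()
    tf = words.count(term) / len(words)
    df = sum(term in d.split() for d in docs)
    return tf * math.log(len(docs) / df)
```

Note how `"the"` appears in both documents, so its IDF is `log(2/2) = 0` and its TF-IDF weight vanishes, while a distinctive word like `"cat"` keeps a positive weight.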
The API Calling Notebook demonstrates how to interact with external APIs to retrieve and manipulate text data.
The Grammar Checking Notebook covers techniques for identifying and correcting grammatical errors in text.
The N-Grams Notebook explains the concept of n-grams and their use in text analysis and modeling.
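An n-gram is just a length-n window slid over the token sequence, which makes the core idea a one-liner:

```python
def ngrams(tokens, n):
    """All contiguous length-n windows over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the wizard casts a spell".split()
bigrams = ngrams(tokens, 2)
```

For the example sentence, `bigrams` is `[("the", "wizard"), ("wizard", "casts"), ("casts", "a"), ("a", "spell")]`; setting `n=1` gives unigrams and `n=3` trigrams.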
Demojification involves handling emojis in text data, either by removing or replacing them. For more details, refer to the Demojification Notebook.
The Python Gemini Notebook contains the code for using Gemini through Python.
The Gemini TKinter Script contains the script to run Gemini through Python using an interface.
The Topic Modelling Notebook explains and implements topic modelling, using LDA (Latent Dirichlet Allocation) to discover topics within a corpus and pyLDAvis to visualize them.
Feel free to explore the notebooks and enhance your understanding of basic NLP concepts. If you have any questions or suggestions, please open an issue or submit a pull request.
Happy Learning!