Natural LangWiz is a repository for exploring Natural Language Processing (NLP) techniques through Jupyter notebooks. It covers everything from text preprocessing and sentiment analysis to advanced transformer models. Dive in to see how we turn raw text into actionable insights with a touch of NLP wizardry!

Asifdotexe/Natural-LangWiz

Natural LangWiz

Welcome to the Natural LangWiz repository! Here, we perform a bit of language wizardry to make text data magically understandable for machines. With our collection of Jupyter notebooks, we delve into various aspects of Natural Language Processing (NLP), offering detailed explanations and hands-on examples.

Think of us as modern-day language wizards, transforming raw text into structured data and insightful information—no magic wand required!

Table of Contents

  1. Data Preprocessing
  2. Web Scraping
  3. Word Cloud
  4. Emojification
  5. Sentiment Analysis
  6. Named Entity Recognition
  7. Similarity Checking
  8. Spam Detection
  9. Transformer Models
  10. Translation
  11. Vectorization
  12. API Calling
  13. Grammar Checking
  14. N-Grams
  15. Demojification
  16. Python Gemini Integration
  17. Topic Modelling

Data Preprocessing

Data preprocessing is a crucial step in NLP to clean and prepare text data for analysis and modeling. The following preprocessing steps are covered in the Data Preprocessing Notebook:

Text Cleaning

Using regular expressions (regex), unwanted characters and patterns are removed from the text to make it clean and uniform.

Converting Text to Lowercase

Converts all characters in the text to lowercase to ensure uniformity and avoid case sensitivity issues during analysis.

Removing Whitespace and Non-Textual Characters

Removes unnecessary whitespace and non-textual characters to streamline the text.

Removing Digits

Digits are removed from the text to focus on the textual content.

Tokenization

Splits the text into individual words or tokens, which are the basic units for further NLP tasks.
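The cleaning steps above, through tokenization, can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the function name `preprocess` and the exact regular expressions are assumptions chosen for the example.

```python
import re


def preprocess(text: str) -> list[str]:
    """Clean raw text and split it into lowercase word tokens."""
    text = text.lower()                        # convert to lowercase
    text = re.sub(r"\d+", " ", text)           # remove digits
    # Remove anything that is not an ASCII letter or whitespace.
    # Assumption: English-only text; accented letters are dropped too.
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace
    return text.split(" ") if text else []     # whitespace tokenization


tokens = preprocess("  Hello, World!  NLP in 2024 is FUN.  ")
print(tokens)  # ['hello', 'world', 'nlp', 'in', 'is', 'fun']
```

Splitting on whitespace is the simplest possible tokenizer. Library tokenizers handle contractions and punctuation more carefully.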

Stemming

Reduces words to their base or root form by stripping suffixes with heuristic rules. For example, "running" becomes "run". The resulting stem is not always a valid dictionary word.
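To make the idea of suffix stripping concrete, here is a deliberately small toy stemmer. It is an assumption-laden sketch with a hypothetical rule set, not the Porter algorithm or whatever stemmer the notebook uses, and it will mis-stem many words.

```python
def naive_stem(word: str) -> str:
    """Strip a few common English suffixes (toy rules, not Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


for word in ("running", "jumped", "cats", "studies"):
    print(word, "->", naive_stem(word))
# running -> run
# jumped -> jump
# cats -> cat
# studies -> studi
```

The last line shows why stems are not always real words: the rule removes "es" without knowing that the dictionary form is "study".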

Lemmatization

Similar to stemming, but more sophisticated: it uses vocabulary and part-of-speech information to reduce words to their dictionary form (lemma). For example, "running" becomes "run" as a verb, and "better" becomes "good" as an adjective.

Part of Speech Tagging

Identifies and labels the part of speech (e.g., noun, verb, adjective) for each token in the text.

Web Scraping

Web scraping is the process of extracting data from websites. The following web scraping tasks are covered in the Web Scraping Notebook:

Wikipedia Scraping using Beautiful Soup

Extracts data from Wikipedia pages using the Beautiful Soup library.

Amazon Scraping using Beautiful Soup

Extracts product data from Amazon using the Beautiful Soup library.
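Both tasks come down to parsing HTML and pulling out the elements of interest. The sketch below shows that step on a hard-coded, hypothetical HTML string using the standard library's `html.parser` as a stand-in. It is not the notebook's Beautiful Soup code, and it skips the download step entirely.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect the text of every <h2> element in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.headings: list[str] = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.headings.append("")  # start a new heading

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.headings[-1] += data  # text may arrive in several pieces


# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<html><body>
  <h1>Natural language processing</h1>
  <h2>History</h2><p>Some text.</p>
  <h2>Common <i>NLP</i> tasks</h2><p>More text.</p>
</body></html>
"""

parser = HeadingExtractor()
parser.feed(SAMPLE_HTML)
headings = [h.strip() for h in parser.headings]
print(headings)  # ['History', 'Common NLP tasks']
```

With Beautiful Soup the same extraction is usually a `find_all("h2")` call on the parsed document. Check a site's terms of service before scraping it.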

Word Cloud

A word cloud is a visual representation of text data, where the size of each word indicates its frequency or importance. The Word Cloud Notebook demonstrates how to create a word cloud from a given corpus.
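The data behind a word cloud is a word-frequency table. The sketch below computes one and maps counts to font sizes. The helper names and the linear size scaling are assumptions for illustration; the notebook's plotting code is not reproduced here.

```python
import re
from collections import Counter


def word_frequencies(corpus: str, top_n: int = 5) -> list[tuple[str, int]]:
    """Return the top_n most frequent words with their counts."""
    words = re.findall(r"[a-z']+", corpus.lower())
    return Counter(words).most_common(top_n)


def font_sizes(frequencies, min_size: int = 10, max_size: int = 48) -> dict[str, int]:
    """Scale counts linearly so the most frequent word gets max_size."""
    if not frequencies:
        return {}
    top = max(count for _, count in frequencies)
    return {
        word: round(min_size + (max_size - min_size) * count / top)
        for word, count in frequencies
    }


corpus = "NLP turns text into data. Text is everywhere, and NLP makes text useful."
top_words = word_frequencies(corpus, top_n=2)
print(top_words)              # [('text', 3), ('nlp', 2)]
print(font_sizes(top_words))  # {'text': 48, 'nlp': 35}
```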

Emojification

Emojification involves handling emojis in text data, either by removing them or replacing them with corresponding text. The following tasks are covered in the Emojification Notebook:

Removing Emojis

Uses the demoji library to identify and remove emojis from the text.

Replacing Emojis with Text

Uses the emoji library to replace emojis with their corresponding text descriptions.
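Both operations can be approximated without third-party packages, which helps show what the libraries do under the hood. This sketch is an assumption-based stand-in for demoji and emoji: the code point ranges are rough, and the `:name:` output format only imitates the style of emoji descriptions.

```python
import re
import unicodedata

# Rough emoji ranges (assumption: covers common pictographs, not every emoji,
# and ignores modifiers such as variation selectors and skin tones).
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def remove_emojis(text: str) -> str:
    """Delete every character matched by EMOJI_PATTERN."""
    return EMOJI_PATTERN.sub("", text)


def replace_emojis(text: str) -> str:
    """Replace each emoji with its Unicode name, e.g. ':grinning_face:'."""
    def describe(match: re.Match) -> str:
        # Fall back to "emoji" if this Python build does not know the name.
        name = unicodedata.name(match.group(), "emoji")
        return ":" + name.lower().replace(" ", "_") + ":"

    return EMOJI_PATTERN.sub(describe, text)


sample = "Great job \U0001F600"     # ends with the grinning face emoji
print(remove_emojis(sample))        # 'Great job ' (trailing space kept)
print(replace_emojis(sample))       # 'Great job :grinning_face:'
```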

Sentiment Analysis

Sentiment analysis determines the sentiment or emotional tone of a piece of text. The following notebooks cover different approaches:

AFINN Sentiment Analysis

The AFINN Sentiment Analysis Notebook uses the AFINN lexicon, which assigns each word an integer valence score from -5 (most negative) to +5 (most positive), to classify text as positive, negative, or neutral.
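Lexicon-based scoring is simple enough to sketch in a few lines: look up each word's valence, sum the values, and read the sign. The tiny lexicon below is an illustrative assumption in the AFINN style, not the real word list, and the function names are hypothetical.

```python
import re

# Toy lexicon (assumption): a handful of words with AFINN-style integer scores.
TOY_LEXICON = {
    "good": 3, "great": 3, "love": 3,
    "bad": -3, "terrible": -3, "hate": -3,
}


def lexicon_score(text: str) -> int:
    """Sum the valence of every known word; unknown words count as 0."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(TOY_LEXICON.get(word, 0) for word in words)


def classify(text: str) -> str:
    """Map the summed score to a polarity label by its sign."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_score("I love this great movie"), classify("I love this great movie"))
# 6 positive
print(classify("The plot was bad"))   # negative
print(classify("It is a film"))       # neutral
```

A plain sum ignores negation ("not good" still scores positive), which is a known limitation of this kind of approach.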

General Sentiment Analysis

The General Sentiment Analysis Notebook covers broader sentiment analysis techniques and models.

Named Entity Recognition

Named Entity Recognition (NER) identifies and classifies key entities in text, such as names of people, organizations, and locations. The Named Entity Recognition Notebook demonstrates how to recognize and classify entities using NER techniques.

Similarity Checking

Similarity checking involves determining how similar two pieces of text are. The Similarity Checker Notebook explores various methods to compute textual similarity.
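One common way to score textual similarity is cosine similarity between word-count vectors. The sketch below is an example of that single method under simple assumptions (lowercased, letters-only tokens). The source does not say which methods the notebook uses.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two texts' word-count vectors."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(round(cosine_similarity("the cat sat", "the cat ran"), 3))  # 0.667
print(cosine_similarity("hello world", "goodbye moon"))            # 0.0
```

Scores range from 0 (no shared words) to 1 (same word proportions). Word order is ignored.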

Spam Detection

Spam detection identifies whether a piece of text is spam or not. The Spam Detection Notebook covers techniques for classifying text as spam or non-spam.
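A classic baseline for this task is a multinomial Naive Bayes classifier over word counts. The sketch below implements it from scratch on four made-up messages. The choice of Naive Bayes, the class name, and the training data are all assumptions for illustration and may differ from the notebook's approach.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {label: Counter() for label in set(labels)}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior of the class.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log(
                        (counts[word] + 1) / (total_words + len(self.vocab))
                    )
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training data, far too small for real use.
messages = [
    "win a free prize now",
    "free money click now",
    "are we meeting for lunch today",
    "please review the project report",
]
labels = ["spam", "spam", "ham", "ham"]

spam_filter = NaiveBayesSpamFilter().fit(messages, labels)
print(spam_filter.predict("claim your free prize"))  # spam
print(spam_filter.predict("lunch meeting today"))    # ham
```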

Transformer Models

Transformer models are advanced neural network architectures for NLP tasks. The following notebooks cover different applications:

Text Summarization

The Text Summarization Notebook demonstrates how to summarize text using transformer models.

Text Generation

The Text Generation Notebook showcases generating coherent and contextually relevant text with transformer models.

Emotion Analysis

The Emotion Analysis Notebook demonstrates emotion analysis using transformer models.

Translation

The Translation Notebook covers techniques for translating text between different languages.

Vectorization

Vectorization converts text into numerical representations. The Vectorization Notebook explains different vectorization techniques, such as Bag of Words and TF-IDF.
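Both techniques can be written out by hand to show what the numbers mean. The sketch below uses the plain textbook weighting tf × log(N / df). This is an assumption, since libraries such as scikit-learn apply smoothed variants and will produce different values.

```python
import math
from collections import Counter


def bag_of_words(docs: list[str]):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def tf_idf(docs: list[str]):
    """Weight each count by term frequency times inverse document frequency."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each vocabulary word.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    vectors = []
    for row in counts:
        total = sum(row)
        vectors.append([
            (count / total) * math.log(n_docs / doc_freq[j]) if total else 0.0
            for j, count in enumerate(row)
        ])
    return vocab, vectors


docs = ["the cat sat", "the dog sat", "the dog barked"]
vocab, bow = bag_of_words(docs)
print(vocab)   # ['barked', 'cat', 'dog', 'sat', 'the']
print(bow[0])  # [0, 1, 0, 1, 1]

_, weights = tf_idf(docs)
print([round(w, 3) for w in weights[0]])  # [0.0, 0.366, 0.0, 0.135, 0.0]
```

"the" appears in every document, so its weight is 0, while "cat" appears in only one and gets the highest weight in the first document.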

API Calling

The API Calling Notebook demonstrates how to interact with external APIs to retrieve and manipulate text data.

Grammar Checking

The Grammar Checking Notebook covers techniques for identifying and correcting grammatical errors in text.

N-Grams

The N-Grams Notebook explains the concept of n-grams and their use in text analysis and modeling.
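An n-gram is a run of n consecutive tokens, so generating them is a sliding window over the token list. A minimal sketch, with the function name `ngrams` chosen for this example:

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(ngrams(tokens, 3))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```

A list shorter than n yields no n-grams, so the function returns an empty list in that case.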

Demojification

Demojification involves handling emojis in text data, either by removing or replacing them. For more details, refer to the Demojification Notebook.

Python Gemini Integration

Python Gemini Notebook

The Python Gemini Notebook contains the code for using Gemini through Python.

Gemini TKinter Script

The Gemini TKinter Script runs Gemini from Python through a Tkinter graphical interface.

Gemini Chat Interface

Topic Modelling

The Topic Modelling Notebook explains topic modelling with code, using LDA (Latent Dirichlet Allocation) to discover topics within a corpus and pyLDAvis to visualize them.


Feel free to explore the notebooks and enhance your understanding of basic NLP concepts. If you have any questions or suggestions, please open an issue or submit a pull request.

Happy Learning!
