Addition of BoW algorithm Code #57

Merged (3 commits) on Jun 29, 2024

Contributor

Arushi-N4 commented Jun 27, 2024

Closes: #24

  • Title: Addition of Bag-of-Words Algorithm Code
  • Name: Arushi Nirala
  • Identify yourself: SSOC 2024 Contributor

Describe the add-ons or changes you've made 📃

After learning from online resources, I added code for the Bag-of-Words (BoW) algorithm, which includes:

  1. Preprocessing: The preprocess_text function converts text to lowercase and removes non-alphabetic characters.
  2. Bag-of-Words Vectorization: Using CountVectorizer from scikit-learn, it creates a matrix where each row represents a document and each column represents a word in the vocabulary.
  3. Stop Words Removal: Stop words like "the", "is", "are", and "and" are removed during vectorization using stop_words='english'.
  4. Feature Extraction: vectorizer.get_feature_names_out() retrieves the feature names (words) after preprocessing and stop word removal.
  5. DataFrame Representation: The resulting bag-of-words matrix is converted into a Pandas DataFrame (bow_df), where rows are labeled by document numbers and columns by word names.
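
For reviewers who want to see what these five steps compute, here is a minimal, dependency-free sketch. It is not the code in this PR: it swaps scikit-learn's CountVectorizer and the Pandas DataFrame for plain Python, and the `STOP_WORDS` set, the `bag_of_words` helper, and the sample documents are hypothetical stand-ins chosen for illustration. Only the `preprocess_text` name and its described behavior come from the description above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list. scikit-learn's
# stop_words='english' relies on a much larger built-in list.
STOP_WORDS = {"the", "is", "are", "and", "on"}


def preprocess_text(text):
    """Lowercase the text and drop every character that is not a-z or whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    # Tokenize on whitespace after preprocessing, then filter stop words.
    tokenized = [
        [word for word in preprocess_text(doc).split() if word not in STOP_WORDS]
        for doc in documents
    ]
    # Sorted vocabulary, playing the role of the feature names (matrix columns).
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # a Counter returns 0 for words it has not seen
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Made-up sample documents, not the data used in the PR.
docs = ["The cat sat on the mat.", "The dog and the cat are friends!"]
vocabulary, matrix = bag_of_words(docs)

# Print one row per document, labeled by document number.
print(vocabulary)
for doc_number, row in enumerate(matrix, start=1):
    print(f"Document {doc_number}: {row}")
```

Each row is one document and each column is one vocabulary word, which is the same layout the description gives for bow_df.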

Type of change ☑️

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Code style update (formatting, local variables)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested? ⚙️

I tested the code in Jupyter Notebook and VS Code by running it on sample text data, verifying the preprocessing steps, and inspecting the generated bag-of-words DataFrame for correctness.

Checklist: ☑️

  • My code follows the Contributing Guidelines & Code of Conduct of this project.
  • This PR does not contain plagiarized content.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly wherever it was hard to understand.
  • My changes generate no new warnings.

Screenshots 📷

image

Note to reviewers 📄

Please review my changes and let me know if any further adjustments are needed.

Contributor

Thank you for submitting your pull request! We'll review it as soon as possible. For further communication, join our Discord server: https://discord.gg/tSqtvHUJzE

Contributor Author

Arushi-N4 commented Jun 28, 2024

  • I changed the folder name and converted the .ipynb notebook into a .py file, as you requested.

  • Screenshot

image

Kindly review it.

Owner

@Avdhesh-Varshney Avdhesh-Varshney left a comment

@Arushi-N4 PR Approved 🎉

Linked issue: 📃: Adding Bag of words(BoW) algorithm Under NLP