Given a set of URLs, the task is to:
- Extract articles from the web.
- Perform text analysis to compute various variables.
- Write the results to the Output Data Structure file.
To extract text from the given URLs:
- Used the `requests` library to fetch webpage content.
- Used `BeautifulSoup` (from `bs4`) to parse HTML and extract:
  - Main headings (`<h1>` tags).
  - Subheadings (`<strong>` tags).
  - Body paragraphs (`<p>` tags).
- Combined these elements to form the full article text, saved to local `.txt` files for further processing.
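
A minimal sketch of this extraction step, assuming a helper named `extract_article` and an example URL and file name; the real script may organize this differently:

```python
# Illustrative sketch of the extraction step described above.
import requests
from bs4 import BeautifulSoup

def extract_article(url: str, out_path: str) -> None:
    """Fetch a page and save its headings, subheadings, and paragraphs."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect the pieces named above: <h1>, <strong>, and <p> tags.
    headings = [h.get_text(strip=True) for h in soup.find_all("h1")]
    subheadings = [s.get_text(strip=True) for s in soup.find_all("strong")]
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

    article = "\n".join(headings + subheadings + paragraphs)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(article)

# Hypothetical URL and output file, for illustration only.
extract_article("https://example.com/article", "article_1.txt")
```
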
To prepare the text for analysis:
- Removed punctuation and converted the text to lowercase for consistency.
- Split the text into sentences and words using Python's `re` library.
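
A short sketch of this preparation step; the `prepare_text` helper and the sample string are illustrative:

```python
# Illustrative cleaning/tokenizing step using only the re library.
import re

def prepare_text(text: str):
    """Lowercase the text, split into sentences, then into punctuation-free words."""
    # Split on sentence-ending punctuation before stripping it out.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    cleaned = re.sub(r"[^\w\s]", "", text.lower())  # drop punctuation
    words = cleaned.split()
    return sentences, words

sentences, words = prepare_text("NLP is fun. Truly, it is!")
print(len(sentences), len(words))  # 2 sentences, 6 words
```
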
Using pre-defined positive and negative word lists:
- Counted the occurrences of positive and negative words in the text.
- Calculated:
  - Polarity Score: `(positive count - negative count) / (positive count + negative count)`
  - Subjectivity Score: `(positive count + negative count) / total word count`

Computed the following metrics for each article (a combined sketch follows this list):
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
- Average Sentence Length
- Percentage of Complex Words
- Fog Index
- Average Words per Sentence
- Complex Word Count
- Word Count
- Syllables per Word
- Personal Pronouns
- Average Word Length
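
A combined sketch of the sentiment and readability calculations. The word lists are tiny stand-ins for the project's pre-defined dictionaries, the `1e-6` division guard and the more-than-two-syllables threshold for complex words are assumptions, and only a subset of the listed metrics is shown for brevity:

```python
# Illustrative scoring step; POSITIVE/NEGATIVE are placeholder word lists.
import re

POSITIVE = {"good", "great", "positive"}
NEGATIVE = {"bad", "poor", "negative"}

def count_syllables(word: str) -> int:
    """Rough vowel-group count; common 'es'/'ed' endings are discounted."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith(("es", "ed")) and count > 1:
        count -= 1
    return max(count, 1)

def compute_metrics(words, sentences):
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    # Small guards against division by zero (an assumption, not from the text).
    polarity = (pos - neg) / ((pos + neg) + 1e-6)
    subjectivity = (pos + neg) / (len(words) + 1e-6)
    avg_sentence_len = len(words) / max(len(sentences), 1)
    complex_words = [w for w in words if count_syllables(w) > 2]
    pct_complex = len(complex_words) / max(len(words), 1)
    # Fog index as commonly defined in this assignment; the textbook
    # Gunning fog multiplies the complex-word share by 100.
    fog_index = 0.4 * (avg_sentence_len + pct_complex)
    return {
        "POLARITY SCORE": polarity,
        "SUBJECTIVITY SCORE": subjectivity,
        "AVG SENTENCE LENGTH": avg_sentence_len,
        "PERCENTAGE OF COMPLEX WORDS": pct_complex,
        "FOG INDEX": fog_index,
        "COMPLEX WORD COUNT": len(complex_words),
        "WORD COUNT": len(words),
    }
```
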
To consolidate results:
- Used `pandas` to read and update the input CSV file.
- Mapped extracted articles to their respective `URL_ID` in the CSV.
- Added computed metrics as new columns in the CSV.
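
A sketch of this consolidation step, reusing `prepare_text` and `compute_metrics` from the sketches above; the file names and the per-`URL_ID` `.txt` naming scheme are assumptions:

```python
# Illustrative consolidation step; "input.csv"/"output.csv" are placeholders.
import pandas as pd

df = pd.read_csv("input.csv")  # assumed to contain a URL_ID column

rows = []
for url_id in df["URL_ID"]:
    # Assumes each article was saved as <URL_ID>.txt in the extraction step.
    with open(f"{url_id}.txt", encoding="utf-8") as f:
        text = f.read()
    sentences, words = prepare_text(text)
    rows.append(compute_metrics(words, sentences))

# Add each metric as a new column, aligned row-by-row with URL_ID.
metrics_df = pd.DataFrame(rows)
out = pd.concat([df.reset_index(drop=True), metrics_df], axis=1)
out.to_csv("output.csv", index=False)
```
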
- Python 3.8+
- Install the required libraries: `pip install -r requirements.txt`
- Ensure your current directory contains the files and folders needed to run `model.py`.
- Run the script: `python model.py`
This project was developed by Satyam Kumar for Blackcoffer as part of an internship.
- LinkedIn: linkedin.com/in/isatyamks
- GitHub: github.com/isatyamks
- Portfolio: isatyamks
- Email: isatyamks@gmail.com