Given a set of URLs, the task is to:
- Extract articles from the web.
- Perform text analysis to compute various variables.
- Write the results to the Output Data Structure file.
To extract text from the given URLs:
- Used the `requests` library to fetch webpage content.
- Used `BeautifulSoup` (from `bs4`) to parse HTML and extract:
  - Main headings (`<h1>` tags).
  - Subheadings (`<strong>` tags).
  - Body paragraphs (`<p>` tags).
- Combined these elements to form the full article text, saved to local `.txt` files for further processing.
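
A minimal sketch of this extraction step, assuming a helper named `extract_article` and an example URL and file name; the real script may organize this differently:

```python
# Illustrative sketch of the extraction step described above.
import requests
from bs4 import BeautifulSoup

def extract_article(url: str, out_path: str) -> None:
    """Fetch a page and save its headings, subheadings, and paragraphs."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect the pieces named above: <h1>, <strong>, and <p> tags.
    headings = [h.get_text(strip=True) for h in soup.find_all("h1")]
    subheadings = [s.get_text(strip=True) for s in soup.find_all("strong")]
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

    article = "\n".join(headings + subheadings + paragraphs)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(article)

# Hypothetical URL and output file, for illustration only.
extract_article("https://example.com/article", "article_1.txt")
```
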
To prepare the text for analysis:
- Removed punctuation and converted the text to lowercase for consistency.
- Split the text into sentences and words using Python's `re` library.
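
A short sketch of this preparation step; the `prepare_text` helper and the sample string are illustrative:

```python
# Illustrative cleaning/tokenizing step using only the re library.
import re

def prepare_text(text: str):
    """Lowercase the text, split into sentences, then into punctuation-free words."""
    # Split on sentence-ending punctuation before stripping it out.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    cleaned = re.sub(r"[^\w\s]", "", text.lower())  # drop punctuation
    words = cleaned.split()
    return sentences, words

sentences, words = prepare_text("NLP is fun. Truly, it is!")
print(len(sentences), len(words))  # 2 sentences, 6 words
```
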
Using pre-defined positive and negative word lists:
- Counted the occurrences of positive and negative words in the text.
- Calculated:
  - Polarity Score: `(positive count - negative count) / (positive count + negative count)`
  - Subjectivity Score: `(positive count + negative count) / total word count`

Computed the following metrics for each article (a combined sketch follows this list):
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
- Average Sentence Length
- Percentage of Complex Words
- Fog Index
- Average Words per Sentence
- Complex Word Count
- Word Count
- Syllables per Word
- Personal Pronouns
- Average Word Length
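
A combined sketch of the sentiment and readability calculations. The word lists are tiny stand-ins for the project's pre-defined dictionaries, the `1e-6` division guard and the more-than-two-syllables threshold for complex words are assumptions, and only a subset of the listed metrics is shown for brevity:

```python
# Illustrative scoring step; POSITIVE/NEGATIVE are placeholder word lists.
import re

POSITIVE = {"good", "great", "positive"}
NEGATIVE = {"bad", "poor", "negative"}

def count_syllables(word: str) -> int:
    """Rough vowel-group count; common 'es'/'ed' endings are discounted."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith(("es", "ed")) and count > 1:
        count -= 1
    return max(count, 1)

def compute_metrics(words, sentences):
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    # Small guards against division by zero (an assumption, not from the text).
    polarity = (pos - neg) / ((pos + neg) + 1e-6)
    subjectivity = (pos + neg) / (len(words) + 1e-6)
    avg_sentence_len = len(words) / max(len(sentences), 1)
    complex_words = [w for w in words if count_syllables(w) > 2]
    pct_complex = len(complex_words) / max(len(words), 1)
    # Fog index as commonly defined in this assignment; the textbook
    # Gunning fog multiplies the complex-word share by 100.
    fog_index = 0.4 * (avg_sentence_len + pct_complex)
    return {
        "POLARITY SCORE": polarity,
        "SUBJECTIVITY SCORE": subjectivity,
        "AVG SENTENCE LENGTH": avg_sentence_len,
        "PERCENTAGE OF COMPLEX WORDS": pct_complex,
        "FOG INDEX": fog_index,
        "COMPLEX WORD COUNT": len(complex_words),
        "WORD COUNT": len(words),
    }
```
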
To consolidate results:
- Used `pandas` to read and update the input CSV file.
- Mapped extracted articles to their respective `URL_ID` in the CSV.
- Added computed metrics as new columns in the CSV.
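
A sketch of this consolidation step, reusing `prepare_text` and `compute_metrics` from the sketches above; the file names and the per-`URL_ID` `.txt` naming scheme are assumptions:

```python
# Illustrative consolidation step; "input.csv"/"output.csv" are placeholders.
import pandas as pd

df = pd.read_csv("input.csv")  # assumed to contain a URL_ID column

rows = []
for url_id in df["URL_ID"]:
    # Assumes each article was saved as <URL_ID>.txt in the extraction step.
    with open(f"{url_id}.txt", encoding="utf-8") as f:
        text = f.read()
    sentences, words = prepare_text(text)
    rows.append(compute_metrics(words, sentences))

# Add each metric as a new column, aligned row-by-row with URL_ID.
metrics_df = pd.DataFrame(rows)
out = pd.concat([df.reset_index(drop=True), metrics_df], axis=1)
out.to_csv("output.csv", index=False)
```
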
- Python 3.8+
- Install the required libraries: `pip install -r requirements.txt`
- Ensure your current directory contains the files and folders needed to run `model.py`.
- Run the script: `python model.py`
This project was developed by Satyam Kumar for Blackcoffer as part of an internship.
- LinkedIn: linkedin.com/in/isatyamks
- GitHub: github.com/isatyamks
- Portfolio: isatyamks
- Email: isatyamks@gmail.com