News Summarizer App using Python

A news summarizer app built with Python and newspaper3k. It uses requests to fetch a news article from a given URL, extracts the article data and a summary of the text, and uses WTForms and Flask to take the URL input and display the results.

View Demo

About the Project

Previously, I built a simple News Scraper app on the web using Python, Beautiful Soup, and Flask to scrape the latest news from a specific news site.

This time, I built a slightly more advanced version of the app that scrapes data from a single news article using the Python package newspaper3k, then deployed it with Flask on Google App Engine.

First, when the form captures the URL of a news article, the newspaper3k package downloads and parses the article. If the form entry is not a valid URL of a news site, an error message appears. For form input handling and validation, I used the WTForms and requests libraries to grab the URL entered in the form. From the parsed data, I then extract the following fields to render in the first part of the result page:

Title
Published date
Author
Top image (source link)
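
As a rough illustration of how these four fields can be read with newspaper3k, here is a minimal sketch. It is not the app's actual code: the helper names `fetch_article` and `first_part_fields`, the date format, and the "Unknown" fallbacks are assumptions made for this example. Only `Article`, `download()`, `parse()`, and the `title`, `publish_date`, `authors`, and `top_image` attributes come from newspaper3k.

```python
from datetime import datetime
from types import SimpleNamespace


def fetch_article(url):
    """Download and parse one article (needs network access and newspaper3k)."""
    # Imported lazily so the offline demonstration below runs without the package.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return article


def first_part_fields(article):
    """Hypothetical helper: gather the fields for the first part of the result page."""
    published = article.publish_date  # can be None when no date is found
    return {
        "title": article.title,
        "published_date": published.strftime("%Y-%m-%d") if published else "Unknown",
        "author": ", ".join(article.authors) or "Unknown",
        "top_image": article.top_image,  # source link of the image
    }


# Offline demonstration with a stand-in object instead of a real download.
sample = SimpleNamespace(
    title="Example headline",
    publish_date=datetime(2020, 5, 17),
    authors=["Jane Doe", "John Roe"],
    top_image="https://example.com/top.jpg",
)
fields = first_part_fields(sample)
print(fields["published_date"], "|", fields["author"])
```

With a real article, `first_part_fields(fetch_article(url))` would produce the same dictionary shape, which could then be passed to a Flask template.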

(*Please note: the WordCloud is currently disabled due to an image storage issue.)

At the same time, my app uses the full text extracted from the article to generate a WordCloud. The WordCloud on the result page displays the most frequent words in the extracted text. The io library keeps the WordCloud image in memory, and base64 encodes the resulting bytes so that the image can be returned and rendered as part of the HTML response.
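
The in-memory approach described above could look roughly like the sketch below. The function names, the image size, and the use of a `data:` URI inside an `<img>` tag are assumptions for illustration; the app may embed the image differently. The `WordCloud(...).generate(text).to_image()` call is from the wordcloud package and returns a Pillow image.

```python
import base64
import io


def wordcloud_png_bytes(text):
    """Render a word cloud to PNG bytes without touching disk (needs wordcloud)."""
    # Imported lazily so the encoding helpers below work without the package.
    from wordcloud import WordCloud

    image = WordCloud(width=800, height=400, background_color="white").generate(text).to_image()
    buffer = io.BytesIO()  # in-memory file object instead of a saved image
    image.save(buffer, format="PNG")
    return buffer.getvalue()


def png_data_uri(png_bytes):
    """Encode raw PNG bytes as a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return "data:image/png;base64," + encoded


def img_tag(png_bytes):
    """Hypothetical helper: build an <img> element to place in the HTML response."""
    return '<img src="{}" alt="WordCloud">'.format(png_data_uri(png_bytes))
```

For example, `img_tag(wordcloud_png_bytes(article_text))` yields markup that a template can render directly, so no image file needs to be stored on the server.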

Lastly, newspaper3k can also run its simple natural language processing to extract keywords from the article and produce a summary of the article text:

Keywords (WordCloud image)
Summary

The keywords (WordCloud image) and the summary of the news text are displayed in the second part of the result page.
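
A minimal sketch of the keyword and summary step, assuming the standard newspaper3k flow of `download()`, `parse()`, then `nlp()`. The helper names `run_nlp` and `second_part_fields` and the `max_keywords` limit are hypothetical, and `nlp()` may require NLTK tokenizer data to be installed first.

```python
from types import SimpleNamespace


def run_nlp(url):
    """Download, parse, and analyse one article (needs network and newspaper3k)."""
    # Imported lazily so the offline demonstration below runs without the package.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    article.nlp()  # fills in article.keywords and article.summary
    return article


def second_part_fields(article, max_keywords=10):
    """Hypothetical helper: gather the fields for the second part of the result page."""
    return {
        "keywords": list(article.keywords)[:max_keywords],
        "summary": article.summary.strip(),
    }


# Offline demonstration with a stand-in object instead of a real download.
sample = SimpleNamespace(
    keywords=["flask", "python", "news"],
    summary="  A short summary of the article.\n",
)
result = second_part_fields(sample, max_keywords=2)
print(result["keywords"], result["summary"])
```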