News scraping app that scrapes news articles from a web site -- Node, Express, MongoDB

mmakino/NewsHeadlines

All the News That's Fit to Scrape

News Headlines -- EE Times edition

Overview

  • News Headlines is a news scraping app that scrapes news articles from a web site and displays them on the main page. In this edition, the app fetches articles from EE Times and displays the following:

    • Headline - the title of the article
    • Summary - a short summary of the article
    • Link - the URL of the original article
    • Image - an image from the article
  • Along with each article, there are Delete and Comment buttons.

    • The Delete button deletes the article from the internal database.
    • The Comment button lets a user add comment(s) to the article.
      • A Post Comment form is available at the bottom of the page.
      • On the comment page, any existing comments are listed.
      • A new comment is appended at the bottom of the list.
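The fields listed above could be modeled with Mongoose roughly as follows. This is a minimal sketch, not the repository's actual models: the field names, the `buildArticleModel` helper, and the choice to embed comments inside the article document are all assumptions made for illustration. `mongoose` is passed in as an argument so the definitions can be read without a running database.

```javascript
// Hypothetical model sketch; the real schema under models/ may differ.
const articleFields = {
  headline: { type: String, required: true },
  summary: { type: String, default: '' },
  link: { type: String, required: true, unique: true }, // unique index helps avoid duplicates on re-scrape
  image: { type: String, default: '' },
};

const commentFields = {
  body: { type: String, required: true },
  createdAt: { type: Date, default: Date.now },
};

// `mongoose` is the object returned by require('mongoose').
function buildArticleModel(mongoose) {
  const commentSchema = new mongoose.Schema(commentFields);
  const articleSchema = new mongoose.Schema({
    ...articleFields,
    comments: [commentSchema], // embedded documents; pushing a comment appends it last
  });
  return mongoose.model('Article', articleSchema);
}
```

With this shape, deleting an article also removes its embedded comments, which matches the Delete behavior described above.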

Objectives

  • Implement the app as an MVC application with a RESTful API, using Node.js, Express, and Handlebars
  • Utilize MongoDB and Mongoose ODM including multiple collections and embedded documents
  • Scrape an external web site for news articles using cheerio and axios
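As a rough illustration of the scraping objective, the sketch below fetches a page with axios and walks it with cheerio. The CSS selectors (`article`, `h2`, `p`) are placeholders and do not reflect the real EE Times markup, and the function names `normalizeArticle` and `scrapeArticles` are hypothetical. axios and cheerio are loaded inside the function, so the pure helper works even before `npm install`.

```javascript
// Sketch only: the selectors below are placeholders, not EE Times' real markup.
const BASE_URL = 'http://www.eetimes.com';

// Turn one raw scraped record into the shape shown on the main page.
function normalizeArticle(raw, baseUrl = BASE_URL) {
  const headline = (raw.headline || '').trim();
  const href = (raw.link || '').trim();
  if (!headline || !href) return null; // skip incomplete entries
  return {
    headline,
    summary: (raw.summary || '').trim(),
    link: new URL(href, baseUrl).href, // resolves relative links against the site
    image: raw.image ? new URL(raw.image.trim(), baseUrl).href : '',
  };
}

// Fetch the page with axios and extract articles with cheerio.
async function scrapeArticles(url = BASE_URL) {
  const axios = require('axios'); // loaded lazily; requires `npm install`
  const cheerio = require('cheerio');
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);
  const articles = [];
  $('article').each((i, el) => { // placeholder selector
    const item = normalizeArticle({
      headline: $(el).find('h2').text(), // placeholder selector
      summary: $(el).find('p').first().text(),
      link: $(el).find('a').attr('href'),
      image: $(el).find('img').attr('src'),
    }, url);
    if (item) articles.push(item);
  });
  return articles;
}
```

Keeping the normalization separate from the network call makes it possible to check the output shape without hitting the remote site.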

Deployment and Availability

Installation

  • This full-stack app can also be installed locally through the following steps:
  1. Clone the git repository
    git clone https://github.com/mmakino/NewsHeadlines.git
    
  2. Install necessary packages
    npm install
    
    • This app uses the following NPM packages:
    "axios": "^0.18.0",
    "cheerio": "^1.0.0-rc.2",
    "express": "^4.16.4",
    "express-handlebars": "^3.0.1",
    "mongoose": "^5.4.15"
    
  3. MongoDB database
    • The MongoDB server (mongod) needs to be up and running, and the app needs full CRUD privileges on the database.
    • models/index.js contains the default connection setup:
      • mongodb://localhost/newsHeadlines
    models/
    └── index.js
    
    • Default setting
      • HOST: localhost
      • PORT: 3003
  4. Start the web server
    npm run server
    
    • The server should print the following messages once it has started successfully:
      ...
      ...
      App running on port 3003!
      Connected to MongoDB mongodb://localhost/newsHeadlines
      
  5. Open the web page in a browser by entering the following URL into the address bar.
    http://localhost:3003/
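The defaults used in the steps above (host `localhost`, port `3003`, database `mongodb://localhost/newsHeadlines`) could be resolved in one place, as in the sketch below. The environment variable names `HOST`, `PORT`, and `MONGODB_URI` and the `resolveConfig` helper are assumptions; the repository's own models/index.js may read its settings differently.

```javascript
// Defaults taken from the installation steps; the override mechanism is an assumption.
const DEFAULTS = {
  host: 'localhost',
  port: 3003,
  mongoUri: 'mongodb://localhost/newsHeadlines',
};

// Build the effective configuration from an environment-like object.
function resolveConfig(env = process.env) {
  const port = Number.parseInt(env.PORT, 10);
  return {
    host: env.HOST || DEFAULTS.host,
    port: Number.isNaN(port) ? DEFAULTS.port : port, // fall back when PORT is unset or not a number
    mongoUri: env.MONGODB_URI || DEFAULTS.mongoUri,
  };
}
```

For example, `resolveConfig({})` returns the three defaults unchanged, while `resolveConfig({ PORT: '8080' })` overrides only the port.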
    

Limitations and Potential Issues

  • EE Times (http://www.eetimes.com) appears to be an anti-scraping site
  • Local server on your computer
  • Heroku deployment
    • Unfortunately, the User-Agent spoofing workaround does not seem to work.
    • The request gets code=H12 desc="Request timeout"
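For context, a User-Agent workaround with axios might look like the sketch below. It does not solve the blocking described above. It only sends a browser-like header and gives up before Heroku's router does, since H12 is raised when a response takes longer than 30 seconds. The User-Agent string, the 25-second budget, and the helper names are illustrative assumptions, not values from this repository.

```javascript
// Sketch of request options for the scraper; all concrete values are assumptions.
const HEROKU_ROUTER_LIMIT_MS = 30000; // H12 fires when a response takes longer than 30 s

function buildRequestOptions(timeoutMs = 25000) {
  return {
    // Fail on our side before the router times out the request.
    timeout: Math.min(timeoutMs, HEROKU_ROUTER_LIMIT_MS - 1000),
    headers: {
      'User-Agent':
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0 Safari/537.36',
      Accept: 'text/html',
    },
  };
}

async function fetchPage(url) {
  const axios = require('axios'); // loaded lazily; requires `npm install`
  try {
    const response = await axios.get(url, buildRequestOptions());
    return response.data;
  } catch (err) {
    // With default axios settings, a timeout usually surfaces as err.code === 'ECONNABORTED'.
    console.error(`Fetch failed for ${url}: ${err.code || err.message}`);
    return null;
  }
}
```

Returning `null` on failure lets the route respond with an error page promptly instead of leaving the request hanging until the platform cuts it off.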

Demo

  • The main page displays scraped articles from EE Times

    (screenshot: main page)

  • Pressing the Delete button of each article will delete the article.

  • The Scrape button on the navbar on top will fetch and (re-)populate articles.

  • Pressing the Comment button of an article takes the user to that article's page, which lists its comments (if any). The user can post a comment using the Post Comment form at the bottom of the page.

    (screenshot: article page)

  • A comment can be deleted by pressing the Delete button for each comment.

    (screenshot: comment)

  • The Commented button on the navbar displays only the articles that have comments.
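The buttons described in this demo could map onto Express routes along the lines of the sketch below. The paths, handler names, and the `registerRoutes` and `onlyCommented` helpers are hypothetical; the repository's actual routes may be organized differently.

```javascript
// Hypothetical route table; paths and handler names are assumptions.
const routes = [
  { method: 'get', path: '/', handler: 'listArticles' },
  { method: 'get', path: '/scrape', handler: 'scrapeArticles' },
  { method: 'get', path: '/commented', handler: 'listCommented' },
  { method: 'get', path: '/articles/:id', handler: 'showArticle' },
  { method: 'delete', path: '/articles/:id', handler: 'deleteArticle' },
  { method: 'post', path: '/articles/:id/comments', handler: 'addComment' },
  { method: 'delete', path: '/articles/:id/comments/:commentId', handler: 'deleteComment' },
];

// `app` is an Express application or Router; `controller` maps handler names
// to (req, res) functions.
function registerRoutes(app, controller) {
  for (const { method, path, handler } of routes) {
    app[method](path, controller[handler]);
  }
  return app;
}

// "Commented" view: keep only articles with at least one embedded comment.
function onlyCommented(articles) {
  return articles.filter((a) => Array.isArray(a.comments) && a.comments.length > 0);
}
```

Keeping the table as plain data makes it easy to see which HTTP verb backs each button before reading any handler code.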

Written by Motohiko Makino
