All the News That's Fit to Scrape

News Headlines -- EE Times edition

Overview

News Headlines is a news scraping app that scrapes articles from a web site and displays them on the main page. In this edition, the app fetches articles from EE Times and displays the following:
- Headline - the title of the article
- Summary - a short summary of the article
- Link - the url link to the original article
- Image - an image of the article
Each article comes with Delete and Comment buttons.
- The Delete button deletes the article from the internal database.
- The Comment button lets a user add comments to the article.
  - A Post Comment form is available at the bottom of the page.
  - On the comment page, any existing comments are listed.
  - A new comment is appended at the bottom.
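
The article fields and comment behavior above can be sketched with plain JavaScript objects. This is a minimal illustration, not the app's actual Mongoose models: the function names `addComment` and `commentedOnly`, the `comments` array field, and the sample values are assumptions.

```javascript
// Hypothetical in-memory shape of one scraped article
// (field names follow the list above; values are placeholders).
const article = {
  headline: "Example headline",
  summary: "Example summary",
  link: "https://example.com/article",
  image: "https://example.com/image.png",
  comments: [], // assumed: comments are embedded in the article
};

// Return a copy of the article with the new comment appended at the end,
// so the newest comment is listed at the bottom.
function addComment(article, text) {
  return { ...article, comments: [...article.comments, { text }] };
}

// Keep only articles that have at least one comment.
function commentedOnly(articles) {
  return articles.filter((a) => a.comments.length > 0);
}

const updated = addComment(addComment(article, "first"), "second");
console.log(updated.comments.map((c) => c.text)); // [ 'first', 'second' ]
console.log(commentedOnly([article, updated]).length); // 1
```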

Objectives

Implement the app as an MVC RESTful API using Node.js, Express, and Handlebars
Utilize MongoDB and Mongoose ODM including multiple collections and embedded documents
Scrape an external web site for news articles using cheerio and axios

Deployment and Availability

Heroku
- Live app - https://stormy-cove-58118.herokuapp.com/
  - Scraping does not work on Heroku. See Limitation and Potential issue.
GitHub
- Repository - https://github.com/mmakino/NewsHeadlines

Installation

This full-stack app can also be installed locally with the following steps:

Clone the git repository

git clone https://github.com/mmakino/NewsHeadlines.git

Install necessary packages

npm install

This app uses the following NPM packages:

"axios": "^0.18.0",
"cheerio": "^1.0.0-rc.2",
"express": "^4.16.4",
"express-handlebars": "^3.0.1",
"mongoose": "^5.4.15"

MongoDB database
- The MongoDB server (mongod) must be up and running, and the app needs full CRUD privileges.
- models/index.js contains the default connection setup
  - mongodb://localhost/newsHeadlines
```
models/
└── index.js
```
- Default setting
  - HOST: localhost
  - PORT: 3003
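
As a rough sketch of how these defaults might be resolved, the snippet below falls back to the values listed above when no override is present. The environment variable names `HOST`, `PORT`, and `MONGODB_URI` and the helper `resolveConfig` are assumptions for illustration; the repository's own setup may differ.

```javascript
// Hypothetical helper: resolve settings from an environment-like object,
// falling back to the defaults listed in this README.
function resolveConfig(env = process.env) {
  return {
    host: env.HOST || "localhost",
    port: Number(env.PORT) || 3003,
    mongoUri: env.MONGODB_URI || "mongodb://localhost/newsHeadlines",
  };
}

// With no overrides, the README defaults apply.
const config = resolveConfig({});
console.log(`http://${config.host}:${config.port}/`); // http://localhost:3003/
console.log(config.mongoUri); // mongodb://localhost/newsHeadlines
```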

Start the web server

npm run server

It should display the following messages when the server has started successfully:

...
...
App running on port 3003!
Connected to MongoDB mongodb://localhost/newsHeadlines

Open the web page in a browser by entering the following URL into the address bar.
```
http://localhost:3003/
```

Limitation and Potential issue

EE Times (http://www.eetimes.com) appears to be an anti-scraping site.
Local server on your computer
- Scraping most likely works.
- As of 2019-12-09, the web site still seems to allow a few spoofed User-Agents.
  - See implementation details in routes/api/scrape.js
- The following article helped me work around being blocked.
  - https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
Heroku deployment
- Unfortunately, the User-Agent spoofing workaround does not seem to work.
- The request fails with code=H12 desc="Request timeout"
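
The User-Agent workaround can be sketched as picking one header value from a small pool and attaching it, along with a timeout, to the request options. This is an illustration only, not the code in routes/api/scrape.js: the names `USER_AGENTS`, `pickUserAgent`, and `buildRequestOptions`, the sample User-Agent strings, and the 10-second timeout are all assumptions. The resulting object uses the `headers` and `timeout` fields of the axios request config.

```javascript
// Hypothetical pool of browser-like User-Agent strings (examples only).
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0",
];

// Pick one entry; `random` is injectable so the choice can be made deterministic.
function pickUserAgent(agents = USER_AGENTS, random = Math.random) {
  return agents[Math.floor(random() * agents.length)];
}

// Build options in the shape axios accepts: axios.get(url, options).
function buildRequestOptions(timeoutMs = 10000, random = Math.random) {
  return {
    headers: { "User-Agent": pickUserAgent(USER_AGENTS, random) },
    timeout: timeoutMs, // milliseconds; axios aborts the request after this
  };
}

const options = buildRequestOptions(10000, () => 0);
console.log(options.headers["User-Agent"] === USER_AGENTS[0]); // true
// Assumed usage, given axios is installed:
//   axios.get("http://www.eetimes.com", options).then((res) => { /* parse res.data */ });
```

Heroku reports H12 when a request takes longer than 30 seconds, so a shorter client-side timeout lets the route return its own error instead. It does not make a blocked scrape succeed.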

Demo

The main page displays articles scraped from EE Times.
Pressing the Delete button of an article deletes that article.
The Scrape button on the navbar at the top fetches and (re-)populates articles.
Pressing the Comment button of an article takes the user to that article's page, which lists any existing comments. A user can post a comment using the Post Comment form at the bottom of the page.
A comment can be deleted by pressing the Delete button next to it.
The Commented button on the navbar displays only articles that have comments.

Written by Motohiko Makino
