- News Headlines is a news scraping app that scrapes news articles from a web site and displays them on the main page. In this edition, the app fetches articles from EE Times and displays the following:
  - Headline - the title of the article
  - Summary - a short summary of the article
  - Link - the URL of the original article
  - Image - an image from the article
- Along with each article, there are Delete and Comment buttons.
  - The Delete button deletes the article from the internal database.
  - The Comment button lets a user add comments to the article. A Post Comment form is available at the bottom of the page.
    - On the comment page, any existing comments are listed.
    - A new comment is appended at the bottom.
- Implement the app with an MVC RESTful API using Node.js, Express, and Handlebars
- Utilize MongoDB and Mongoose ODM, including multiple collections and embedded documents
- Scrape an external web site for news articles using cheerio and axios
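As a rough illustration of the scraping step, the sketch below fetches a page with axios and extracts the four article fields with cheerio. It is not the code in this repository: the CSS selectors and the names `SELECTORS`, `normalizeArticle`, and `scrapeArticles` are hypothetical, and the real markup of the target site is not described here. The libraries are passed in as defaulted options so the pure helper can be used without network access.

```javascript
// Hypothetical selectors: adjust them to the actual markup of the target page.
const SELECTORS = { item: 'article', headline: 'h2', summary: 'p', link: 'a', image: 'img' };

// Pure helper: trims text and resolves relative URLs against the page URL.
function normalizeArticle(raw, pageUrl) {
  const absolute = (u) => (u ? new URL(u, pageUrl).href : '');
  return {
    headline: (raw.headline || '').trim(),
    summary: (raw.summary || '').trim(),
    link: absolute(raw.link),
    image: absolute(raw.image),
  };
}

// Fetches the page and returns an array of { headline, summary, link, image }.
// axios and cheerio are only loaded when this function is called without overrides.
async function scrapeArticles(pageUrl, { axios = require('axios'), cheerio = require('cheerio') } = {}) {
  const { data: html } = await axios.get(pageUrl);
  const $ = cheerio.load(html);
  const articles = [];
  $(SELECTORS.item).each((i, el) => {
    const node = $(el);
    articles.push(normalizeArticle({
      headline: node.find(SELECTORS.headline).first().text(),
      summary: node.find(SELECTORS.summary).first().text(),
      link: node.find(SELECTORS.link).first().attr('href'),
      image: node.find(SELECTORS.image).first().attr('src'),
    }, pageUrl));
  });
  // Keep only entries that have at least a headline and a link.
  return articles.filter((a) => a.headline && a.link);
}
```

Calling `scrapeArticles('http://www.eetimes.com')` would return a promise of the article list, which could then be saved through Mongoose.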
- Heroku
  - Live app - https://stormy-cove-58118.herokuapp.com/
    - Scraping does not work on Heroku. Please see Limitation and Potential issue
- GitHub
  - Repository - https://github.com/mmakino/NewsHeadlines
- This full stack app can also be installed locally through the following steps
  - Clone the git repository
    git clone https://github.com/mmakino/NewsHeadlines.git
  - Install necessary packages
    npm install
    - This app uses the following NPM packages:
      "axios": "^0.18.0",
      "cheerio": "^1.0.0-rc.2",
      "express": "^4.16.4",
      "express-handlebars": "^3.0.1",
      "mongoose": "^5.4.15"
  - MongoDB database
    - The MongoDB server, mongod, needs to be up and running with all CRUD privileges. models/index.js includes the default setup
      - mongodb://localhost/newsHeadlines
      models/
      └── index.js
      - Default setting
        - HOST: localhost
        - PORT: 3003
  - Start the web server
    npm run server
    - It should display the following message when the server has started successfully
      ...
      ...
      App running on port 3003!
      Connected to MongoDB mongodb://localhost/newsHeadlines
  - Open the web page in a browser by entering the following URL into the address bar.
    http://localhost:3003/
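The defaults listed in the steps above (port 3003 and mongodb://localhost/newsHeadlines) could be resolved and used roughly as follows. This is a minimal sketch, not the contents of models/index.js: the environment variable names `PORT` and `MONGODB_URI` and the function names `resolveConfig` and `connect` are assumptions, and the `useNewUrlParser` option applies to the Mongoose 5 line pinned in the package list.

```javascript
// Defaults come from this README; the environment variable names are assumptions.
function resolveConfig(env = process.env) {
  return {
    port: Number(env.PORT) || 3003,
    mongoUri: env.MONGODB_URI || 'mongodb://localhost/newsHeadlines',
  };
}

// Connects to MongoDB and logs a message similar to the one shown at startup.
// mongoose is only loaded when no replacement is passed in.
async function connect({ mongoose = require('mongoose'), env = process.env } = {}) {
  const { mongoUri } = resolveConfig(env);
  await mongoose.connect(mongoUri, { useNewUrlParser: true });
  console.log(`Connected to MongoDB ${mongoUri}`);
  return mongoose.connection;
}
```

With no environment variables set, `resolveConfig()` falls back to the local defaults, so the same code could run both locally and on a host that supplies its own port and database URI.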
- EE Times http://www.eetimes.com appears to be an anti-scraping site
  - Local server on your computer
    - Most likely, and hopefully, it works
    - As of 2019-12-09, the web site still seems to allow a few spoofed User-Agents.
      - See implementation details in routes/api/scrape.js
    - The following article helped me get around being blocked.
  - Heroku deployment
    - Unfortunately, the User-Agent spoofing workaround does not seem to work.
    - The request gets code=H12 desc="Request timeout"
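For readers unfamiliar with User-Agent spoofing, the idea can be sketched as below. This is an illustration only and not the code in routes/api/scrape.js: the User-Agent strings, the 10 second timeout, and the names `USER_AGENTS`, `requestOptions`, and `fetchHtml` are all assumptions. Heroku's router reports H12 when a request takes longer than 30 seconds, so a shorter client-side timeout at least lets the route fail quickly instead of hanging.

```javascript
// Hypothetical User-Agent strings; which ones a site accepts can change at any time.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Safari/605.1.15',
];

// Builds the axios request config for a single attempt.
function requestOptions(userAgent, timeoutMs = 10000) {
  return { headers: { 'User-Agent': userAgent }, timeout: timeoutMs };
}

// Tries each User-Agent in turn and returns the first successful response body.
// axios is only loaded when no replacement is passed in.
async function fetchHtml(url, { axios = require('axios'), userAgents = USER_AGENTS } = {}) {
  let lastError = new Error('no User-Agent strings supplied');
  for (const userAgent of userAgents) {
    try {
      const response = await axios.get(url, requestOptions(userAgent));
      return response.data;
    } catch (err) {
      lastError = err; // blocked or timed out: try the next User-Agent
    }
  }
  throw lastError;
}
```

If every attempt fails, `fetchHtml` rejects with the last error, which a route handler could turn into an error response rather than leaving the request open.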
- The main page displays scraped articles from EE Times.
- Pressing the Delete button of an article will delete that article.
- The Scrape button on the navbar at the top will fetch and (re-)populate articles.
- Pressing the Comment button of an article will take a user to an individual article page with its comments (if any exist). A user can post a comment using the Post Comment form at the bottom of the page.
- A comment can be deleted by pressing the Delete button for that comment.
- The Commented button on the navbar will display only articles that have been commented on.
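The buttons described above map naturally onto a small set of REST-style routes. The sketch below shows one possible shape, assuming hypothetical route paths and handler names and using an in-memory `Map` in place of the MongoDB collections so that it runs on its own. The repository's actual routes and models may differ.

```javascript
// In-memory stand-in for the database (assumption: the real app uses Mongoose models).
const articles = new Map();
let nextCommentId = 1;

const handlers = {
  // GET /api/articles?commented=true -> all articles, or only commented ones
  list(req, res) {
    const all = [...articles.values()];
    const onlyCommented = req.query.commented === 'true';
    res.json(onlyCommented ? all.filter((a) => a.comments.length > 0) : all);
  },
  // DELETE /api/articles/:id -> Delete button on an article
  removeArticle(req, res) {
    res.sendStatus(articles.delete(req.params.id) ? 204 : 404);
  },
  // POST /api/articles/:id/comments -> Post Comment form (body: { text })
  addComment(req, res) {
    const article = articles.get(req.params.id);
    if (!article) return res.sendStatus(404);
    // New comments are appended at the end of the list.
    article.comments.push({ id: String(nextCommentId++), text: req.body.text });
    return res.status(201).json(article);
  },
  // DELETE /api/articles/:id/comments/:commentId -> Delete button on a comment
  removeComment(req, res) {
    const article = articles.get(req.params.id);
    if (!article) return res.sendStatus(404);
    const before = article.comments.length;
    article.comments = article.comments.filter((c) => c.id !== req.params.commentId);
    return res.sendStatus(article.comments.length < before ? 204 : 404);
  },
};

// Registers the handlers on an Express app passed in by the caller.
function mountRoutes(app) {
  app.get('/api/articles', handlers.list);
  app.delete('/api/articles/:id', handlers.removeArticle);
  app.post('/api/articles/:id/comments', handlers.addComment);
  app.delete('/api/articles/:id/comments/:commentId', handlers.removeComment);
}
```

To try it with Express, create an app, enable JSON body parsing with `app.use(express.json())`, and call `mountRoutes(app)` before `app.listen`.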