🧹 Step 3 - Scrape the Products

How to manipulate data with JavaScript on the server side

🎯 Objective

Scrape products with Node.js, using JavaScript as a server-side scripting language to manipulate arrays, objects, functions...

🏗 Prerequisites

  1. Be sure to have a clean working copy.

This means that you should not have any uncommitted local changes.

cd /path/to/workspace/clear-fashion
❯ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
  2. Pull the master branch to update your local copy with the latest remote changes
❯ git remote add upstream git@github.com:92bondstreet/clear-fashion.git
## or ❯ git remote add upstream https://github.com/92bondstreet/clear-fashion
❯ git fetch upstream
❯ git pull upstream master
  3. Check the terminal output of the command node sandbox.js
cd /path/to/workspace/clear-fashion/server
## install dependencies
❯ yarn
## or ❯ npm install
❯ node sandbox.js

  4. If nothing happens or errors occur, check your Node.js installation (from Theme 2)

📱 How to scrape with Node.js? One way to do it

Let's try to scrape products from the e-shop brand Dedicated.

Step 1. No code, Investigation first

  1. Browse the website
  2. How does the e-shop https://www.dedicatedbrand.com/en/ work?
  3. How can I access the different product pages?
  4. What properties are given for a Product: name, price, category, link...?
  5. Check how you can get the list of Products: the web page itself, an API, etc. (inspect the network activity, with Chrome DevTools for instance, in any browser)
  6. Define the JSON object representation for a Product
  7. ...
  8. ...
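
Item 6 above asks for a JSON representation of a Product. One possible shape is sketched below, purely as an illustration: the field names follow the properties listed in item 4, and every value is made up rather than taken from the Dedicated website.

```javascript
// Hypothetical Product representation: the field names follow the properties
// listed above (name, price, category, link); all values are made up.
const product = {
  name: 'Example T-shirt',
  price: 39,
  category: 'men',
  link: 'https://example.com/products/example-t-shirt'
};

// Serialize with a 2-space indentation to inspect the JSON form
const json = JSON.stringify(product, null, 2);

console.log(json);
```

Keeping price as a number (rather than a string with a currency symbol) should make later sorting and filtering easier.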

Step 2. Server-side with Node.js

Create a module called dedicatedbrand that returns the list of Products for a given Dedicated page URL.

Example of page to scrape: https://www.dedicatedbrand.com/en/men/news

const dedicatedbrand = require('dedicatedbrand');

const products = dedicatedbrand.scrape('https://www.dedicatedbrand.com/en/men/news');

products.forEach(product => {
  console.log(product.name);
})

📦 Suggested node modules

  • node-fetch - A light-weight module that brings Fetch API to Node.js.
  • cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server.
  • nodemon - Monitor for any changes in your node.js application and automatically restart the server - perfect for development

👕 A complete Scraping Example for dedicatedbrand.com

server/sources/dedicatedbrand.js contains a function to scrape a given Dedicated products page.

To start the example, run it with the node CLI or use the Makefile target:

cd /path/to/workspace/clear-fashion/server
❯ node sandbox.js
❯ node sandbox.js "https://www.dedicatedbrand.com/en/men/t-shirts"
## make sandbox
## ./node_modules/.bin/nodemon sandbox.js

const fetch = require('node-fetch');
const cheerio = require('cheerio');

/**
 * Parse webpage e-shop
 * @param  {String} data - html response
 * @return {Array} products
 */
const parse = data => {
  const $ = cheerio.load(data);

  return $('.productList-container .productList')
    .map((i, element) => {
      const name = $(element)
        .find('.productList-title')
        .text()
        .trim()
        .replace(/\s/g, ' ');
      const price = parseInt(
        $(element)
          .find('.productList-price')
          .text(),
        10
      );

      return {name, price};
    })
    .get();
};

/**
 * Scrape all the products for a given url page
 * @param  {String} url - page to scrape
 * @return {Promise<Array|null>} products, or null if the request fails
 */
module.exports.scrape = async url => {
  try {
    const response = await fetch(url);

    if (response.ok) {
      const body = await response.text();

      return parse(body);
    }

    console.error(response);

    return null;
  } catch (error) {
    console.error(error);
    return null;
  }
};
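
Because scrape is an async function, a caller has to wait for its result before iterating over the products. Below is a minimal sketch of what such a caller might look like. The actual sandbox.js is not shown here, and the `run` helper and the `fakeScraper` stand-in are hypothetical names, used so that the snippet runs without network access.

```javascript
// Hypothetical caller: `scraper` is any object exposing an async scrape(url),
// such as the module above. It is passed in so the sketch can run offline.
const run = async (scraper, url) => {
  const products = await scraper.scrape(url);

  // scrape() returns null on failure, so fall back to an empty list
  if (!products) {
    return [];
  }

  products.forEach(product => {
    console.log(`${product.name}: ${product.price}`);
  });

  return products;
};

// Stand-in scraper with made-up data, used instead of the real module
const fakeScraper = {
  scrape: async () => [{name: 'Example T-shirt', price: 39}]
};

// The url can come from the command line, as in `node sandbox.js "<url>"`
const url = process.argv[2] || 'https://www.dedicatedbrand.com/en/men/news';

run(fakeScraper, url);
```

To try it against the real page, the stand-in could be replaced with the module from server/sources/dedicatedbrand.js.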

👩‍💻 Just tell me what to do

  1. Scrape Products for the 3 Brands defined by the json file ../server/brands.json

  2. Store the list into a JSON file

  3. Commit your modification

cd /path/to/workspace/clear-fashion
❯ git add -A && git commit -m "feat(shop): scrape new products"

(Why follow a commit message convention?)

  4. Commit early, commit often
  5. Don't forget to push before the end of the workshop
❯ git push origin master

Note: if you get an authentication error, add your SSH key to your GitHub profile.

  6. If you need help with git commands, read git - the simple guide
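
The two steps "scrape Products for the 3 Brands" and "store the list into a JSON file" can be sketched as follows. The structure of brands.json is not shown here, so the sketch assumes a hypothetical array of `{brand, url}` entries and a hypothetical `scrapers` lookup keyed by brand name. Stand-in scrapers with made-up data replace the real modules, and the output goes to a temporary directory.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical lookup: one scraper per brand, each exposing async scrape(url).
// Stand-ins with made-up data replace real modules so the sketch runs offline.
const scrapers = {
  dedicated: {scrape: async () => [{name: 'Example T-shirt', price: 39}]},
  other: {scrape: async () => null} // simulates a failed scrape
};

// Assumed shape for the brand list; the real brands.json may differ
const brands = [
  {brand: 'dedicated', url: 'https://www.dedicatedbrand.com/en/men/news'},
  {brand: 'other', url: 'https://example.com/products'}
];

// Scrape every brand in parallel and tag each product with its brand
const scrapeAll = async list => {
  const results = await Promise.all(
    list.map(async ({brand, url}) => {
      const products = (await scrapers[brand].scrape(url)) || [];

      return products.map(product => ({...product, brand}));
    })
  );

  return results.flat();
};

// Write the products to a JSON file and return how many were stored
const store = (products, file) => {
  fs.writeFileSync(file, JSON.stringify(products, null, 2));

  return products.length;
};

const file = path.join(os.tmpdir(), 'products.json');

const done = scrapeAll(brands).then(products => store(products, file));
```

Treating a null result as an empty list means one failing brand does not prevent the others from being stored.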

🛣️ Related Theme and courses