Web Scraping with Scrapy and MongoDB running on Docker

Have you ever wanted to know when a movie is playing at the cinema, but been put off by having to search the internet for it? In this workshop, we will collect our own cinema program by scraping the web with Scrapy. The data will be stored in a MongoDB database, and different queries will allow us to retrieve any information we need. Finally, we will dockerize the application so that we can run it on any platform.

Requirements

Technical:

  • Python 3.6+
  • An IDE, e.g. Spyder or PyCharm, or a decent code editor like Atom or Visual Studio Code. You can follow this guide to get Python, Anaconda environments, and an IDE set up.
  • Docker Desktop or Docker Toolbox, whichever works on your system; this post can help you decide. Follow the Docker documentation to install Docker and verify your installation.

Background knowledge:

  • Basic Python
  • Helpful but not required: basic SQL knowledge and some familiarity with Docker

Workshop

Introduction

This workshop consists of looking at the different components of my application for scraping the Berlin cinema program, writing the data to a NoSQL database, and exposing backend API endpoints for querying the data.

This application still needs lots of improvement, but it serves to illustrate the different aspects.
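
At a high level, the application is a set of containers working together: the Scrapy scraper, the MongoDB database, a database client, and the backend API. A docker-compose file for such a setup could look roughly like the sketch below; the service names, build paths, ports, and environment variables here are illustrative assumptions and are not taken from the repository.

```yaml
# Sketch of a possible docker-compose setup; names, paths, and ports are assumptions.
version: "3.8"

services:
  mongodb:
    image: mongo
    volumes:
      - mongo-data:/data/db

  mongo-express:            # web-based MongoDB client for inspecting the scraped data
    image: mongo-express
    ports:
      - "8081:8081"
    environment:
      ME_CONFIG_MONGODB_SERVER: mongodb
    depends_on:
      - mongodb

  scraper:                  # the Scrapy project, run on demand or on a schedule
    build: ./scraper
    environment:
      MONGO_URI: mongodb://mongodb:27017
    depends_on:
      - mongodb

  backend:                  # REST API that queries MongoDB
    build: ./backend
    ports:
      - "5000:5000"
    environment:
      MONGO_URI: mongodb://mongodb:27017
    depends_on:
      - mongodb

volumes:
  mongo-data:
```

With a setup along these lines, `docker-compose up` starts the database, the database client, and the backend, and the scraper container can be run whenever the cinema program should be refreshed.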

Based on this code, we will go through a series of hands-on tasks to understand:

  • ✓ In general, how such an application can be architected
  • ✓ Employing Scrapy to retrieve data from the internet
  • ✓ Establishing the connection from Scrapy to a database (a minimal spider-and-pipeline sketch follows this list)
  • ✓ Using a database client to inspect the stored data
  • ✓ Implementing backend REST API endpoints that serve the stored data in the form users need
  • ✓ Dockerizing all components: Scrapy, database, client, and backend
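
To make the Scrapy-related points more concrete, here is a minimal, simplified sketch of a spider and an item pipeline that writes scraped items to MongoDB. The URL, CSS selectors, item fields, and collection name are placeholders for illustration and are not the actual code of the workshop application.

```python
# Minimal sketch of a Scrapy spider plus a MongoDB item pipeline.
# URL, selectors, field names, and collection name are placeholders.
import pymongo
import scrapy


class CinemaSpider(scrapy.Spider):
    name = "cinema"
    start_urls = ["https://www.example.com/berlin-cinema-program"]  # placeholder URL

    def parse(self, response):
        # Yield one item per screening found on the page.
        for screening in response.css("div.screening"):
            yield {
                "movie": screening.css("h2::text").get(),
                "cinema": screening.css(".cinema::text").get(),
                "showtime": screening.css("time::attr(datetime)").get(),
            }


class MongoPipeline:
    """Stores every scraped item in a MongoDB collection."""

    def __init__(self, mongo_uri="mongodb://localhost:27017", db_name="cinema"):
        self.mongo_uri = mongo_uri
        self.db_name = db_name

    def open_spider(self, spider):
        # Open one client per crawl and reuse it for all items.
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db["screenings"].insert_one(dict(item))
        return item
```

In a real Scrapy project, the pipeline is activated through the ITEM_PIPELINES setting in settings.py, and the spider is started with `scrapy crawl cinema`.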

Hands-on

  1. Scrape the cinema program
  2. Write scraped data to database and connect to it with a db client
  3. Call REST API endpoints to retrieve information (a few example requests are sketched below)
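
As a preview of the last task, example calls could look like the following. The base URL, endpoint paths, parameters, and field names are hypothetical and only illustrate the idea of querying the backend; they are not the application's actual API.

```python
# Example queries against hypothetical backend endpoints; paths, parameters,
# and field names are illustrative, not the application's actual API.
import requests

BASE_URL = "http://localhost:5000"  # assumed host port of the dockerized backend

# All screenings on a given date (hypothetical endpoint and parameter)
response = requests.get(f"{BASE_URL}/screenings", params={"date": "2021-01-01"})
response.raise_for_status()
for screening in response.json():
    print(screening["movie"], screening["cinema"], screening["showtime"])

# Screenings for a single movie (hypothetical endpoint)
response = requests.get(f"{BASE_URL}/movies/Casablanca/screenings")
print(response.json())
```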

Concluding remarks.