🚧 ongoing work 🚧 We are constructing a knowledge graph of the movies shown in Berlin cinemas. We retrieve the currently showing movies from Berlin.de using Scrapy.
Start four containers:

- a container where the Scrapy job runs, and stops when finished.
- a MongoDB database
- the Nosqlclient (formerly mongoclient)
- our Flask-RESTPlus backend

```bash
docker-compose build
docker-compose up
```
There are two alternatives for storing the data: 1) write to the MongoDB database, or 2) write to a JSON file.
Retrieve the currently playing movies (the pipeline specified in `pipelines.py` inserts the scraped data into MongoDB):

```bash
cd scrapy/kinoprogramm
scrapy crawl kinoprogramm
```
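For reference, a minimal sketch of what such a MongoDB pipeline can look like (the class name and settings keys are illustrative; the actual implementation lives in `pipelines.py`):

```python
import pymongo


class MongoPipeline:
    """Illustrative Scrapy pipeline that inserts scraped items into MongoDB."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Connection settings would come from settings.py; defaults assumed here
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://root:12345@localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "kinoprogramm"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One document per scraped cinema, stored in the "kinos" collection
        self.db["kinos"].insert_one(dict(item))
        return item
```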
Open the mongo client on http://localhost:3300/ and connect to MongoDB by:

- Click on "Connect" (upper-right corner).
- Click on "Edit" on the default connection.
- Clear the connection URL. Under the "Connection" tab, set Database Name: `kinoprogramm`.
- On the "Authentication" tab, select `Scram-Sha-1` as Authentication Type, Username: `root`, Password: `12345`, and leave Authentication DB empty.
- Click on "Save", then click on "Connect".
See stored data under "Collections" -> "kinos".
Go to "Tools" -> "Shell" to write mongodb queries such as:
db.kinos.distinct( "shows.title" )
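The same query can be issued from Python with pymongo (a small sketch; the connection details are the defaults assumed above):

```python
from pymongo import MongoClient

# Credentials and database name as configured above
client = MongoClient("mongodb://root:12345@localhost:27017")
titles = client["kinoprogramm"]["kinos"].distinct("shows.title")
print(sorted(titles))
```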
You need Python 3.6+ and the dependencies listed in `requirements.txt`.
You can start the spider by just:

```bash
cd scrapy/kinoprogramm
scrapy crawl kinoprogramm -o ../data/kinoprogramm.json
```
Data will be written to the file specified with the `-o` parameter. Data will also be written to the MongoDB database, unless the file `pipelines.py` is adapted.
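For example, writing only to the JSON file amounts to deactivating the MongoDB pipeline in `settings.py` (the pipeline class path below is illustrative):

```python
# settings.py: comment out the MongoDB pipeline to write only to the JSON file
# (the exact class path depends on the project layout)
ITEM_PIPELINES = {
    # "kinoprogramm.pipelines.MongoPipeline": 300,
}
```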
We present two alternatives: 1) deploy to the Scrapy Cloud, or 2) deploy to AWS.
To deploy to the Scrapy Cloud:

- Sign up to Scrapy Cloud. There is a free plan (but scraping jobs cannot be scheduled).
- Create a new project.
- `cd` to `movies-knowledgegraph/scrapy`.
- Deploy by:

```bash
pip install shub
shub login
shub deploy <PROJECT_ID>
```
See the Scrapinghub Support Center and the Scrapinghub API Reference for further details.
Once deployed, the spider can be run as follows:

- Retrieve the API key.
- Run the spider by:

```bash
curl -u <API_KEY>: https://app.scrapinghub.com/api/run.json -d project=<PROJECT_ID> -d spider=kinoprogramm
```

- Scraped data can be retrieved by:

```bash
curl -u <API_KEY>: https://storage.scrapinghub.com/items/<PROJECT_ID>[/<SPIDER_ID>][/<JOB_ID>][/<ITEM_NUMBER>][/<FIELD_NAME>]
```
Example: retrieve the contact field of the first cinema (item 0) from job 6 of spider 1 in project 417389:

```bash
curl -u <API_KEY>: https://storage.scrapinghub.com/items/417389/1/6/0/contact
```
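The same calls can be made from Python (a sketch using `requests`; the API key and project id are placeholders):

```python
import requests

API_KEY = "<API_KEY>"
PROJECT_ID = "<PROJECT_ID>"

# Start the kinoprogramm spider (equivalent to the curl call above)
run = requests.post(
    "https://app.scrapinghub.com/api/run.json",
    auth=(API_KEY, ""),  # curl -u <API_KEY>: sends an empty password
    data={"project": PROJECT_ID, "spider": "kinoprogramm"},
)
print(run.json())

# Retrieve the first item of job 6 of spider 1 (path as in the curl example)
items = requests.get(
    f"https://storage.scrapinghub.com/items/{PROJECT_ID}/1/6/0",
    auth=(API_KEY, ""),
)
print(items.text)
```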
We push our Scrapy Docker image to AWS ECR and start the scraping task (manually, or event-based) with AWS Fargate, which writes the resulting JSON files to a bucket in AWS S3. See deployment.
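For illustration, the upload step from the container boils down to something like the following boto3 call (bucket and file names are placeholders; the actual setup is described in the deployment document):

```python
import boto3

# Assumes AWS credentials are provided by the Fargate task role
s3 = boto3.client("s3")

# Upload the scraped JSON to the target bucket (names are placeholders)
s3.upload_file(
    Filename="data/kinoprogramm.json",
    Bucket="<BUCKET_NAME>",
    Key="kinoprogramm.json",
)
```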
You can access the Swagger UI of the Flask-RESTPlus backend under http://localhost:8001/. Here, you can use the different endpoints to retrieve data from the MongoDB database.
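A minimal sketch of what such an endpoint can look like (route and resource names are illustrative; the actual backend may differ):

```python
from flask import Flask
from flask_restplus import Api, Resource
from pymongo import MongoClient

app = Flask(__name__)
api = Api(app, title="kinoprogramm API")  # serves the Swagger UI at /

# Connection details as assumed above; inside docker-compose the host
# would be the name of the Mongo service instead of localhost
db = MongoClient("mongodb://root:12345@localhost:27017")["kinoprogramm"]


@api.route("/titles")
class Titles(Resource):
    def get(self):
        # Distinct movie titles across all cinemas
        return {"titles": db["kinos"].distinct("shows.title")}


if __name__ == "__main__":
    app.run(port=8001)
```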
After installing the dependencies from `requirements_tests.txt`, the tests for Scrapy can be run by:

```bash
cd scrapy/kinoprogramm
python -m pytest tests/
```
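A sketch of what a test in that folder can look like: it builds an offline response so no network access is needed. The spider's import path and the expectation that an empty page yields no items are assumptions.

```python
from scrapy.http import HtmlResponse, Request

# Assumed import path; adapt to the actual module layout
from kinoprogramm.spiders.kinoprogramm import KinoprogrammSpider


def fake_response(html, url="https://www.berlin.de/kino/"):
    """Build an offline HtmlResponse so the test needs no network access."""
    return HtmlResponse(url=url, request=Request(url=url), body=html, encoding="utf-8")


def test_parse_empty_page_yields_no_items():
    spider = KinoprogrammSpider()
    response = fake_response("<html><body></body></html>")
    # Assumption: a page without cinema entries produces no items
    assert list(spider.parse(response)) == []
```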