🚧 ongoing work 🚧 We are constructing a knowledge graph of the movies shown in Berlin cinemas. We retrieve the currently showing movies from Berlin.de using Scrapy.
Start four containers:

- a container where the Scrapy job runs, and stops when finished.
- a MongoDB database
- the Nosqlclient (formerly mongoclient)
- our Flask-RESTPlus backend

```bash
docker-compose build
docker-compose up
```
There are two alternatives for storing the data: 1) write to the MongoDB database, or 2) write to a JSON file.
Retrieve the currently playing movies (the pipeline specified in `pipelines.py` inserts the scraped data into MongoDB):

```bash
cd scrapy/kinoprogramm
scrapy crawl kinoprogramm
```
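For reference, a minimal sketch of what such a MongoDB pipeline can look like (the class name and settings keys are illustrative; the actual implementation lives in `pipelines.py`):

```python
import pymongo


class MongoPipeline:
    """Illustrative Scrapy pipeline that inserts scraped items into MongoDB."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Connection settings would come from settings.py; defaults assumed here
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://root:12345@localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "kinoprogramm"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One document per scraped cinema, stored in the "kinos" collection
        self.db["kinos"].insert_one(dict(item))
        return item
```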
Open the mongo client on http://localhost:3300/ and connect to MongoDB by:

- Click on "Connect" (upper-right corner).
- Click on "Edit" on the default connection.
- Clear the connection URL. Under the "Connection" tab, set Database Name: `kinoprogramm`.
- On the "Authentication" tab, select `Scram-Sha-1` as Authentication Type, Username: `root`, Password: `12345`, and leave Authentication DB empty.
- Click on "Save", then click on "Connect".
See stored data under "Collections" -> "kinos".
Go to "Tools" -> "Shell" to write mongodb queries such as:
db.kinos.distinct( "shows.title" )
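The same query can be issued from Python with pymongo (a small sketch; the connection details are the defaults assumed above):

```python
from pymongo import MongoClient

# Credentials and database name as configured above
client = MongoClient("mongodb://root:12345@localhost:27017")
titles = client["kinoprogramm"]["kinos"].distinct("shows.title")
print(sorted(titles))
```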
You need Python 3.6+ and the dependencies listed in `requirements.txt`.
You can start the spider by just:

```bash
cd scrapy/kinoprogramm
scrapy crawl kinoprogramm -o ../data/kinoprogramm.json
```
Data will be written to the file specified with the `-o` parameter. Data will also be written to the MongoDB database, unless the file `pipelines.py` is adapted.
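For example, writing only to the JSON file amounts to deactivating the MongoDB pipeline in `settings.py` (the pipeline class path below is illustrative):

```python
# settings.py: comment out the MongoDB pipeline to write only to the JSON file
# (the exact class path depends on the project layout)
ITEM_PIPELINES = {
    # "kinoprogramm.pipelines.MongoPipeline": 300,
}
```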
We present two alternatives: 1) deploy to the Scrapy Cloud, or 2) deploy to AWS.
To deploy to the Scrapy Cloud:

- Sign up to Scrapy Cloud. There is a free plan (but scraping jobs cannot be scheduled).
- Create a new project.
- `cd` to `movies-knowledgegraph/scrapy`.
- Deploy by:

```bash
pip install shub
shub login
shub deploy <PROJECT_ID>
```
See the Scrapinghub Support Center and the Scrapinghub API Reference for further details.
Once deployed, the spider can be run as follows:

- Retrieve the API key.
- Run the spider by:

```bash
curl -u <API_KEY>: https://app.scrapinghub.com/api/run.json -d project=<PROJECT_ID> -d spider=kinoprogramm
```

- Scraped data can be retrieved by:

```bash
curl -u <API_KEY>: https://storage.scrapinghub.com/items/<PROJECT_ID>[/<SPIDER_ID>][/<JOB_ID>][/<ITEM_NUMBER>][/<FIELD_NAME>]
```
Example: retrieve the contact field of the first cinema (item 0) from job 6 of spider 1 in project 417389:

```bash
curl -u <API_KEY>: https://storage.scrapinghub.com/items/417389/1/6/0/contact
```
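The same calls can be made from Python (a sketch using `requests`; the API key and project id are placeholders):

```python
import requests

API_KEY = "<API_KEY>"
PROJECT_ID = "<PROJECT_ID>"

# Start the kinoprogramm spider (equivalent to the curl call above)
run = requests.post(
    "https://app.scrapinghub.com/api/run.json",
    auth=(API_KEY, ""),  # curl -u <API_KEY>: sends an empty password
    data={"project": PROJECT_ID, "spider": "kinoprogramm"},
)
print(run.json())

# Retrieve the first item of job 6 of spider 1 (path as in the curl example)
items = requests.get(
    f"https://storage.scrapinghub.com/items/{PROJECT_ID}/1/6/0",
    auth=(API_KEY, ""),
)
print(items.text)
```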
We push our Scrapy Docker image to AWS ECR and start the scraping task (manually, or event-based) with AWS Fargate, which writes the resulting JSON files to a bucket in AWS S3. See deployment.
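For illustration, the upload step from the container boils down to something like the following boto3 call (bucket and file names are placeholders; the actual setup is described in the deployment document):

```python
import boto3

# Assumes AWS credentials are provided by the Fargate task role
s3 = boto3.client("s3")

# Upload the scraped JSON to the target bucket (names are placeholders)
s3.upload_file(
    Filename="data/kinoprogramm.json",
    Bucket="<BUCKET_NAME>",
    Key="kinoprogramm.json",
)
```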
You can access the Swagger UI of the Flask-RESTPlus backend under http://localhost:8001/. Here, you can use the different endpoints to retrieve data from the MongoDB database.
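A minimal sketch of what such an endpoint can look like (route and resource names are illustrative; the actual backend may differ):

```python
from flask import Flask
from flask_restplus import Api, Resource
from pymongo import MongoClient

app = Flask(__name__)
api = Api(app, title="kinoprogramm API")  # serves the Swagger UI at /

# Connection details as assumed above; inside docker-compose the host
# would be the name of the Mongo service instead of localhost
db = MongoClient("mongodb://root:12345@localhost:27017")["kinoprogramm"]


@api.route("/titles")
class Titles(Resource):
    def get(self):
        # Distinct movie titles across all cinemas
        return {"titles": db["kinos"].distinct("shows.title")}


if __name__ == "__main__":
    app.run(port=8001)
```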
After installing the dependencies from `requirements_tests.txt`, the tests for Scrapy can be run by:

```bash
cd scrapy/kinoprogramm
python -m pytest tests/
```
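A sketch of what a test in that folder can look like: it builds an offline response so no network access is needed. The spider's import path and the expectation that an empty page yields no items are assumptions.

```python
from scrapy.http import HtmlResponse, Request

# Assumed import path; adapt to the actual module layout
from kinoprogramm.spiders.kinoprogramm import KinoprogrammSpider


def fake_response(html, url="https://www.berlin.de/kino/"):
    """Build an offline HtmlResponse so the test needs no network access."""
    return HtmlResponse(url=url, request=Request(url=url), body=html, encoding="utf-8")


def test_parse_empty_page_yields_no_items():
    spider = KinoprogrammSpider()
    response = fake_response("<html><body></body></html>")
    # Assumption: a page without cinema entries produces no items
    assert list(spider.parse(response)) == []
```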