Assuming we have gone through all steps of the AWS Deployment Walkthrough (see below), we can start the scraping job ourselves, instead of waiting for the scheduled event, with:
aws ecs run-task --cli-input-json file://deployment/ecs/run-task/kinoprogramm-scraper.json
When the task has completed, JSON files should have been written to S3. See the Athena queries for accessing the scraped data.
Alternatively, for debugging purposes, the scraping job can be run locally; by providing the AWS credentials, the scraped JSON files will still be written to S3.
Assuming you have built the kinoprogramm-scraper:latest image:
docker run -e AWS_ACCESS_KEY_ID=<your AWS access key ID> -e AWS_SECRET_ACCESS_KEY=<your AWS secret access key> kinoprogramm-scraper:latest
To start over with the complete deployment in AWS, see the next section (AWS Deployment Walkthrough).
If we have only introduced a modification in the code (e.g. needed to adapt the scraping to changes in the website), then we just need to push our new Docker image:
→ Retrieve an authentication token and authenticate your Docker client to your registry:
aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 287094319766.dkr.ecr.eu-central-1.amazonaws.com
→ After successful authentication, build, tag, and push the docker image:
docker build -t kinoprogramm-scraper scrapy
docker tag kinoprogramm-scraper:latest 287094319766.dkr.ecr.eu-central-1.amazonaws.com/kinoprogramm-scraper:dev
docker push 287094319766.dkr.ecr.eu-central-1.amazonaws.com/kinoprogramm-scraper:dev
No need to update the event rule or event target; the next scheduled event will trigger the job using the newly pushed image.
→ Verify everything runs as expected by starting the task manually:
aws ecs run-task --cli-input-json file://deployment/ecs/run-task/kinoprogramm-scraper.json
When the task has completed, JSON files should have been written to S3. See the Athena queries for accessing the scraped data.
These are all the steps for the deployment in my personal AWS account, with ID 287094319766, where my username is Laura.
Basically, we deploy a Docker container with our scraper to Amazon Elastic Container Service (ECS) and create a task to run the scraping job. We can start this task anytime from the command line, or establish a rule to trigger the task.
Assuming that:
- The AWS CLI version 2 has already been installed and configured with the corresponding Access key ID, Secret access key, and region for the AWS account to deploy to.
- The user has at least the following policies: AmazonS3FullAccess, AmazonECS_FullAccess, CloudWatchEventsFullAccess.
→ Create bucket kinoprogramm-scraper and path berlin-de, where we will store the scraped data:
aws s3 mb s3://kinoprogramm-scraper --region eu-central-1 --endpoint-url https://s3.eu-central-1.amazonaws.com
aws s3api put-object --bucket kinoprogramm-scraper --key berlin-de
→ Create policy ecrDeveloper to allow users access to ECR:
aws iam create-policy --policy-name ecrDeveloper --policy-document file://deployment/policies/ecr_developer.json
"Arn": "arn:aws:iam::287094319766:policy/ecrDeveloper"
→ Attach created policy to user (in this case, my username is Laura):
aws iam attach-user-policy --user-name Laura --policy-arn arn:aws:iam::287094319766:policy/ecrDeveloper
→ Create docker registry repository:
aws ecr create-repository --repository-name kinoprogramm-scraper
"repositoryArn": "arn:aws:ecr:eu-central-1:287094319766:repository/kinoprogramm-scraper"
"repositoryUri": "287094319766.dkr.ecr.eu-central-1.amazonaws.com/kinoprogramm-scraper"
To push our scraper image to this repository:
→ Retrieve an authentication token and authenticate your Docker client to your registry:
aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 287094319766.dkr.ecr.eu-central-1.amazonaws.com
→ After successful authentication, build, tag, and push the docker image:
docker build -t kinoprogramm-scraper scrapy
docker tag kinoprogramm-scraper:latest 287094319766.dkr.ecr.eu-central-1.amazonaws.com/kinoprogramm-scraper:dev
docker push 287094319766.dkr.ecr.eu-central-1.amazonaws.com/kinoprogramm-scraper:dev
→ Create a cluster to run scraper task:
aws ecs create-cluster --cluster-name scraperCluster
"clusterArn": "arn:aws:ecs:eu-central-1:287094319766:cluster/scraperCluster"
→ Create policy writeScraped for write access to our bucket in S3:
aws iam create-policy --policy-name writeScraped --policy-document file://deployment/policies/write_scraped.json
"Arn": "arn:aws:iam::287094319766:policy/writeScraped"
→ Create role KinoprogrammScraperRole to execute the scraper, and attach policies writeScraped and AmazonECSTaskExecutionRolePolicy to this role (a sketch of the trust policy file is shown after this step):
aws iam create-role --role-name KinoprogrammScraperRole --assume-role-policy-document file://deployment/policies/kinoprogramm_scraper_role.json
"Arn": "arn:aws:iam::287094319766:role/KinoprogrammScraperRole"
aws iam attach-role-policy --role-name KinoprogrammScraperRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
aws iam attach-role-policy --role-name KinoprogrammScraperRole --policy-arn arn:aws:iam::287094319766:policy/writeScraped
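The contents of deployment/policies/kinoprogramm_scraper_role.json are not shown here; since the role is assumed by ECS tasks, a minimal sketch of the trust (assume-role) policy would be:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}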
→ Create generic role ecsTaskExecutionRole for executing ECS tasks, and attach policy AmazonECSTaskExecutionRolePolicy to this role:
aws iam create-role --role-name ecsTaskExecutionRole --assume-role-policy-document file://deployment/policies/ecs_task_execution_role.json
"Arn": "arn:aws:iam::287094319766:role/ecsTaskExecutionRole"
aws iam attach-role-policy --role-name ecsTaskExecutionRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
→ Create generic role ecsEventsRole for starting tasks from events, and attach policy AmazonEC2ContainerServiceEventsRole to this role:
aws iam create-role --role-name ecsEventsRole --assume-role-policy-document file://deployment/policies/ecs_events_role.json
"Arn": "arn:aws:iam::287094319766:role/ecsEventsRole"
aws iam attach-role-policy --role-name ecsEventsRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceEventsRole
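The trust policy documents deployment/policies/ecs_task_execution_role.json and deployment/policies/ecs_events_role.json presumably follow the same pattern as the sketch above, with the Principal service set to ecs-tasks.amazonaws.com and events.amazonaws.com, respectively.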
→ Create the log group for the scraper:
aws logs create-log-group --log-group-name scraperLogs
→ Register task definition for scraper:
aws ecs register-task-definition --cli-input-json file://deployment/ecs/register-task-definition/kinoprogramm-scraper.json
"taskDefinitionArn": "arn:aws:ecs:eu-central-1:287094319766:task-definition/KinoprogrammScraper:1"
→ Put rule for scraper (e.g. scraping once a day):
aws events put-rule --cli-input-json file://deployment/events/put-rule/kinoprogramm-scraper.json
"RuleArn": "arn:aws:events:eu-central-1:287094319766:rule/kinoprogrammScraperRule"
→ Assign the target to be triggered by the event, using the default Security Group and the default Subnet corresponding to Availability Zone eu-central-1a:
aws events put-targets --cli-input-json file://deployment/events/put-targets/kinoprogramm-scraper.json
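A minimal sketch of deployment/events/put-targets/kinoprogramm-scraper.json; the subnet and security group IDs are placeholders for your default Subnet and Security Group:

{
  "Rule": "kinoprogrammScraperRule",
  "Targets": [
    {
      "Id": "kinoprogramm-scraper",
      "Arn": "arn:aws:ecs:eu-central-1:287094319766:cluster/scraperCluster",
      "RoleArn": "arn:aws:iam::287094319766:role/ecsEventsRole",
      "EcsParameters": {
        "TaskDefinitionArn": "arn:aws:ecs:eu-central-1:287094319766:task-definition/KinoprogrammScraper:1",
        "TaskCount": 1,
        "LaunchType": "FARGATE",
        "NetworkConfiguration": {
          "awsvpcConfiguration": {
            "Subnets": ["subnet-xxxxxxxx"],
            "SecurityGroups": ["sg-xxxxxxxx"],
            "AssignPublicIp": "ENABLED"
          }
        }
      }
    }
  ]
}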
→ Activate rule:
aws events enable-rule --name kinoprogrammScraperRule
We do not need to wait for the scheduled trigger; we can also initiate the task ourselves, assuming all steps above have been completed.
→ Run the task, using the default Security Group and the default Subnet corresponding to Availability Zone eu-central-1a:
aws ecs run-task --cli-input-json file://deployment/ecs/run-task/kinoprogramm-scraper.json
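A minimal sketch of deployment/ecs/run-task/kinoprogramm-scraper.json; again, the subnet and security group IDs are placeholders:

{
  "cluster": "scraperCluster",
  "taskDefinition": "KinoprogrammScraper",
  "count": 1,
  "launchType": "FARGATE",
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": ["subnet-xxxxxxxx"],
      "securityGroups": ["sg-xxxxxxxx"],
      "assignPublicIp": "ENABLED"
    }
  }
}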
→ See task ARN:
aws ecs list-tasks --cluster scraperCluster
When the task has completed, JSON files should have been written to S3. See the Athena queries for accessing the scraped data.
→ Install npm.
→ Install the serverless package:
npm install -g serverless
→ Create a serverless project:
cd deployment
serverless
In the interactive prompt, choose the AWS Python template and set the project name to jsonEmail.
The serverless.yml, handler.py, and .gitignore files are created under the new folders (one for each of the created projects).
→ Edit the files: /jsonEmail/serverless.yml, /jsonEmail/handler.py.
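The exact contents of these two files are project-specific and not reproduced here. A minimal sketch of serverless.yml, assuming the function is simply run on a schedule (the function name, runtime version, and schedule are assumptions; the function additionally needs IAM permissions for s3:GetObject and ses:SendRawEmail, configured according to your Serverless Framework version):

service: jsonEmail

provider:
  name: aws
  runtime: python3.8
  region: eu-central-1

functions:
  sendJsonEmail:
    handler: handler.send_json_email
    events:
      - schedule: rate(1 day)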
Documentation: send email with boto3, add attachment
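A minimal sketch of what handler.py could look like, fetching the scraped JSON from S3 and sending it as an email attachment via SES with boto3 (the function name, object key, and email addresses are placeholders):

import json

import boto3
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

BUCKET = "kinoprogramm-scraper"
KEY = "berlin-de/kinoprogramm.json"  # placeholder object key
SENDER = "email_1@address.com"       # must be verified in SES
RECIPIENTS = ["email_2@address.com"]  # must be verified in SES (sandbox)

s3 = boto3.client("s3")
ses = boto3.client("ses", region_name="eu-central-1")


def send_json_email(event, context):
    # Fetch the scraped JSON from S3
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()

    # Build a multipart message with the JSON file attached
    msg = MIMEMultipart()
    msg["Subject"] = "Kinoprogramm scrape"
    msg["From"] = SENDER
    msg["To"] = ", ".join(RECIPIENTS)
    msg.attach(MIMEText("Latest scraped Kinoprogramm attached.", "plain"))
    attachment = MIMEApplication(body)
    attachment.add_header("Content-Disposition", "attachment", filename="kinoprogramm.json")
    msg.attach(attachment)

    # Send the raw MIME message via SES
    response = ses.send_raw_email(
        Source=SENDER,
        Destinations=RECIPIENTS,
        RawMessage={"Data": msg.as_string()},
    )
    return {"statusCode": 200, "body": json.dumps(response["MessageId"])}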
→ Create a policy that allows working with SES, e.g. verifying email addresses:
aws iam create-policy --policy-name emailSender --policy-document file://deployment/policies/email_sender.json
"Arn": "arn:aws:iam::287094319766:policy/emailSender"
→ Attach created policy to user:
aws iam attach-user-policy --user-name Laura --policy-arn arn:aws:iam::287094319766:policy/emailSender
→ Verify the sender and recipient email addresses:
aws ses verify-email-identity --email-address email_1@address.com
aws ses verify-email-identity --email-address email_2@address.com
A verification message is sent to the inbox of the email address to be verified; click on the provided link to proceed. Then check that the email address has been verified:
aws ses list-identities
aws ses get-identity-verification-attributes --identities "email_1@address.com"
→ Deploy. This will bundle up and deploy the Lambda function:
cd jsonEmail
sls deploy
To remove deployment:
sls remove