This is a small project to aggregate news feeds (via Google Alerts) for my personal use at neal.news.
neal.news is hosted on AWS. A rough outline is:
- Email received by SES, stored to S3.
- Lambda function fetches the email, extracts content, applies a template, pushes HTML back to S3, and queues a scoring job.
- S3 serves static site.
- Go to create a test event for the Lambda function.
- Fill in the messageId below. It is the only field actually used; the other mail headers are pulled from the file with that ID on S3.
{
  "Records": [
    {
      "eventSource": "aws:ses",
      "eventVersion": "1.0",
      "ses": {
        "mail": {
          "timestamp": "2020-07-30T14:20:54.877Z",
          "source": "3RdciXxQKAH4iqqingcngtvu-pqtgrn0iqqing.eqo@alerts.bounces.google.com",
          "messageId": "qlkjb610bfq4c4nlm99u2plkh9li82uo4iu2s401",
          "destination": [
            "foo@neal.news"
          ]
        }
      }
    }
  ]
}
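
As a rough sketch of how a handler might consume an event like the one above: it reads only the messageId from each record, then parses the raw email to pull out the HTML part. The function names `message_ids` and `html_body` are hypothetical, the S3 fetch is left out, and this is not the project's actual code.

```python
from email import message_from_bytes, policy


def message_ids(event):
    """Return the SES messageId of every record in a Lambda test event."""
    return [r["ses"]["mail"]["messageId"] for r in event.get("Records", [])]


def html_body(raw_email):
    """Parse raw email bytes and return the HTML part, or None if there is none."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    part = msg.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else None
```

In this sketch the messageId would serve as the lookup key for the stored email on S3, which is why the other fields of the test event can be left as placeholders.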
- Clicked items are logged back to CloudWatch events (via an API Gateway).
- Those historical clicks are used to train a classifier.
  - Currently using BERT + XGBoost.
  - Retrained on a SageMaker GPU spot instance every Monday.
- That model is used to re-rank incoming news.
  - Reads and writes index.html on S3.
  - 10% of items are scored randomly to help mitigate overfitting.
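
The random-scoring step can be illustrated with a minimal sketch. It assumes "scored randomly" means that the model's score is replaced by a uniform random number for roughly 10% of items; the `rerank` function, its `explore` parameter, and the `score_fn` callback are hypothetical names, not taken from the project.

```python
import random


def rerank(items, score_fn, explore=0.10, rng=None):
    """Sort items by descending score; a fraction get a random score instead."""
    rng = rng or random.Random()
    scored = []
    for item in items:
        if rng.random() < explore:
            score = rng.random()    # exploration: ignore the model for this item
        else:
            score = score_fn(item)  # otherwise use the model's predicted score
        scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored]
```

Randomly scored items can surface even when the model would have ranked them low, so later click logs are not limited to what the current model already favors.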