Job that runs every two hours to create a new batch of article recommendations, using the latest Snowplow data available.
.
├── cdk # infrastructure as code for this service
├── db # object-relational mappings to interact with the database
├── job # key steps of the job pipeline (data fetch, preprocessing, training, etc.)
├── lib # helpers to interact with LNL's AWS resources
├── sites # site-specific logic for each newsroom partner
└── tests # unit tests
Environment parameters are defined in env.json.
You can add a new secret parameter using AWS SSM.
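For example, a new secret can be stored as a SecureString parameter with the AWS CLI. The parameter name below is only an illustration; the actual naming convention depends on how env.json references SSM:

```
aws ssm put-parameter \
  --name "/article-rec/training/dev/example-secret" \
  --value "super-secret-value" \
  --type SecureString
```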
We use Poetry to manage dependencies; it also pins dependency and Python versions. We use pre-commit with hooks for isort, black, and flake8 to keep code style consistent and readable. Note that this means code that doesn't meet the rules will fail to commit until it is fixed.
We also use mypy for static type checking. This can be run manually, and the CI runs it on PRs.
- Install Poetry.
- Run poetry install --no-root
- Make sure the virtual environment is active, then run pre-commit install
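Putting those steps together, a typical first-time setup might look like the following (a sketch that assumes the official Poetry installer and a POSIX shell; adapt as needed for your platform):

```
curl -sSL https://install.python-poetry.org | python3 -   # install Poetry
poetry install --no-root                                  # create the virtualenv and install dependencies
poetry shell                                              # activate the virtualenv (or prefix commands with `poetry run`)
pre-commit install                                        # register the git hooks
```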
You're all set up! Your local environment should include all dependencies, including dev dependencies like black.
Dependency versions for the local environment are pinned with Poetry via the poetry.lock file. The containerized code, however, still pulls its dependencies from requirements.txt. Any containerized dependency requirements therefore need to be updated in pyproject.toml and then exported to requirements.txt.
To manually run isort, black, and flake8 all in one go, simply run pre-commit run --all-files.
To manually run mypy, simply run mypy
from the root directory of the project. It will use the default configuration
specified in the mypy.ini file.
To update dependencies in your local environment, make changes to the pyproject.toml file, then run poetry update.
To update requirements.txt for the container, run poetry export -o requirements.txt --without-hashes.
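As a combined example, updating a dependency for both the local environment and the container might look like this (the edit to pyproject.toml is whatever version change you need):

```
# edit pyproject.toml (or use `poetry add <package>`), then:
poetry update                                         # refresh poetry.lock and the local virtualenv
poetry export -o requirements.txt --without-hashes    # regenerate the container's requirements file
```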
- Build container from the Dockerfile
kar build
- Run the job
kar run
- Or, run bash in the container
kar run bash
- Build container from the Dockerfile
kar build
- Run unit tests
kar test
- Build container from the Dockerfile
kar build
- Run the backfill task for the data warehouse
kar backfill --start-date 2021-12-01 --days 10
- Or, run the backfill task for the article table
kar article-backfill --site texas-tribune --start-date 2021-12-01 --days 10
For dev deployment, run:
kar deploy
Each pull request merged into main triggers a new prod deployment.
Each log group contains a separate log stream for each client.
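For example, assuming the AWS CLI v2 is configured, a stream can be followed with aws logs tail; the log group name and filter below are placeholders, not the actual values used by this service:

```
aws logs tail "/ecs/article-rec-training-job" --follow --filter-pattern "texas-tribune"
```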
Hyperparameter tuning is supported to find the model parameters that optimize the mean reciprocal rank of the model over a holdout test set.
To run a tuning job, modify the PARAMS object for the site you plan to run a job on. An example is shown below:
PARAMS = {
    "hl": 15.0,
    "embedding_dim": 500,
    "epochs": 2,
    "tune": True,
    "tune_params": ["embedding_dim"],
    "tune_range": [[100, 600, 100]],
}
The tuner will grid search over every hyperparameter listed under the tune_params key, searching the range at the corresponding index under the tune_range key. The last value of each range is used as the step size.
The tuner outputs the best parameters it finds to the logs and then trains the model on those parameters.
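As a rough illustration of how those keys interact (a simplified sketch, not the repository's actual tuner code), the grid of candidate configurations could be expanded like this:

```python
from itertools import product

def expand_grid(params):
    """Yield one candidate PARAMS dict per combination of tuned values."""
    names = params["tune_params"]
    # Each entry in tune_range is [start, stop, step]; the last value is the step.
    value_lists = [list(range(start, stop, step)) for start, stop, step in params["tune_range"]]
    for combo in product(*value_lists):
        yield {**params, **dict(zip(names, combo))}

# With the PARAMS above, embedding_dim candidates are 100, 200, 300, 400, 500.
for candidate in expand_grid(PARAMS):
    print(candidate["embedding_dim"])
```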
- infrastructure: The database and ECS clusters are created here.
- article-rec-db: The relevant database migrations are defined and applied here.
- article-rec-api: Calls to the API created by this repository return article recommendations and model versions saved by the training pipeline. The API is used to surface recommendations on the front-end.
- snowplow-analytics: The analytics pipeline used to collect user clickstream data into S3 is defined in this repository.
- article-recommendations: The recommendations are displayed on WordPress NewsPack sites using the PHP widget defined in this repository.