
article-rec-training-job

Job that runs every two hours to create a new batch of article recommendations, using the latest Snowplow data available.

Directory Layout

.
├── cdk        # infrastructure as code for this service
├── db         # object-relational mappings to interact with the database
├── job        # key steps of the job pipeline (data fetch, preprocessing, training, etc...)
├── lib        # helpers to interact with lnl's aws resources
├── sites      # site-specific logic for each newsroom partner
└── tests      # unit tests

Environment

Environment parameters are defined in env.json.

You can add a new secret parameter using AWS SSM.

Development Tools

We use Poetry to manage dependencies and to pin dependency and Python versions. We also use pre-commit with hooks for isort, black, and flake8 to keep code style consistent and readable. Note that this means commits will fail until any code that violates these rules is fixed.

We also use mypy for static type checking. This can be run manually, and the CI runs it on PRs.

Setup

  1. Install Poetry.
  2. Run poetry install --no-root
  3. Activate the virtual environment
  4. Run pre-commit install

You're all set up! Your local environment now includes all dependencies, including dev dependencies like black; Poetry installs them from the poetry.lock file. The containerized code, however, still pulls its dependencies from requirements.txt, so any dependency change for the container must be made in pyproject.toml and then exported to requirements.txt.

Run Code Format and Linting

To manually run isort, black, and flake8 all in one go, simply run pre-commit run --all-files.

Run Static Type Checking

To manually run mypy, simply run mypy from the root directory of the project. It will use the default configuration specified in the mypy.ini file.

Update Dependencies

To update dependencies in your local environment, make changes to the pyproject.toml file then run poetry update. To update requirements.txt for the container, run poetry export -o requirements.txt --without-hashes.

Local Usage

  1. Build the container from the Dockerfile:
     kar build
  2. Run the job:
     kar run
  3. Or, run bash in the container:
     kar run bash

Running Tests

  1. Build the container from the Dockerfile:
     kar build
  2. Run the unit tests:
     kar test

Running Backfills

  1. Build the container from the Dockerfile:
     kar build
  2. Run the backfill task for the data warehouse:
     kar backfill --start-date 2021-12-01 --days 10
  3. Or, run the backfill task for the article table:
     kar article-backfill --site texas-tribune --start-date 2021-12-01 --days 10

Deploying

For dev deployment, run:

kar deploy

Each pull request merged to main triggers a new prod deployment.

Monitoring

Logs

Each log group contains a separate log stream for each client.

System Dashboards

Philadelphia Inquirer

Texas Tribune

Washington City Paper

Hyperparameter Tuning

Hyperparameter tuning is supported to find the model parameters that optimize the mean reciprocal rank of the model over a holdout test set.
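For reference, mean reciprocal rank is a standard ranking metric: for each user, take the reciprocal of the (1-based) rank at which the held-out article appears in that user's recommendation list, then average across users. A minimal, model-agnostic sketch (not the training job's actual implementation):

```python
def mean_reciprocal_rank(recommendations, holdouts):
    """recommendations: one ranked list of article ids per user.
    holdouts: the held-out article id for each user.
    Users whose holdout never appears contribute 0 to the mean."""
    total = 0.0
    for recs, target in zip(recommendations, holdouts):
        if target in recs:
            total += 1.0 / (recs.index(target) + 1)  # rank is 1-based
    return total / len(holdouts)

# Targets ranked 1st and 2nd -> (1/1 + 1/2) / 2 = 0.75
score = mean_reciprocal_rank([["a", "b"], ["c", "a"]], ["a", "a"])
```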

To run a tuning job, modify the PARAMS object for the site you plan to run a job on. An example is shown below:

PARAMS = {
    "hl": 15.0,
    "embedding_dim": 500,
    "epochs": 2,
    "tune": True,
    "tune_params": ["embedding_dim"],
    "tune_range": [[100, 600, 100]]
}

The tuner will grid search over every hyperparameter listed in tune_params, searching over the range at the corresponding index in tune_range. The last value of each range is used as the step size.
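The expansion described above can be sketched as follows. This is an illustration, not the tuner's actual code, and it assumes Python range semantics (stop value exclusive) for each [start, stop, step] entry:

```python
from itertools import product

PARAMS = {
    "hl": 15.0,
    "embedding_dim": 500,
    "epochs": 2,
    "tune": True,
    "tune_params": ["embedding_dim"],
    "tune_range": [[100, 600, 100]],
}

def expand_grid(params):
    """Yield one candidate parameter dict per grid point."""
    names = params["tune_params"]
    grids = [range(start, stop, step) for start, stop, step in params["tune_range"]]
    for combo in product(*grids):
        # Base parameters stay fixed; tuning bookkeeping keys are dropped.
        candidate = {k: v for k, v in params.items()
                     if k not in ("tune", "tune_params", "tune_range")}
        candidate.update(dict(zip(names, combo)))
        yield candidate

candidates = list(expand_grid(PARAMS))
# embedding_dim sweeps 100, 200, ..., 500 while hl and epochs stay fixed
```

With multiple entries in tune_params, the product of all ranges is searched, which is what makes this a grid search.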

The tuner will output the best parameters to the logs and finally train the model on the best parameters it has found.

Other Resources

Misc Documentation

Related Repositories

  • infrastructure: The database and ECS clusters are created here.
  • article-rec-db: The relevant database migrations are defined and applied here.
  • article-rec-api: Calls to the API created by this repository return article recommendations and model versions saved by the training pipeline. The API is used to surface recommendations on the front-end.
  • snowplow-analytics: The analytics pipeline used to collect user clickstream data into S3 is defined in this repository.
  • article-recommendations: The recommendations are displayed on WordPress NewsPack sites using the PHP widget defined in this repository.

Architecture Diagram

(architecture diagram image)
