
article-rec-training-job

Job that runs every two hours to create a new batch of article recommendations, using the latest Snowplow data available.

Directory Layout

.
├── cdk        # infrastructure as code for this service
├── db         # object-relational mappings to interact with the database
├── job        # key steps of the job pipeline (data fetch, preprocessing, training, etc...)
├── lib        # helpers to interact with lnl's aws resources
├── sites      # site-specific logic for each newsroom partner
└── tests      # unit tests

Environment

Environment parameters are defined in env.json.

You can add a new secret parameter using AWS SSM.

Development Tools

We use Poetry to manage dependencies and to pin dependency and Python versions. We also use pre-commit with hooks for isort, black, and flake8 to keep code style consistent and readable. Note that this means commits will fail until any code that violates these rules is fixed.

We also use mypy for static type checking. This can be run manually, and the CI runs it on PRs.

Setup

  1. Install Poetry.
  2. Run poetry install --no-root
  3. Activate the virtual environment
  4. Run pre-commit install

You're all set up! Your local environment now includes all dependencies, including dev dependencies like black; Poetry installs them from the poetry.lock file. The containerized code, however, still pulls its dependencies from requirements.txt, so any dependency change for the container must be made in pyproject.toml and then exported to requirements.txt.

Run Code Format and Linting

To manually run isort, black, and flake8 all in one go, simply run pre-commit run --all-files.

Run Static Type Checking

To manually run mypy, simply run mypy from the root directory of the project. It will use the default configuration specified in the mypy.ini file.

Update Dependencies

To update dependencies in your local environment, make changes to the pyproject.toml file then run poetry update. To update requirements.txt for the container, run poetry export -o requirements.txt --without-hashes.

Local Usage

  1. Build the container from the Dockerfile:
     kar build
  2. Run the job:
     kar run
  3. Or, run bash in the container:
     kar run bash

Running Tests

  1. Build the container from the Dockerfile:
     kar build
  2. Run the unit tests:
     kar test

Running Backfills

  1. Build the container from the Dockerfile:
     kar build
  2. Run the backfill task for the data warehouse:
     kar backfill --start-date 2021-12-01 --days 10
  3. Or, run the backfill task for the article table:
     kar article-backfill --site texas-tribune --start-date 2021-12-01 --days 10

Deploying

For dev deployment, run:

kar deploy

Each pull request merged to main triggers a new prod deployment.

Monitoring

Logs

Each log group contains a separate log stream for each client.

System Dashboards

Philadelphia Inquirer

Texas Tribune

Washington City Paper

Hyperparameter Tuning

Hyperparameter tuning is supported to find the model parameters that optimize the mean reciprocal rank of the model over a holdout test set.
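For reference, mean reciprocal rank is a standard ranking metric: for each user, take the reciprocal of the (1-based) rank at which the held-out article appears in that user's recommendation list, then average across users. A minimal, model-agnostic sketch (not the training job's actual implementation):

```python
def mean_reciprocal_rank(recommendations, holdouts):
    """recommendations: one ranked list of article ids per user.
    holdouts: the held-out article id for each user.
    Users whose holdout never appears contribute 0 to the mean."""
    total = 0.0
    for recs, target in zip(recommendations, holdouts):
        if target in recs:
            total += 1.0 / (recs.index(target) + 1)  # rank is 1-based
    return total / len(holdouts)

# Targets ranked 1st and 2nd -> (1/1 + 1/2) / 2 = 0.75
score = mean_reciprocal_rank([["a", "b"], ["c", "a"]], ["a", "a"])
```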

To run a tuning job, modify the PARAMS object for the site you plan to run a job on. An example is shown below:

PARAMS = {
    "hl": 15.0,
    "embedding_dim": 500,
    "epochs": 2,
    "tune": True,
    "tune_params": ["embedding_dim"],
    "tune_range": [[100, 600, 100]]
}

The tuner will grid search over every hyperparameter listed in tune_params, searching over the range at the corresponding index in tune_range. The last value of each range is used as the step size.
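The expansion described above can be sketched as follows. This is an illustration, not the tuner's actual code, and it assumes Python range semantics (stop value exclusive) for each [start, stop, step] entry:

```python
from itertools import product

PARAMS = {
    "hl": 15.0,
    "embedding_dim": 500,
    "epochs": 2,
    "tune": True,
    "tune_params": ["embedding_dim"],
    "tune_range": [[100, 600, 100]],
}

def expand_grid(params):
    """Yield one candidate parameter dict per grid point."""
    names = params["tune_params"]
    grids = [range(start, stop, step) for start, stop, step in params["tune_range"]]
    for combo in product(*grids):
        # Base parameters stay fixed; tuning bookkeeping keys are dropped.
        candidate = {k: v for k, v in params.items()
                     if k not in ("tune", "tune_params", "tune_range")}
        candidate.update(dict(zip(names, combo)))
        yield candidate

candidates = list(expand_grid(PARAMS))
# embedding_dim sweeps 100, 200, ..., 500 while hl and epochs stay fixed
```

With multiple entries in tune_params, the product of all ranges is searched, which is what makes this a grid search.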

The tuner will output the best parameters to the logs and finally train the model on the best parameters it has found.

Other Resources

Misc Documentation

Related Repositories

  • infrastructure: The database and ECS clusters are created here.
  • article-rec-db: The relevant database migrations are defined and applied here.
  • article-rec-api: Calls to the API created by this repository return article recommendations and model versions saved by the training pipeline. The API is used to surface recommendations on the front-end.
  • snowplow-analytics: The analytics pipeline used to collect user clickstream data into S3 is defined in this repository.
  • article-recommendations: The recommendations are displayed on WordPress NewsPack sites using the PHP widget defined in this repository.

Architecture Diagram

(architecture diagram image)
