This repository shows how to use AWS Step Functions to train and deploy AutoGluon tabular models on Amazon SageMaker

aws-samples/aws-stepfunctions-automl-workflow

Manage AutoML workflows with AWS StepFunctions and AutoGluon on Amazon SageMaker

In this repository, we present a deployment-ready AWS stack that uses AWS Step Functions to orchestrate AutoML workflows with AutoGluon on Amazon SageMaker.

A complete description can be found in the corresponding blog post.

[Figures: Main State Machine, Training State Machine, Deployment State Machine]

Installation

Prerequisites

  • Node.js 16.13.1
  • Python 3.7.10

Step-by-step setup

  1. Clone this repository to your cloud environment of choice (Cloud9, EC2 instance, local AWS environment, ...)

  2. Create the IAM role needed to deploy the stack (skip to step 3 if you already have a role with sufficient permissions and trust relationship).

  • Using AWS CLI

    1. If it is not configured yet, configure the AWS CLI profile that you would like to use with aws configure and follow the instructions
    2. Create a new IAM role that can be used by CloudFormation with aws iam create-role --role-name {YOUR_ROLE_NAME} --assume-role-policy-document file://trust_policy.json
    3. Attach the permissions policy to the new role with aws iam put-role-policy --role-name {YOUR_ROLE_NAME} --policy-name {YOUR_POLICY_NAME} --policy-document file://permissions_policy.json
  • Alternatively, you can create the role using the AWS IAM Management Console. Once created, make sure to update the Trust Relationship with trust_policy.json and attach a customer Permissions Policy based on permissions_policy.json

  3. Create a new Python virtual environment with python3 -m venv .venv

  4. Activate the environment with source .venv/bin/activate

  5. Install AWS CDK with npm install -g aws-cdk@2.8.0

  6. Install requirements with pip install -r requirements.txt

  7. Bootstrap AWS CDK for your AWS account with cdk bootstrap aws://{AWS_ACCOUNT_ID}/{REGION}. If your account has already been bootstrapped with cdk@1.X, you may need to manually delete the CDKToolkit stack from the AWS CloudFormation console to avoid compatibility issues with cdk@2.X. Once de-bootstrapped, proceed by re-bootstrapping.

  8. Deploy the stack with cdk deploy -r {NEW_ROLE_ARN}

Notebook Walkthrough (SUGGESTED)

Once the stack is deployed, you can familiarize yourself with the resources using the tutorial notebooks/AutoML Walkthrough.ipynb.

State Machines Input Documentation

Action flows defined using AWS Step Functions are called State Machines. Each machine has parameters that can be defined at runtime (i.e. execution-specific), which are specified through an input JSON object. Some examples of input parameters are provided in notebooks/input/. Although they are meant to be used during the notebook tutorial, you can also copy and paste them directly into the AWS Console.
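As an illustration of passing the input JSON object programmatically instead of through the console, here is a minimal sketch. It assumes a boto3 Step Functions client, whose start_execution call takes the input as a JSON string. The helper name start_automl_execution and the placeholder ARN are hypothetical and not part of this repository.

```python
import json
from typing import Any, Optional


def start_automl_execution(sfn_client: Any, state_machine_arn: str,
                           parameters: dict, name: Optional[str] = None) -> str:
    """Start one State Machine execution and return its execution ARN.

    `sfn_client` is expected to behave like boto3.client("stepfunctions").
    `parameters` is the content of the top-level "Parameters" key.
    """
    kwargs = {
        "stateMachineArn": state_machine_arn,
        # Step Functions expects the execution input as a JSON string, not a dict.
        "input": json.dumps({"Parameters": parameters}),
    }
    if name is not None:
        # Optional execution name.
        kwargs["name"] = name
    response = sfn_client.start_execution(**kwargs)
    return response["executionArn"]


# Usage (requires boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   client = boto3.client("stepfunctions")
#   start_automl_execution(client, "<MAIN_STATE_MACHINE_ARN>", parameters)
```

Passing the client as an argument keeps the helper easy to exercise with a stub object before pointing it at a real account.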

Request Syntax

{
    "Parameters": {
      "Flow": {
        "Train": true|false,
        "Evaluate": true|false,
        "Deploy": true|false
      },
      "PretrainedModel": {
        "Name": "string"
      },
      "Train": {
        "TrainDataPath": "string",
        "TestDataPath": "string",
        "TrainingOutput": "string",
        "InstanceCount": int,
        "InstanceType": "string",
        "FitArgs": "string",
        "InitArgs": "string"
      },
      "Evaluation": {
        "Threshold": float,
        "Metric": "string"
      },
      "Deploy": {
        "InstanceCount": int,
        "InstanceType": "string",
        "Mode": "endpoint"|"batch",
        "BatchInputDataPath": "string",
        "BatchOutputDataPath": "string"
      }
    }
}
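FitArgs and InitArgs appear as plain strings in the syntax above because they carry a double JSON-encoded dictionary. The sketch below shows one plausible reading of that requirement: applying json.dumps twice, so that two successive decodes recover the original dictionary. The helper name double_encode and all argument values (including the label column "target") are illustrative assumptions. Compare the result with the files in notebooks/input/ before relying on it.

```python
import json


def double_encode(args: dict) -> str:
    """JSON-encode a dictionary twice, so that two decodes return the dictionary."""
    return json.dumps(json.dumps(args))


# Illustrative values only; "target" is a hypothetical label column.
init_args = double_encode(
    {"label": "target", "problem_type": "binary", "eval_metric": "roc_auc"}
)
fit_args = double_encode({"presets": "medium_quality", "time_limit": 600})

# The first decode yields a JSON string, the second yields the dictionary.
once = json.loads(init_args)
twice = json.loads(once)
print(type(once).__name__, type(twice).__name__)
```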

Parameters

  • Flow
    • Train (bool) - (REQUIRED) indicates if a new AutoGluon SageMaker Training Job is required. Set to false to deploy a pretrained model.
    • Evaluate (bool) - set to true if evaluation is required. If selected, an AWS Lambda function will retrieve the model's performance on the test set and evaluate it against a user-defined threshold. If the model's performance is not satisfactory, deployment is skipped.
    • Deploy (bool) - (REQUIRED) indicates if model has to be deployed.
  • PretrainedModel
    • Name (string) - indicates which pretrained model to use for deployment. Models are referenced through their SageMaker Model Name. If Flow.Train = true this field is ignored, otherwise it's required.
  • Train (REQUIRED if Flow.Train = true)
    • TrainDataPath (string) - S3 URI where the training csv is stored. Header and target variable are required. AutoGluon will perform a holdout split for validation automatically.
    • TestDataPath (string) - S3 URI where the test csv is stored. Header and target variable are required. This dataset is used to evaluate model performance on samples not seen during training.
    • TrainingOutput (string) - S3 URI where model artifacts are stored at the end of the training job.
    • InstanceCount (int) - Number of instances to be used for training.
    • InstanceType (string) - AWS instance type to be used for training (e.g. ml.m4.2xlarge). See full list here.
    • FitArgs (string) - double JSON-encoded dictionary containing parameters to be used during model .fit(). List of available parameters here. The dictionary needs to be encoded twice because it will be decoded both by the State Machine and by the SageMaker Training Job.
    • InitArgs (string) - double JSON-encoded dictionary containing parameters to be used when the model is instantiated with TabularPredictor(). List of available parameters here. The dictionary needs to be encoded twice because it will be decoded both by the State Machine and by the SageMaker Training Job. Common parameters are label, problem_type and eval_metric.
  • Evaluation (REQUIRED if Flow.Evaluate = true)
    • Threshold (float) - Metric threshold above which model performance is considered satisfactory. All metrics are maximized (e.g. losses are represented as negative losses).
    • Metric (string) - Metric name used for evaluation. Accepted metrics correspond to the available eval_metric values from AutoGluon.
  • Deploy (REQUIRED if Flow.Deploy = true)
    • InstanceCount (int) - Number of instances to be used for deployment.
    • InstanceType (string) - AWS instance type to be used for deployment (e.g. ml.m4.2xlarge). See full list here.
    • Mode (string) - Model deployment mode. Supported modes are batch for SageMaker Batch Transform Job and endpoint for SageMaker Endpoint.
    • BatchInputDataPath (string) - (REQUIRED if Mode = batch) S3 URI of the dataset against which predictions are generated. Data must be stored in csv format, without header and with the same column order as the training dataset.
    • BatchOutputDataPath (string) - (REQUIRED if Mode = batch) S3 URI where batch predictions are stored.
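The conditional requirements listed above can be checked before starting an execution. The following is a minimal sketch of such a check, written only from the rules stated in this section. The function validate_parameters is hypothetical and not part of the repository, and the State Machines may apply their own validation.

```python
def validate_parameters(params: dict) -> list:
    """Return a list of problems found in the "Parameters" object (empty if none)."""
    problems = []
    flow = params.get("Flow", {})

    # Flow.Train and Flow.Deploy are documented as required booleans.
    for key in ("Train", "Deploy"):
        if not isinstance(flow.get(key), bool):
            problems.append(f"Flow.{key} is required and must be a boolean")

    # Train block is required when training; otherwise a pretrained model is needed.
    if flow.get("Train") is True and "Train" not in params:
        problems.append("Train is required when Flow.Train is true")
    if flow.get("Train") is False and not params.get("PretrainedModel", {}).get("Name"):
        problems.append("PretrainedModel.Name is required when Flow.Train is false")

    # Evaluation block is required when evaluation is requested.
    if flow.get("Evaluate") is True and "Evaluation" not in params:
        problems.append("Evaluation is required when Flow.Evaluate is true")

    # Deploy block is required when deploying; batch mode needs both S3 paths.
    if flow.get("Deploy") is True:
        deploy = params.get("Deploy")
        if deploy is None:
            problems.append("Deploy is required when Flow.Deploy is true")
        else:
            mode = deploy.get("Mode")
            if mode not in ("endpoint", "batch"):
                problems.append('Deploy.Mode must be "endpoint" or "batch"')
            if mode == "batch":
                for key in ("BatchInputDataPath", "BatchOutputDataPath"):
                    if not deploy.get(key):
                        problems.append(f"Deploy.{key} is required when Mode is batch")
    return problems
```

Returning a list rather than raising on the first problem makes it possible to report every missing field at once.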

Repo structure

  • app.py entrypoint
  • stepfunctions_automl_workflow/lambdas/ AWS Lambda source scripts
  • stepfunctions_automl_workflow/utils/ utility functions used for stack generation
  • stepfunctions_automl_workflow/stack.py CDK stack definition
  • notebooks/ Jupyter Notebooks to familiarise with the artifacts
  • notebooks/input/ Input examples to be fed in State Machines

Clean-up

WARNING: You will still be able to keep SageMaker artifacts, but the AWS Step Functions State Machines will be deleted along with their execution history. Clean up all resources with cdk destroy.

CDK cheatsheet

  • cdk ls list all stacks in the app
  • cdk synth emits the synthesized CloudFormation template
  • cdk deploy deploy this stack to your default AWS account/region
  • cdk diff compare deployed stack with current state
  • cdk docs open CDK documentation

Enjoy!
