NOPAS is a template for a Go service that scales nomad tasks
- Parametric: policy is parameter-dependent -> can be changed dynamically via HTTP calls
- Auto: given a policy, its self-correcting
- Scaler: scales nomad tasks + EC2 instances
Purpose Existing nomad metrics based autoscalers use CPU and Memory which is not sufficient for all use cases. At GovTech, our data scientists use Spark to crunch data on a daily basis. On one hand, it is costly to keep a large amount of compute resources ready at all times while on the other hand, using off-the-shelf cpu/memory-based autoscaling services may be too unresponsive.
NOPAS was built to enable users to easily add subpolicies based on more business-related needs such as pre-emptively scaling up resources in anticipation of user needs and scaling down outside of specific time periods to save cost. It comes with a simple UI for non-technical users.
Declare other env vars such as ASG_ID
(AWS_ACCESS_KEY_ID), ASG_SECRET
(AWS_SECRET_ACCESS_KEY),
VAULT_ADDR
and VAULT_TOKEN
in an .env
file.
export VAULT_TOKEN=$(cat ~/.vault-token)
docker-compose up
Policy
GET /policy
POST /policy
POST
would require a policy in the form of an application/json
as described below.
State
GET /state
PUT /state/pause
PUT /state/resume
GET /state
expects a 200
and a boolean
on whether NOPAS is running.
Resouce Count
GET /resource
Returns an object with keys representing resource and value representing current count of resource.
{
"Resource1":3,
"Resource2":50
}
Predefined
GET /predefined
Returns an object with 2 fields each providing a list of predefined subpolicy and ensembler names.
{
"subpolicies": ["policy1", "policy2"],
"ensemblers": ["conservative", "average"],
}
Health
GET /ping
expects a status code of 200
and a pong
A policy will govern how the scaling service manage the resources assigned to it. Each checking-scaliing cycle is performed by the policy in the following manner:
- Each subpolicy will produce a map of resource to recommended count via various logic
- Collate a list of recommended counts for each resource and resolve it via an ensembling method.
- Each resource will check if the recommendation is within their allowable limits (
MaxCount
andMinCount
) and if the previous scaling operation is within the cooldown period (Cooldown
). - If conditions are met, scaling is performed.
Policy Structure
- Checking frequency
- Resources
- Subpolicies
- Ensembling method
The autoscaling service will regularly initiate a checking-scaling cycle based on a user-defined time interval. The default checking frequency is 10s.
A resource refers to both the compute resource (e.g. EC2) and the nomad client. Resources are independent of other resources in the eyes of the policy when scaling is performed.
Resource Definition
- EC2 (see below)
- Nomad (see below)
- Cooldown - minimum duration between scaling. A required string indicating duration, e.g. "1m40s".
- N2CRatio - ratio of nomad to compute resources. A required number.
EC2 Definition
Name | Description | Type | Required |
---|---|---|---|
ScalingGroupName | AWS EC2 auto-scaling group name | string | Yes |
Region | AWS service region, e.g. ap-southeast-1 |
string | Yes |
MaxCount | Maximum allowable desired count | number | Yes |
MinCount | Minimum allowable desired count | number | Yes |
Nomad Definition
Name | Description | Type | Required |
---|---|---|---|
Address | Address of nomad service | string | Yes |
JobName | Name of nomad job to be tracking and updating | string | Yes |
NomadPath | Vault path for Nomad ACL token | string | Yes |
MaxCount | Maximum allowable desired count | number | Yes |
MinCount | Minimum allowable desired count | number | Yes |
"Example": {
"EC2": {
"ScalingGroupName": "<<auto scaling group name>>",
"Region": "ap-southeast-1",
"MaxCount": 25,
"MinCount": 1
},
"Nomad": {
"Address": "<nomad address>",
"JobName": "<nomad job name>",
"NomadPath": "<secret's path>",
"MaxCount": 25,
"MinCount": 1
},
"Cooldown": "1m0s",
"N2CRatio": 1
}
Subpolicies outline the logic behind deriving a recommended nomad task-group count.
Each sub-policy will
- track a metric
- recommend counts for resources under its management
Implementing a new subpolicy
Users can implement their own custom subpolicy by implementing the Subpolicy
interface and follow the GenericSubPolicy
structure. Two examples, core_ratio_subpolicy
and office_hour_subpolicy
have been implemented.
Core ratio subpolicy tracks a Spark master endpoint to find out the core usage and scales accordingly while the office hour subpolicy keeps a minimum count of resources between predefined hours.
Subpolicy API
Name | Description | Type | Required |
---|---|---|---|
Name | Name of subpolicy. Important This name needs to match the string in the CreateSpecificSubpolicy function. |
string | Yes |
ManagedResources | List of resource to be managed by subpolicy. Resource name needs to match corresponding resource key in Resources part of the policy definition |
array[string] | Yes |
Metadata | Metadata specific to sub-policy. | Object | Yes |
For example: |
{
"Name": "CoreRatio",
"ManagedResources": [
"SparkWorker"
],
"Metadata": {
"MetricSource": "https://some-endpoint",
"UpThreshold": 0.5,
"DownThreshold": 0.25,
"ScaleUp": {
"Changetype": "multiply",
"ChangeValue": 2
},
"ScaleDown": {
"Changetype": "multiply",
"ChangeValue": 0.5
}
}
}
Given that each subpolicy will recommend a count, Ensemble.go
provides ensembling methods to resolve multiple recommendations.
Users can implement their own methods by implementing the Ensembler
interface.
Various ensembling methods can be considered for each resource
- Conservative(takes the maximum to be safe)
- Averaging
- Cost-saving(takes the minimum to save cost)
{
"CheckingFreq": "10s",
"Resources": {
"ImportantJob": {
"EC2": {
"ScalingGroupName": "group_name",
"Region": "ap-southeast-1",
"MaxCount": 25,
"MinCount": 1
},
"Nomad": {
"Address": "https://example.nomad.address",
"JobName": "important_job",
"NomadPath": "",
"MaxCount": 25,
"MinCount": 1
},
"Cooldown": "1m0s",
"N2CRatio": 1
}
},
"Subpolicies": [
{
"Name": "CoreRatio",
"ManagedResources": [
"SparkWorker"
],
"Metadata": {
"MetricSource": "https://some-endpoint",
"UpThreshold": 0.5,
"DownThreshold": 0.25,
"ScaleOut": {
"Changetype": "multiply",
"ChangeValue": 2
},
"ScaleIn": {
"Changetype": "multiply",
"ChangeValue": 0.5
}
}
}
],
"Ensembler": "Conservative"
}