
v1.0-prerelease
ArseniiPetrovich committed May 28, 2020
1 parent d836e05 commit fc590e4
Showing 55 changed files with 2,419 additions and 185 deletions.
4 changes: 4 additions & 0 deletions .gitmodules
@@ -0,0 +1,4 @@
[submodule "azure/modules/load_balancer"]
path = azure/modules/load_balancer
url = https://github.com/ArseniiPetrovich/terraform-azurerm-loadbalancer.git
branch = patch-2
25 changes: 21 additions & 4 deletions README.md
@@ -14,13 +14,14 @@ The Polkadot node and the Consul server are both installed during instance startup
## Leader election mechanism overview

For the leader election mechanism, the script reuses the existing solution implemented by the [Hashicorp](https://www.hashicorp.com/) team in their [Consul](https://www.consul.io/) product. A minimum of 3 nodes is required to start the failover scripts. This requirement comes from the [Raft algorithm](https://www.consul.io/docs/internals/consensus.html), which is used to reach consensus on the current validator.

The algorithm works by electing one leader across all nodes joined to the cluster. When one of the instances goes down, the rest can still reach consensus via a majority quorum. If the majority quorum is reached (2 out of 3 nodes, 3 out of 5 nodes, etc.), a new validator gets elected. Non-selected nodes continue to operate in non-validator mode. An even number of instances could cause the so-called "split-brain" situation if exactly half of the nodes go offline: no leader will be elected at all, because no majority quorum can be reached with 2 out of 4 (3 out of 6, etc.) instances (51% of the votes is never reached).

Even if an entire region of a cloud provider goes down, this solution ensures that the Polkadot node stays up, given that the two other regions are still up and the Consul cluster can therefore reach the quorum majority.
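
As a hedged illustration of this pattern, every node could run a small `consul lock` loop like the sketch below. This is not the exact script shipped in this repo; the lock name and the container name are placeholders.

```bash
#!/usr/bin/env bash
# Minimal leader-election sketch (placeholders, not the repo's actual init script).
# Every node runs this loop; only the node that acquires the Consul lock acts as the validator.
while true; do
  consul lock validator-leader '
    echo "Lock acquired - this node is the leader, starting the validator..."
    docker start polkadot                # container name is a placeholder
    # Hold the lock while this node stays healthy; if the node or its region
    # dies, the Consul session expires and another node grabs the lock.
    sleep infinity
  '
  echo "Lost the lock - another node is the leader now, retrying..."
  sleep 5
done
```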

## Project structure overview

This project contains 4 folders.
This project contains 7 folders.

### [CircleCI](.circleci/)

@@ -32,15 +33,31 @@ This folder contains Terraform configuration scripts that will deploy the failov

These scripts will deploy the following architecture components:

![AWS Design](architecture.png "AWS Design architecture")
![AWS Design](aws-architecture.png "AWS Design architecture")

### [Azure](azure/)

This folder contains Terraform configuration scripts that will deploy the failover solution to the Azure cloud. Use the [terraform.tfvars.example](azure/terraform.tfvars.example) file to see the very minimum configuration required to deploy the solution. See the README inside the Azure folder for more details.

![Azure Design](azure-architecture.png "Azure Design architecture")

### [GCP](gcp/)

This folder contains Terraform configuration scripts that will deploy the failover solution to Google Cloud Platform. Use the [terraform.tfvars.example](gcp/terraform.tfvars.example) file to see the very minimum configuration required to deploy the solution. See the README inside the GCP folder for more details.

![GCP Design](gcp-architecture.png "GCP Design architecture")

### [Docker](docker/)

This folder contains the Dockerfile for the Docker image that is published on DockerHub.

### [Tests](tests/)

This folder contains a set of tests that are run through the CI mechanism. These tests can also be launched manually. Simply go to the tests folder, select the provider to check the solution at, open the scripts and read the set of environment variables you need to export. Export these variables, install [GoLang](https://golang.org/doc/install) and execute the `go test` command to run the CI tests manually.
This folder contains a set of tests that are run through the CI mechanism. These tests can also be launched manually. Simply go to the tests folder, select the provider to check the solution at, open the scripts and read the set of environment variables you need to export. Export these variables, install [GoLang](https://golang.org/doc/install) and execute the `go mod init project` and then `go test` commands to run the CI tests manually.
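
A manual run for one of the providers could look roughly like the sketch below; the environment variable names are placeholders, and the exact set is defined in the test scripts themselves.

```bash
# Rough sketch of a manual test run (AWS taken as an example).
cd tests/aws
export AWS_ACCESS_KEY="AKIA..."   # assumed variable name - check the test sources
export AWS_SECRET_KEY="..."       # assumed variable name - check the test sources
go mod init project
go test -v -timeout 60m           # Terraform-based tests can take a long time
```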

### [Init helpers](init-helpers/)

This folder contains a set of configuration and shell script files that are required for the solution to work. VMs download these files during startup.

# About us

@@ -57,4 +74,4 @@ Feel free to contribute by opening issues and PR at this repository. There are n

[![License: Apache v2.0](https://img.shields.io/badge/license-MIT%2FApache--2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0.txt)

This project is licensed under the GNU General Public License v3.0. See the [LICENSE](LICENSE.md) file for details.
This project is licensed under the GNU General Public License v3.0. See the [LICENSE](LICENSE.md) file for details.
Binary file removed architecture.png
27 changes: 11 additions & 16 deletions aws/README.md
@@ -4,14 +4,17 @@

### Prerequisites

You will need an instance supported by [Terraform](https://www.terraform.io/downloads.html), further referred to as the *Deployer* instance. There are no specific requirements for these particular scripts.
1. An instance, further referred to as the *Deployer* instance, from which these scripts will be run.
2. [Terraform](https://www.terraform.io/downloads.html). To install Terraform, proceed to the [Install Terraform](#install-terraform) section.
3. (Optional) AWS CLI. To install the AWS CLI, follow [the instructions](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html).

Also you will need a set of keys known as `STASH`, `CONTROLLER` and `SESSION KEYS`. As of this release, in the `Kusama` and `Westend` networks there are 5 keys inside the `SESSION KEYS` object - [GRANDPA, BABE, ImOnline, Parachains, AuthorityDiscovery](https://github.com/paritytech/polkadot/blob/master/runtime/kusama/src/lib.rs#L258). You will have to generate all of them. You can do this either with the [Subkey](https://substrate.dev/docs/en/ecosystem/subkey) tool or on the [PolkadotJS](https://polkadot.js.org/apps/#/accounts) website.
Also you will need a set of keys known as `NODE KEY`, `STASH`, `CONTROLLER` and `SESSION KEYS`. As of this release, in the `Kusama` and `Westend` networks there are 5 keys inside the `SESSION KEYS` object - [GRANDPA, BABE, ImOnline, Parachains, AuthorityDiscovery](https://github.com/paritytech/polkadot/blob/master/runtime/kusama/src/lib.rs#L258). You will have to generate all of them. You can do this either with the [Subkey](https://substrate.dev/docs/en/ecosystem/subkey) tool or on the [PolkadotJS](https://polkadot.js.org/apps/#/accounts) website.
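
As a rough illustration only (Subkey flag syntax differs between releases, so check `subkey --help` for your version), key generation with Subkey looks roughly like this:

```bash
# Illustration only - exact flags vary between Subkey versions.
subkey generate --scheme sr25519   # e.g. a STASH key
subkey generate --scheme ed25519   # e.g. a CONTROLLER or GRANDPA key
# Each command prints a secret seed, a public key and an SS58 address;
# keep the seeds safe - they are what the key-insert helper later feeds to the node.
```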

#### Keys reference

| Key name | Key short name | Key type |
| ------------------- | -------------- | -------- |
| NODE KEY | - | ed25519 |
| STASH | - | sr25519 |
| CONTROLLER | - | ed25519 |
| GRANDPA | gran | ed25519 |
@@ -33,14 +36,15 @@ Either clone this repo using `git clone` command or simply download it from Web
### Run the Terraform scripts

1. Open `aws` folder of the cloned (downloaded) repo.
2. Create a `terraform.tfvars` file inside the `aws` folder of the cloned repo, where `terraform.tfvars.example` is located
2. Create a `terraform.tfvars` file inside the `aws` folder of the cloned repo, where `terraform.tfvars.example` is located.
3. Fill it with the appropriate variables. You can check the very minimum example in the [example](terraform.tfvars.example) file and the full list of supported variables (and their types) in the [variables](variables.tf) file. Fill the `validator_keys` variable with your SESSION KEYS. For key types, use the short names from the [Keys reference](#keys-reference) table.
4. Set `AWS_ACCESS_KEY` and `AWS_SECRET_KEY` environment variables.
5. (Optional) You can place the Terraform state file either in an S3 bucket or on your local machine. To keep it on the local machine, rename the `remote-state.tf` file to `remote-state.tf.stop`. To place it on S3, create an S3 bucket and proceed to the next step - you will be interactively asked to provide the S3 configuration details.
6. Run `terraform init`
6. Run `terraform init`.
7. Run `terraform plan -out terraform.tfplan` and check the set of resources to be created on your cloud account.
8. If you are okay with the proposed plan - run `terraform apply terraform.tfplan` to apply the deployment (a condensed command sketch follows this list).
9. After the deployment is complete you can open your EC2 console to check that the instances were deployed successfully. You can also open Polkadot explorer and ensure that your node is in the validators list.
9. After the deployment is complete you can open your EC2 console to check that the instances were deployed successfully.
10. (Optional) Subscribe to notifications. As of now, Terraform does not support automatic email alert creation due to an AWS API limitation. Thus, these scripts create an SNS topic that you should subscribe to manually to start receiving alert messages.
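
Taken together, the command-line portion of steps 4-8 boils down to roughly the following sketch (values are placeholders):

```bash
# Condensed sketch of steps 4-8 above.
export AWS_ACCESS_KEY="AKIA..."        # placeholder
export AWS_SECRET_KEY="..."            # placeholder
cd aws
terraform init
terraform plan -out terraform.tfplan   # review the proposed resources
terraform apply terraform.tfplan
```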

### Validate

@@ -60,11 +64,10 @@ The argument for sessions.setKeys will be 0xbeaa0ec217371a8559f0d1acfcc4705b4808
Note that there is only one 0x left, all the others are omitted.
```
4. Start validating - perform a `staking.validate` transaction.
5. Subscribe to notifications at the AWS Simple Notification Service to start receiving alarms from your nodes.

# Known issues & limitations

## Prefix should contain alphanumeric characters and has to be short
## Prefix should contain alphanumeric characters only and has to be short

The prefix is used in the names of most resources so that they can be easily identified among others. This causes a limitation, because not all of the deployed resources support long names or names with non-alphanumeric symbols. The optimum is around 5 alphanumeric characters for the system prefix.

@@ -78,12 +81,4 @@ As for now the implemented failover mechanism won't work if 2 out of the 3 chose

## Not all disks are deleted after infrastructure is deleted

This is the intended behavior. The architecture was designed to minimize the risk of data loss, so the validator does not need to resync the blockchain data. The disks that survive the infrastructure destruction contain Polkadot data, so the very same disks can be reused when re-raising the infrastructure. If the data disks are no longer needed, they can be deleted manually. To do that, simply filter them using the `prefix` tag. All the infrastructure components are tagged with the `prefix` tag during the deployment.
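
For illustration, leftover data volumes could be located and removed with the AWS CLI roughly as follows; the tag key is an assumption, so check the Terraform code for the exact tag applied.

```bash
# Sketch: list detached volumes carrying the prefix tag, then delete the ones
# that are no longer needed (tag key and value are assumptions).
aws ec2 describe-volumes \
  --filters "Name=tag:prefix,Values=<your-prefix>" "Name=status,Values=available" \
  --query "Volumes[].VolumeId" --output text
# aws ec2 delete-volume --volume-id vol-0123456789abcdef0   # per volume, after double-checking
```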

You can also override this default behavior by setting the `delete_on_terminate` variable to `true`.

# Proposed improvements

## Spot instances

The average failure rate for Spot instances is around 1-15% hourly. It can be decreased by using an AWS Launch Template with an array of supported Spot instance types. The failover mechanism will re-elect another node as the cluster leader, so there will be just a couple of seconds of downtime. Thus, infrastructure costs can be reduced drastically with only a minor decrease in the failover reliability provided by these scripts.
Set the `delete_on_terminate` variable to `true` to override this behavior.
138 changes: 28 additions & 110 deletions aws/modules/regional_infrastructure/files/init.sh.tpl
@@ -1,7 +1,7 @@
#!/usr/bin/env bash

# Set quit on error flags
set -x -e -E
set -x -eE

### Function that attaches disk
disk_attach ()
@@ -140,7 +140,6 @@ for DISK in /dev/nvme?; do
fi
done

usermod -a -G docker ec2-user
# Run docker with regular polkadot container inside of it
/usr/bin/systemctl start docker

@@ -161,33 +160,6 @@ done
set -eE
trap default_trap ERR EXIT

cat <<EOF >/usr/local/bin/watcher.sh
#!/bin/bash
\$(docker inspect -f "{{.State.Running}}" polkadot && curl -s -H "Content-Type: application/json" -d '{"id":1, "jsonrpc":"2.0", "method": "system_health", "params":[]}' http://127.0.0.1:9933);
STATE=\$?
BLOCK_NUMBER=\$((\$(curl -s -H "Content-Type: application/json" -d '{"id":1, "jsonrpc":"2.0", "method": "chain_getBlock", "params":[]}' http://127.0.0.1:9933 | jq .result.block.header.number -r)))
AMIVALIDATOR=\$(curl -s -H "Content-Type: application/json" -d '{"id":1, "jsonrpc":"2.0", "method": "system_nodeRoles", "params":[]}' http://127.0.0.1:9933 | jq -r .result[0])
if [ "\$AMIVALIDATOR" == "Authority" ]; then
AMIVALIDATOR=1
else
AMIVALIDATOR=0
fi
regions=( ${primary-region} ${secondary-region} ${tertiary-region} )
for i in "\$${regions[@]}"; do
aws cloudwatch put-metric-data --region \$i --metric-name "Health report" --dimensions AutoScalingGroupName=${autoscaling-name} --namespace "${prefix}" --value "\$STATE"
aws cloudwatch put-metric-data --region \$i --metric-name "Block Number" --dimensions InstanceID="\$(curl --silent http://169.254.169.254/latest/meta-data/instance-id)" --namespace "${prefix}" --value "\$BLOCK_NUMBER"
aws cloudwatch put-metric-data --region \$i --metric-name "Validator count" --dimensions AutoScalingGroupName=${autoscaling-name} --namespace "${prefix}" --value "\$AMIVALIDATOR"
done
EOF

chmod 777 /usr/local/bin/watcher.sh

### This will add a crontab entry that will check nodes health from inside the VM and send data to the CloudWatch
(echo '* * * * * /usr/local/bin/watcher.sh') | crontab -

# Clone and install consul
git clone https://github.com/hashicorp/terraform-aws-consul.git
terraform-aws-consul/modules/install-consul/install-consul --version 1.7.2
@@ -227,81 +199,18 @@ CPU=$(aws ssm get-parameter --region $(curl --silent http://169.254.169.254/late
RAM=$(aws ssm get-parameter --region $(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region) --name "/polkadot/validator-failover/${prefix}/ram_limit" | jq -r .Parameter.Value)
NODEKEY=$(aws ssm get-parameter --region $(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region) --name "/polkadot/validator-failover/${prefix}/node_key" | jq -r .Parameter.Value)

cat <<EOF >/usr/local/bin/double-signing-control.sh
#!/bin/bash
set -x
BEST=\$(/usr/local/bin/consul kv get best_block)
retVal=\$?
set -eE
echo "Previous validator best block - \$BEST"
if [ "\$retVal" -eq 0 ]; then
VALIDATED=0
until [ "\$VALIDATED" -gt "\$BEST" ]; do
BEST_TEMP=\$(/usr/local/bin/consul kv get best_block)
if [ "\$BEST_TEMP" != "\$BEST" ]; then
consul leave
shutdown now
exit 1
else
BEST=\$BEST_TEMP
VALIDATED=\$(/usr/bin/docker logs polkadot 2>&1 | /usr/bin/grep finalized | /usr/bin/tail -n 1)
VALIDATED=\$(/usr/bin/echo \$${VALIDATED##*#} | /usr/bin/cut -d'(' -f1 | /usr/bin/xargs)
echo "Previous validator best block - \$BEST, new validator validated block - \$VALIDATED"
sleep 10
fi
done
fi
EOF

cat <<EOF >/usr/local/bin/key-insert.sh
#!/bin/bash
set -x -e -E
region="\$(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)"
key_names=\$(aws ssm get-parameters-by-path --region \$region --recursive --path /polkadot/validator-failover/${prefix}/keys/ | jq .Parameters[].Name | awk -F'/' '{print\$(NF-1)}' | sort | uniq)
curl -o /usr/local/bin/double-signing-control.sh -L https://raw.githubusercontent.com/protofire/polkadot-failover-mechanism/master/init-helpers/double-signing-control.sh
curl -o /usr/local/bin/best-grep.sh -L https://raw.githubusercontent.com/protofire/polkadot-failover-mechanism/master/init-helpers/best-grep.sh
curl -o /usr/local/bin/key-insert.sh -L https://raw.githubusercontent.com/protofire/polkadot-failover-mechanism/master/init-helpers/aws/key-insert.sh
curl -o /usr/local/bin/watcher.sh -L https://raw.githubusercontent.com/protofire/polkadot-failover-mechanism/master/init-helpers/aws/watcher.sh

for key_name in \$${key_names[@]} ; do
echo "Adding key \$key_name"
SEED="\$(aws ssm get-parameter --with-decryption --region \$region --name /polkadot/validator-failover/${prefix}/keys/\$key_name/seed | jq -r .Parameter.Value)"
KEY="\$(aws ssm get-parameter --region \$region --name /polkadot/validator-failover/${prefix}/keys/\$key_name/key | jq -r .Parameter.Value)"
TYPE="\$(aws ssm get-parameter --region \$region --name /polkadot/validator-failover/${prefix}/keys/\$key_name/type | jq -r .Parameter.Value)"
curl -s -H "Content-Type: application/json" -d '{"id":1, "jsonrpc":"2.0", "method": "author_insertKey", "params":["'"\$TYPE"'","'"\$SEED"'","'"\$KEY"'"]}' http://localhost:9933
done
EOF

cat <<EOF >/usr/local/bin/best-grep.sh
#!/bin/bash
BEST=\$(docker logs polkadot 2>&1 | grep finalized | tail -n 1 | cut -d':' -f4 | cut -d'(' -f1 | cut -d'#' -f2 | xargs)
re='^[0-9]+$'
if [[ "\$BEST" =~ \$re ]] ; then
if [ "\$BEST" -gt 0 ] ; then
/usr/local/bin/consul kv put best_block "\$BEST"
else
echo "Block number either cannot be compared with 0, or not greater than 0"
fi
else
echo "Block number is not a number, skipping block insertion"
fi
sleep 7
EOF

chmod 700 /usr/local/bin/key-insert.sh
chmod 700 /usr/local/bin/best-grep.sh
chmod 700 /usr/local/bin/double-signing-control.sh
chmod 700 /usr/local/bin/best-grep.sh
chmod 700 /usr/local/bin/key-insert.sh
chmod 700 /usr/local/bin/watcher.sh

### This will add a crontab entry that will check nodes health from inside the VM and send data to the CloudWatch
(echo '* * * * * /usr/local/bin/watcher.sh ${prefix} ${autoscaling-name} ${primary-region} ${secondary-region} ${tertiary-region}') | crontab -

# Create lock for the instance
n=0
@@ -310,17 +219,26 @@ trap - ERR

until [ $n -ge 6 ]; do

/usr/local/bin/consul lock prefix "/usr/local/bin/double-signing-control.sh && /usr/local/bin/key-insert.sh && consul kv delete blocks/.lock && consul lock blocks \"while true; do /usr/local/bin/best-grep.sh; done\" & docker stop polkadot && docker rm polkadot && /usr/bin/docker run --cpus $${CPU} --memory $${RAM}GB --kernel-memory $${RAM}GB --name polkadot --restart unless-stopped -p 30333:30333 -p 127.0.0.1:9933:9933 -v /data:/data chevdor/polkadot:latest polkadot --chain ${chain} --unsafe-rpc-external --rpc-cors=all --validator --name '$NAME' --node-key '$NODEKEY'"
set -eE
trap "/usr/local/bin/consul leave; docker stop polkadot" ERR
node=$(curl -s -H "Content-Type: application/json" -d '{"id":1, "jsonrpc":"2.0", "method": "system_nodeRoles", "params":[]}' http://localhost:9933 | grep Full | wc -l)
if [ "$node" != 1 ]; then
echo "ERROR! Node either does not work or work in not correct way"
/usr/local/bin/consul leave
docker stop polkadot
fi
trap - ERR
set +eE

/usr/local/bin/consul lock prefix "/usr/local/bin/double-signing-control.sh && /usr/local/bin/key-insert.sh ${prefix}; consul kv delete blocks/.lock && consul lock blocks \"while true; do /usr/local/bin/best-grep.sh; done\" & docker stop polkadot && docker rm polkadot && /usr/bin/docker run --cpus $${CPU} --memory $${RAM}GB --kernel-memory $${RAM}GB --name polkadot --restart unless-stopped -p 30333:30333 -p 127.0.0.1:9933:9933 -v /data:/data chevdor/polkadot:latest polkadot --chain ${chain} --unsafe-rpc-external --rpc-cors=all --validator --name '$NAME' --node-key '$NODEKEY'"

/usr/bin/docker stop polkadot || true
/usr/bin/docker rm polkadot || true
/usr/bin/docker run --cpus $CPU --memory $${RAM}GB --kernel-memory $${RAM}GB --name polkadot --restart unless-stopped -d -p 30333:30333 -p 127.0.0.1:9933:9933 -v /data:/data:z chevdor/polkadot:latest polkadot --chain ${chain} --rpc-external --rpc-cors=all --pruning=archive

sleep 10;
n=$[$n+1]

done

set -eE
trap default_trap ERR EXIT

consul leave
shutdown now

# The instance will shut down when losing the lock because no Docker container will be running. The ASG will replace the instance because it will not pass the ELB health check, which monitors all the Consul ports and port 30333 (the Polkadot node's port).
default_trap
3 changes: 1 addition & 2 deletions aws/modules/regional_infrastructure/variables.tf
@@ -96,5 +96,4 @@ variable "expose_ssh" {

variable "node_key" {
description = "A unique ed25519 key that identifies the node"
}

}
6 changes: 6 additions & 0 deletions aws/terraform.tfvars.example
@@ -1,5 +1,11 @@
# This is the very minimum variables example file. You have to put the access and private keys for your AWS account, the validator keys from Polkadot, and a name and the content of your SSH public key that will be used to connect to the instance. For the full list of supported variables see the variables.tf file in the root directory of this repo.

## These can be omitted if AWS CLI is configured
# aws_access_keys = ["AAAAAAAAAAA"]
# aws_secret_keys = ["XXXXXXXXXXX"]

# Validator-related variables
validator_name = ""
validator_keys = {
key = {
key="0xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
Binary file added azure-architecture.png
