feat(HA): define and implement Highly Available chall-manager
pandatix committed Jan 27, 2024
1 parent 02d7f91 commit 5611701
Showing 19 changed files with 524 additions and 69 deletions.
26 changes: 26 additions & 0 deletions DESIGN_DOCUMENT.md
@@ -5,6 +5,7 @@ Table of content:
- [Our proposal](#our-proposal)
- [Goal and perspectives](#goal-and-perspectives)
- [Internals](#internals)
- [High Availability](#high-availability)
- [Deployment](#deployment)
- [Local deployment for developers](#local-deployment-for-developers)
- [Production deployment](#production-deployment)
@@ -113,6 +114,25 @@ This reproducible generation method is represented as follows.
Being reproducible is necessary to ensure event reproducibility (in case of any infrastructure failure during an event) and to spare players from reconfiguring their scripts and tools on every _Challenge Scenario on Demand_ request.
Moreover, it provides consistent behavior across replicas in case of a High-Availability deployment.

### High Availability

Previously, we stated that the chall-manager must achieve high availability (denoted _HA_) to perform well at scale, e.g. for large events.
In this section, we show that our design can do so.

First of all, for simplicity, we want the chall-manager to be database-less (denoted _DB-less_).
Storing the _Challenge Scenario_ stacks and their _Challenge Scenario on Demand_ states would normally require an object database, as they are files.
Our decision is to write them to the filesystem so that they can easily be stored, shared and replicated across a cluster of machines.
These needs are met by using [Longhorn](https://longhorn.io/) to back a `PersistentVolumeClaim` that holds the states directory (configurable with the CLI flag `--states-dir`), with the `ReadWriteMany` access mode.

Storing is one thing, race conditions are another: if an end user spams the chall-manager with concurrent requests through the CTF platform, concurrent actions such as creating an infrastructure will be performed.
To avoid this, our design makes use of locks, either the distributed lock system of [etcd](https://etcd.io/) for HA, or generic file locks (which only work on the same host machine, within the same context).
The chall-manager creates an entry for the identity, then locks it. In case of sudden failure, the lock is always released: either etcd loses contact with the requesting `Pod` and thus releases the distributed lock, or the file lock is released when the holding process is killed.
In the end, this scheme ensures the chall-manager can scale properly while maintaining the integrity of the underlying infrastructures.

In our design, we deploy a dedicated etcd instance rather than reusing the one already present in Kubernetes. By doing so, we avoid deep integration of our proposal into the cluster, which enables multiple instances to run in parallel inside an existing cluster. Nevertheless, as etcd could be used as a simplistic database, our design could be argued to be non-DB-less; this does not imply it suffers from such a limitation though.

Moreover, thanks to this design, we provide interoperability with additional systems that can easily integrate with the distributed locks and shared volumes. That said, we think designs that would perform such integrations should be discussed, so they can improve the chall-manager directly.

### Deployment

When deploying resources to a Kubernetes cluster with the necessity of high availability and security, a beginner may focus only on getting things to work. We do not want that, because by the very design of the chall-manager, code is run from remote inputs we cannot trust by default (no authentication is part of the chall-manager, nor do we want it to be).
@@ -190,6 +210,12 @@ The reason here is to avoid maintenance and documentation deltas, as we test th
It separates the namespace the chall-manager is deployed into (which should be the same as the CTF platform instances') from the namespace the challenges run into. This enables the network policies to ensure that in-cluster resources that get compromised won't enable players to pivot to the internal services.
Moreover, thanks to the [SDK](#sdk), the default behavior of created resources is to isolate themselves.

The following figure shows the Kubernetes infrastructure that will be deployed. The ontology is the one defined by Kubernetes.

<div align="center">
<img src="deploy/infrastructure.excalidraw.png">
</div>

In case of emergency, an Ops can destroy the whole namespace. This will break the link between the chall-manager and its resources, but will enable your cluster to stop permitting players to connect into it.
Integrations should be aware of this scenario and handle it to recover properly.
Such a scenario is realistic, as it could also happen through chaos engineering practices.
29 changes: 16 additions & 13 deletions api/v1/launch/delete.go
@@ -5,9 +5,9 @@ import (
"encoding/json"
"os"
"path/filepath"
"sync"

"github.com/ctfer-io/chall-manager/global"
"github.com/ctfer-io/chall-manager/lock"
"github.com/pkg/errors"
"github.com/pulumi/pulumi/sdk/v3/go/auto"
"github.com/pulumi/pulumi/sdk/v3/go/common/apitype"
@@ -17,20 +17,24 @@ import (
)

func (server *launcherServer) DeleteLaunch(ctx context.Context, req *LaunchRequest) (*emptypb.Empty, error) {
logger := global.Log()

// 1. Generate request identity
id := identity(req.ChallengeId, req.SourceId)

// 2. Make sure only 1 parallel launch for this challenge (avoid overwriting files
// during parallel requests handling).
challLock.Lock()
mx, ok := challLocks[req.ChallengeId]
if !ok {
mx = &sync.Mutex{}
challLocks[req.ChallengeId] = mx
// 2. Make sure only 1 parallel launch for this challenge
// (avoid overwriting files during parallel requests handling).
release, err := lock.Acquire(ctx, id)
if err != nil {
return nil, err
}
mx.Lock()
defer mx.Unlock()
challLock.Unlock()
defer func() {
if err := release(); err != nil {
logger.Error("failed to release lock, could leave the identity stuck until renewal",
zap.Error(err),
)
}
}()

// 3. Decode+Unzip scenario
dir, err := decodeAndUnzip(req.ChallengeId, req.Scenario)
@@ -45,8 +49,7 @@ func (server *launcherServer) DeleteLaunch(ctx context.Context, req *LaunchReque
}

// 5. Call factory
global.Log().Info(
"destroying challenge scenario",
logger.Info("destroying challenge scenario",
zap.String("challenge_id", req.ChallengeId),
zap.String("stack_name", stack.Name()),
)
44 changes: 19 additions & 25 deletions api/v1/launch/post.go
@@ -10,39 +10,34 @@ import (
"os"
"path/filepath"
"slices"
"sync"

"github.com/ctfer-io/chall-manager/global"
"github.com/ctfer-io/chall-manager/lock"
"github.com/pkg/errors"
"github.com/pulumi/pulumi/sdk/v3/go/auto"
"go.uber.org/zap"
"gopkg.in/yaml.v3"
)

var (
runtimes = []string{
"go",
}

challLock sync.Mutex
challLocks = map[string]*sync.Mutex{}
)

func (server *launcherServer) CreateLaunch(ctx context.Context, req *LaunchRequest) (*LaunchResponse, error) {
logger := global.Log()

// 1. Generate request identity
id := identity(req.ChallengeId, req.SourceId)

// 2. Make sure only 1 parallel launch for this challenge (avoid overwriting files
// during parallel requests handling).
challLock.Lock()
mx, ok := challLocks[req.ChallengeId]
if !ok {
mx = &sync.Mutex{}
challLocks[req.ChallengeId] = mx
// 2. Make sure only 1 parallel launch for this challenge instance
// (avoid overwriting files during parallel requests handling).
release, err := lock.Acquire(ctx, id)
if err != nil {
return nil, err
}
mx.Lock()
defer mx.Unlock()
challLock.Unlock()
defer func() {
if err := release(); err != nil {
logger.Error("failed to release lock, could leave the identity stuck until renewal",
zap.Error(err),
)
}
}()

// 3. Decode+Unzip scenario
dir, err := decodeAndUnzip(req.ChallengeId, req.Scenario)
@@ -67,8 +62,7 @@ func (server *launcherServer) CreateLaunch(ctx context.Context, req *LaunchReque
}

// 6. Call factory
global.Log().Info(
"deploying challenge scenario",
logger.Info("deploying challenge scenario",
zap.String("challenge_id", req.ChallengeId),
zap.String("stack_name", stack.Name()),
)
@@ -161,8 +155,8 @@ func createStack(ctx context.Context, req *LaunchRequest, dir string) (auto.Stac
return auto.Stack{}, err
}

// Check available runtimes
if !slices.Contains(runtimes, yml.Runtime) {
// Check supported runtimes
if !slices.Contains(global.PulumiRuntimes, yml.Runtime) {
return auto.Stack{}, fmt.Errorf("got unsupported runtime: %s", yml.Runtime)
}

@@ -173,7 +167,7 @@ func createStack(ctx context.Context, req *LaunchRequest, dir string) (auto.Stac
}
saToken, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
if err == nil {
envVars["CM_SATOKEN"] = string(saToken) // transmit the Kubernetes ServiceAccount token to the stack
envVars["CM_SATOKEN"] = string(saToken) // transmit the Kubernetes ServiceAccount projected token to the stack
}
ws, err := auto.NewLocalWorkspace(ctx,
auto.WorkDir(dir),
54 changes: 53 additions & 1 deletion cmd/chall-manager/main.go
@@ -8,6 +8,7 @@ import (
"os"
"os/signal"
"path/filepath"
"slices"
"syscall"

"github.com/ctfer-io/chall-manager/api/v1/launch"
@@ -87,6 +88,57 @@ func main() {
EnvVars: []string{"TRACING"},
Usage: "If set, turns on tracing through OpenTelemetry (see https://opentelemetry.io for more info).",
},
&cli.StringFlag{
Name: "lock-kind",
EnvVars: []string{"LOCK_KIND"},
Value: "etcd",
Destination: &global.Conf.Lock.Kind,
Usage: "Define the lock kind to use. It could either be \"etcd\" for Kubernetes-native deployments (recommended) or \"local\" for a flock on the host machine (not scalable, but at least handles local replicas).",
Action: func(ctx *cli.Context, s string) error {
if !slices.Contains([]string{"etcd", "local"}, s) {
return errors.New("invalid lock kind value")
}
return nil
},
},
&cli.StringSliceFlag{
Name: "lock-etcd-endpoints",
EnvVars: []string{"LOCK_ETCD_ENDPOINTS"},
Usage: "Define the etcd endpoints to reach for locks.",
Action: func(ctx *cli.Context, s []string) error {
if ctx.String("lock-kind") != "etcd" {
return errors.New("incompatible lock kind with lock-etcd-endpoints, expect etcd")
}

// use action instead of destination to avoid dealing with conversions
global.Conf.Lock.EtcdEndpoints = s
return nil
},
},
&cli.StringFlag{
Name: "lock-etcd-username",
EnvVars: []string{"LOCK_ETCD_USERNAME"},
Destination: &global.Conf.Lock.EtcdUsername,
Usage: "If lock kind is etcd, define the username to use to connect to the etcd cluster.",
Action: func(ctx *cli.Context, s string) error {
if ctx.String("lock-kind") != "etcd" {
return errors.New("incompatible lock kind with lock-etcd-username, expect etcd")
}
return nil
},
},
&cli.StringFlag{
Name: "lock-etcd-password",
EnvVars: []string{"LOCK_ETCD_PASSWORD"},
Destination: &global.Conf.Lock.EtcdPassword,
Usage: "If lock kind is etcd, define the password to use to connect to the etcd cluster.",
Action: func(ctx *cli.Context, s string) error {
if ctx.String("lock-kind") != "etcd" {
return errors.New("incompatible lock kind with lock-etcd-password, expect etcd")
}
return nil
},
},
},
Action: run,
Authors: []*cli.Author{
@@ -231,7 +283,7 @@ func run(c *cli.Context) error {
}
}

logger.Info("server existing")
logger.Info("server exiting")
return nil
}

