feat(HA): define and implement Highly Available chall-manager
pandatix committed Jan 27, 2024
1 parent 02d7f91 commit 5611701
Showing 19 changed files with 524 additions and 69 deletions.
26 changes: 26 additions & 0 deletions DESIGN_DOCUMENT.md
@@ -5,6 +5,7 @@ Table of content:
- [Our proposal](#our-proposal)
- [Goal and perspectives](#goal-and-perspectives)
- [Internals](#internals)
- [High Availability](#high-availability)
- [Deployment](#deployment)
- [Local deployment for developers](#local-deployment-for-developers)
- [Production deployment](#production-deployment)
@@ -113,6 +114,25 @@ This reproducible generation method is represented as follows.
Being reproducible is necessary to ensure event reproducibility (in case of any infrastructure failure during an event) and to spare players from reconfiguring their scripts and tools on every _Challenge Scenario on Demand_ request.
Moreover, it provides consistent behavior across replicas in case of a High-Availability deployment.

### High Availability

Previously, we stated that the chall-manager must achieve high availability (denoted _HA_) to perform well at scale, e.g. for large events.
In this section, we show that our design can do so.

First of all, for simplicity, we want the chall-manager to be database-less (denoted _DB-less_).
Storing the _Challenge Scenario_ stacks and their _Challenge Scenario on Demand_ states would normally require an object database, as they are files.
Our decision is to write them to the filesystem so that they can easily be stored, shared and replicated across a cluster of machines.
These needs are met by using [Longhorn](https://longhorn.io/) to back a `PersistentVolumeClaim` that holds the states directory (configurable with the CLI flag `--states-dir`), with the `ReadWriteMany` access mode.

Storing is one thing, race conditions are another: if an end user spams the chall-manager with concurrent requests through the CTF platform, concurrent actions such as creating an infrastructure will be performed.
To avoid this, our design makes use of locks, either the distributed lock system of [etcd](https://etcd.io/) for HA, or generic file locks (which only work on the same host machine, within the same context).
The chall-manager creates an entry for the identity, then locks it. In case of sudden failure, the lock is always released: either etcd loses contact with the requesting `Pod` and thus releases the distributed lock, or the file lock is released when the holding process is killed.
In the end, this scheme ensures the chall-manager can scale properly while maintaining the integrity of the underlying infrastructures.

In our design, we deploy a dedicated etcd instance rather than reusing the one already present in Kubernetes. By doing so, we avoid deep integration of our proposal into the cluster, which enables multiple instances to run in parallel inside an existing cluster. Nevertheless, as etcd could be used as a simplistic database, our design could be argued to be non-DB-less; this does not imply it suffers from such a limitation though.

Moreover, thanks to this design, we provide interoperability with additional systems that can easily integrate with the distributed locks and shared volumes. That said, we think designs that would perform such integrations should be discussed, so they can improve the chall-manager directly.

### Deployment

When deploying resources to a Kubernetes cluster with the necessity of high availability and security, a beginner may focus only on getting things to work. We do not want that, because by the very design of the chall-manager, code is run from remote inputs we cannot trust by default (no authentication is part of the chall-manager, nor do we want it to be).
@@ -190,6 +210,12 @@ The reason here is to avoid maintenance and documentation deltas, as we test th
It separates the namespace the chall-manager is deployed into (which should be the same as the CTF platform instances') from the namespace the challenges run into. This enables the network policies to ensure that in-cluster resources that get compromised won't enable players to pivot to the internal services.
Moreover, thanks to the [SDK](#sdk), the default behavior of created resources is to isolate themselves.

The following figure shows the Kubernetes infrastructure that will be deployed. The ontology is the one defined by Kubernetes.

<div align="center">
<img src="deploy/infrastructure.excalidraw.png">
</div>

In case of emergency, an Ops can destroy the whole namespace. This will break the link between the chall-manager and its resources, but will enable your cluster to stop permitting players to connect into it.
Integrations should be aware of this scenario and handle it to recover properly.
Such a scenario is realistic, as it could also happen through chaos engineering practices.
29 changes: 16 additions & 13 deletions api/v1/launch/delete.go
@@ -5,9 +5,9 @@ import (
"encoding/json"
"os"
"path/filepath"
"sync"

"github.com/ctfer-io/chall-manager/global"
"github.com/ctfer-io/chall-manager/lock"
"github.com/pkg/errors"
"github.com/pulumi/pulumi/sdk/v3/go/auto"
"github.com/pulumi/pulumi/sdk/v3/go/common/apitype"
@@ -17,20 +17,24 @@ import (
)

func (server *launcherServer) DeleteLaunch(ctx context.Context, req *LaunchRequest) (*emptypb.Empty, error) {
logger := global.Log()

// 1. Generate request identity
id := identity(req.ChallengeId, req.SourceId)

// 2. Make sure only 1 parallel launch for this challenge (avoid overwriting files
// during parallel requests handling).
challLock.Lock()
mx, ok := challLocks[req.ChallengeId]
if !ok {
mx = &sync.Mutex{}
challLocks[req.ChallengeId] = mx
// 2. Make sure only 1 parallel launch for this challenge
// (avoid overwriting files during parallel requests handling).
release, err := lock.Acquire(ctx, id)
if err != nil {
return nil, err
}
mx.Lock()
defer mx.Unlock()
challLock.Unlock()
defer func() {
if err := release(); err != nil {
logger.Error("failed to release lock, could leave the identity stuck until renewal",
zap.Error(err),
)
}
}()

// 3. Decode+Unzip scenario
dir, err := decodeAndUnzip(req.ChallengeId, req.Scenario)
@@ -45,8 +49,7 @@ func (server *launcherServer) DeleteLaunch(ctx context.Context, req *LaunchReque
}

// 5. Call factory
global.Log().Info(
"destroying challenge scenario",
logger.Info("destroying challenge scenario",
zap.String("challenge_id", req.ChallengeId),
zap.String("stack_name", stack.Name()),
)
44 changes: 19 additions & 25 deletions api/v1/launch/post.go
@@ -10,39 +10,34 @@ import (
"os"
"path/filepath"
"slices"
"sync"

"github.com/ctfer-io/chall-manager/global"
"github.com/ctfer-io/chall-manager/lock"
"github.com/pkg/errors"
"github.com/pulumi/pulumi/sdk/v3/go/auto"
"go.uber.org/zap"
"gopkg.in/yaml.v3"
)

var (
runtimes = []string{
"go",
}

challLock sync.Mutex
challLocks = map[string]*sync.Mutex{}
)

func (server *launcherServer) CreateLaunch(ctx context.Context, req *LaunchRequest) (*LaunchResponse, error) {
logger := global.Log()

// 1. Generate request identity
id := identity(req.ChallengeId, req.SourceId)

// 2. Make sure only 1 parallel launch for this challenge (avoid overwriting files
// during parallel requests handling).
challLock.Lock()
mx, ok := challLocks[req.ChallengeId]
if !ok {
mx = &sync.Mutex{}
challLocks[req.ChallengeId] = mx
// 2. Make sure only 1 parallel launch for this challenge instance
// (avoid overwriting files during parallel requests handling).
release, err := lock.Acquire(ctx, id)
if err != nil {
return nil, err
}
mx.Lock()
defer mx.Unlock()
challLock.Unlock()
defer func() {
if err := release(); err != nil {
logger.Error("failed to release lock, could leave the identity stuck until renewal",
zap.Error(err),
)
}
}()

// 3. Decode+Unzip scenario
dir, err := decodeAndUnzip(req.ChallengeId, req.Scenario)
@@ -67,8 +62,7 @@ func (server *launcherServer) CreateLaunch(ctx context.Context, req *LaunchReque
}

// 6. Call factory
global.Log().Info(
"deploying challenge scenario",
logger.Info("deploying challenge scenario",
zap.String("challenge_id", req.ChallengeId),
zap.String("stack_name", stack.Name()),
)
@@ -161,8 +155,8 @@ func createStack(ctx context.Context, req *LaunchRequest, dir string) (auto.Stac
return auto.Stack{}, err
}

// Check available runtimes
if !slices.Contains(runtimes, yml.Runtime) {
// Check supported runtimes
if !slices.Contains(global.PulumiRuntimes, yml.Runtime) {
return auto.Stack{}, fmt.Errorf("got unsupported runtime: %s", yml.Runtime)
}

@@ -173,7 +167,7 @@ func createStack(ctx context.Context, req *LaunchRequest, dir string) (auto.Stac
}
saToken, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
if err == nil {
envVars["CM_SATOKEN"] = string(saToken) // transmit the Kubernetes ServiceAccount token to the stack
envVars["CM_SATOKEN"] = string(saToken) // transmit the Kubernetes ServiceAccount projected token to the stack
}
ws, err := auto.NewLocalWorkspace(ctx,
auto.WorkDir(dir),
54 changes: 53 additions & 1 deletion cmd/chall-manager/main.go
@@ -8,6 +8,7 @@ import (
"os"
"os/signal"
"path/filepath"
"slices"
"syscall"

"github.com/ctfer-io/chall-manager/api/v1/launch"
@@ -87,6 +88,57 @@ func main() {
EnvVars: []string{"TRACING"},
Usage: "If set, turns on tracing through OpenTelemetry (see https://opentelemetry.io for more info).",
},
&cli.StringFlag{
Name: "lock-kind",
EnvVars: []string{"LOCK_KIND"},
Value: "etcd",
Destination: &global.Conf.Lock.Kind,
Usage: "Define the lock kind to use. It could either be \"etcd\" for Kubernetes-native deployments (recommended) or \"local\" for a flock on the host machine (not scalable, but at least handles local replicas).",
Action: func(ctx *cli.Context, s string) error {
if !slices.Contains([]string{"etcd", "local"}, s) {
return errors.New("invalid lock kind value")
}
return nil
},
},
&cli.StringSliceFlag{
Name: "lock-etcd-endpoints",
EnvVars: []string{"LOCK_ETCD_ENDPOINTS"},
Usage: "Define the etcd endpoints to reach for locks.",
Action: func(ctx *cli.Context, s []string) error {
if ctx.String("lock-kind") != "etcd" {
return errors.New("incompatible lock kind with lock-etcd-endpoints, expect etcd")
}

// use action instead of destination to avoid dealing with conversions
global.Conf.Lock.EtcdEndpoints = s
return nil
},
},
&cli.StringFlag{
Name: "lock-etcd-username",
EnvVars: []string{"LOCK_ETCD_USERNAME"},
Destination: &global.Conf.Lock.EtcdUsername,
Usage: "If lock kind is etcd, define the username to use to connect to the etcd cluster.",
Action: func(ctx *cli.Context, s string) error {
if ctx.String("lock-kind") != "etcd" {
return errors.New("incompatible lock kind with lock-etcd-username, expect etcd")
}
return nil
},
},
&cli.StringFlag{
Name: "lock-etcd-password",
EnvVars: []string{"LOCK_ETCD_PASSWORD"},
Destination: &global.Conf.Lock.EtcdPassword,
Usage: "If lock kind is etcd, define the password to use to connect to the etcd cluster.",
Action: func(ctx *cli.Context, s string) error {
if ctx.String("lock-kind") != "etcd" {
return errors.New("incompatible lock kind with lock-etcd-password, expect etcd")
}
return nil
},
},
},
Action: run,
Authors: []*cli.Author{
@@ -231,7 +283,7 @@ func run(c *cli.Context) error {
}
}

logger.Info("server existing")
logger.Info("server exiting")
return nil
}

