From 7db04d57aa08353cf66e51d4b9dcae4a8d009930 Mon Sep 17 00:00:00 2001 From: gfry <22219650+guifry@users.noreply.github.com> Date: Fri, 7 Feb 2025 23:24:16 +0000 Subject: [PATCH] More documentation (#280) * docs: more documentation * infra: different resources depending on environment + docs * docs: more documentation about infrascrutcture and TODOs --- README.md | 155 ++---------------- TODO.md | 62 +++++++ docs/architecture.md | 35 ++++ infra/README.md | 6 + .../environments/development/terraform.tfvars | 16 +- .../environments/production/terraform.tfvars | 6 +- infra/environments/staging/terraform.tfvars | 14 +- infra/modules/consent-api/cloud_run.tf | 7 + infra/modules/consent-api/variables.tf | 12 ++ todos.txt | 13 ++ 10 files changed, 168 insertions(+), 158 deletions(-) create mode 100644 TODO.md create mode 100644 todos.txt diff --git a/README.md b/README.md index 79ec9ea..b1611af 100644 --- a/README.md +++ b/README.md @@ -39,6 +39,12 @@ remembering a user's preferences without repeatedly asking for consent. clicking a link (eg via a bookmark, or typing in the URL to the address bar in your browser), your consent preferences will be remembered. +9. **Audit Logging**: Following the CQRS (Command Query Responsibility Segregation) pattern, + whenever consent data is written to the PostgreSQL database, an event is also pushed + to a BigQuery dataset. This provides a complete audit trail of all consent changes, + enabling future analysis and compliance verification if needed. + + ## System Architecture ![System Architecture Diagram](docs/diagram.png) @@ -58,20 +64,6 @@ docker compose up ## Installation -To make use of the Single Consent service on your website, please see the -[Single Consent client Quick Start documentation](client/README.md) - -#### Environment variables - -- `DATABASE_URL` (default is `postgresql+asyncpg://localhost:5432/consent_api`) -- `ENV` (default is `development`) -- `PORT` (default is `8000`) -- `SECRET_KEY` (default is randomly generated) -- You can configure the number of web server worker processes with the - `WEB_CONCURRENCY` environment variable (default is 1) - -## Development - You can run all the services without setup needed: ```shell @@ -79,144 +71,19 @@ make docker-build docker compose up ``` -### Loading the environment with direnv - -When running docker commands, you will need a few extra environment variables. - -It's easiest to use [Direnv](https://direnv.net/) to load the environment. - -Copy `.envrc.template` to `.envrc` and load it with direnv: - -```shell -direnv allow -``` - -Those variables will be used by both docker-compose and the Makefile. - -Additionally, we recommend [hooking direnv with your shell](https://direnv.net/docs/hook.html), for automatic environment loading. - -### Run Locally - -To run the API locally: - -```shell -make install -make run -``` - -It will install poetry, our python dependencies manager, as well as the project dependencies. -### Testing +Each time a file is modified in the applications, the container application will restart. 
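The Audit Logging item added to the README above describes a CQRS-style write path: the consent record goes to PostgreSQL and a matching event is pushed to BigQuery. As a rough illustration only (the table and field names below are assumptions, not the repository's schema; only the `consent_audit_logs` dataset id appears in the Terraform defaults elsewhere in this patch), the BigQuery side of that write might look like:

```python
# Hedged sketch of the audit write described above -- not the repository's
# implementation. Table and field names are assumptions; only the
# "consent_audit_logs" dataset id comes from the Terraform defaults.
import datetime
import json
import uuid

from google.cloud import bigquery

AUDIT_TABLE = "sde-consent-api.consent_audit_logs.consent_events"  # assumed table id


def log_consent_event(client: bigquery.Client, uid: str, status: dict) -> None:
    """Append a consent-change event to the BigQuery audit table."""
    row = {
        "event_id": str(uuid.uuid4()),
        "uid": uid,
        "status": json.dumps(status),
        "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    # Streaming insert; returns a list of per-row errors, empty on success.
    errors = client.insert_rows_json(AUDIT_TABLE, [row])
    if errors:
        # The PostgreSQL write has already succeeded, so surface the audit
        # failure instead of silently dropping the event.
        raise RuntimeError(f"BigQuery audit insert failed: {errors}")
```

Streaming inserts keep the audit event on the command path without waiting on a batch load job, which matches the "event is also pushed" wording above.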
-#### Unit tests -Run unit tests with the following command: +## Integration Tests -```sh -make test ``` - -#### End-to-end tests - -##### Running in Docker Compose - -You will need to build a Docker image to run the tests against, using the -following command: - -```sh -make docker-build -``` - -You also need to have the Chrome Docker image already on your system, which you -can do with the following command: - -```sh -docker pull selenoid/chrome:110.0 -``` - -> **Note** -> Currently, Selenoid does not provide a Chrome image that works on Apple M1 hosts. As a -> workaround, you can use a third-party Chromium image: -> -> ```sh -> docker pull sskorol/selenoid_chromium_vnc:100.0 -> ``` -> -> Then set the following environment variable: -> -> ```sh -> export SPLINTER_REMOTE_BROWSER_VERSION=sskorol/selenoid_chromium_vnc:100.0 -> ``` - -The easiest way to run the end-to-end tests is in Docker Compose using the following -command: - -```sh -make test-end-to-end-docker +cd apps/consent-api/tests +BASE_URL=http://localhost:8000 poetry run pytest . ``` -##### Running locally - -To run end-to-end tests you will need Chrome or Firefox installed. Specify which you -want to use for running tests by setting the `SELENIUM_DRIVER` environment variable -(defaults to `chrome`), eg: - -```sh -export SELENIUM_DRIVER=firefox -``` - -You also need a running instance of the Consent API and two instances of webapps -which have the Single Consent client installed. - -> Note -> For convenience, a dummy service is included in the API. -> You can run two more instances of the Consent API on different port numbers to -> act as dummy services: -> -> ```sh -> CONSENT_API_ORIGIN=http://localho.st:8000 OTHER_SERVICE_ORIGIN=http://localho.st:8082 PORT=8081 make run -> ``` -> -> and -> -> ```sh -> CONSENT_API_ORIGIN=http://localho.st:8000 PORT=8082 make run -> ``` - -The tests expect to find these available at the following URLs: - -| Name | Env var | Default | -| --------------- | ---------------------------- | ---------------------- | -| Consent API | E2E_TEST_CONSENT_API_URL | http://localho.st:8000 | -| Dummy service 1 | E2E_TEST_DUMMY_SERVICE_1_URL | http://localho.st:8080 | -| Dummy service 2 | E2E_TEST_DUMMY_SERVICE_2_URL | http://localho.st:8081 | - -Due to CORS restrictions, the tests will fail if the URL domain is `localhost` or -`127.0.01`, so a workaround is to use `localho.st` which resolves to `127.0.0.1`. - -Run the tests with the following command: - -``` -make test-end-to-end -``` - -### Branching - -This project uses [Github Flow](https://githubflow.github.io/). - -- `main` branch is always deployable -- To work on something new, create a descriptively named branch off `main` -- Commit to that branch locally and regularly push to the same named branch on the - server (Github) -- When you need feedback or help, or you think the branch is ready to merge, rebase off - `main` and open a pull request -- After the pull request has been reviewed and automated checks have passed, you can - merge to `main` -- Commits to `main` are automatically built, deployed and tested in the Integration - environment. +You can also point the integration tests at the cloud instances by specifying the URL. -New features are developed on feature branches, which must be rebased on the main branch -and squashed before merging to main. 
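The Integration Tests section added above drives pytest through a `BASE_URL` environment variable. Purely as an illustration of that pattern (this is not the repository's actual test code; the `/health` route and the `httpx` dependency are assumptions), such a test might pick the URL up like this:

```python
# Illustration only: how an integration test can pick up BASE_URL.
# The "/health" route and the httpx dependency are assumptions.
import os

import httpx

BASE_URL = os.environ.get("BASE_URL", "http://localhost:8000")


def test_service_is_reachable():
    # Any cheap GET works as a smoke test; swap in a real route of the API.
    response = httpx.get(f"{BASE_URL}/health", timeout=10)
    assert response.status_code == 200
```

Pointing `BASE_URL` at a cloud instance then reuses the same tests unchanged, as the section notes.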
## Documentation diff --git a/TODO.md b/TODO.md new file mode 100644 index 0000000..ca0f81a --- /dev/null +++ b/TODO.md @@ -0,0 +1,62 @@ +# Project TODOs and Production Readiness Checklist + +## Infrastructure Improvements + +### Cloud Run Configuration +- [ ] Fix default image deployment issue + - Current: Terraform deploys hello-world image during updates + - Need: Use latest tag or specified image variable + - Fallback: Use hello-world only if GCR image doesn't exist + +### Performance Optimization +- [ ] Implement aggressive scaling strategy + - [ ] Set lower CPU utilization threshold (around 50%) for production + - [ ] Goal: Maintain one spare instance to prevent startup delays + - Note: Only apply to production, not staging/development + +- [ ] Optimize instance resources + - Current: 1 vCPU, 1GB RAM per instance + - Proposed: 4 vCPU, 4GB RAM per instance + - Benefits: + - Reduced need for frequent scaling + - Better request latency handling + - More efficient Unicorn worker distribution + +### Server Optimization +- [ ] Investigate Unicorn optimization opportunities + - Current: Basic configuration + - Goal: Improve load distribution and reduce latency + - Areas to explore: + - Worker process configuration + - Connection pooling + - Request timeout settings + +## Cost-Performance Balance +- [ ] Evaluate resource allocation strategy + - Consider trade-off: Fewer, more powerful pods vs many smaller pods + - Focus on optimizing Unicorn configuration for better resource utilization + - Balance between scaling speed and resource efficiency + +## Notes for Future Development +- Service not yet in production with departments +- All scaling and performance configurations should be thoroughly tested before production deployment +- Monitor startup times and request latency during peak loads + + +## CI/CD and Testing Pipeline +- [ ] Migrate deployment scripts to GitHub Actions + - [ ] Set up deployment workflows for each environment + - [ ] Implement proper environment variable handling + - [ ] Add deployment approval gates for production + +- [ ] Implement automated testing in CI + - [ ] Run integration tests in GitHub Actions + - [ ] Configure Playwright end-to-end tests + - [ ] Set up test reporting and notifications + +## Security and Monitoring +- [ ] Enhance Cloud Armor configuration + - [ ] Test and monitor WAF rules + - [ ] Verify alert configurations + - [ ] Document incident response procedures + - [ ] Set up alert notifications for security events diff --git a/docs/architecture.md b/docs/architecture.md index 3fa68af..cce8cda 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -37,3 +37,38 @@ The Consent API follows Domain-Driven Design (DDD) and Hexagonal Architecture pr 3. API processes requests through its layered architecture 4. Data is stored in PostgreSQL for application state 5. Consent events are logged to BigQuery for audit purposes + +## Infrastructure Scaling Strategy + +### Resource Allocation by Environment + +The Single Consent service uses environment-specific resource allocation to ensure optimal performance while maintaining cost efficiency. 
Here's how resources are provisioned across environments: + +#### Production Environment +- **Cloud SQL**: 8 vCPU, 16GB RAM +- **Cloud Run**: 3-20 instances, 4 CPU cores and 2GB RAM per container +- **Rationale**: + - High-traffic public service serving millions of UK users + - Critical for maintaining low latency across multiple government domains + - Aggressive scaling strategy (min 3 instances) to handle traffic spikes without cold starts + - Higher resource allocation per instance reduces request latency and improves user experience + +#### Staging Environment +- **Cloud SQL**: 2 vCPU, 4GB RAM +- **Cloud Run**: 1-2 instances, 1 CPU core and 512MB RAM per container +- **Purpose**: Testing environment that mirrors production configuration but with reduced resources + +#### Development Environment +- **Cloud SQL**: 2 vCPU, 4GB RAM +- **Cloud Run**: 1-2 instances, 1 CPU core and 512MB RAM per container +- **Purpose**: Local development and testing with minimal resource allocation + +### Scaling Strategy + +The production environment employs an aggressive scaling strategy with a lower CPU utilization threshold (50%) for scaling up. This ensures: +1. Minimal cold starts by maintaining warm instances +2. Faster response to traffic spikes +3. Consistent performance across all government domains +4. Reduced latency for consent checks and updates + +This strategy is particularly important for the Single Consent service as it acts as a central point for cookie consent across multiple government domains, where any performance degradation could impact user experience across the entire gov.uk estate. diff --git a/infra/README.md b/infra/README.md index ccc117b..085cd21 100644 --- a/infra/README.md +++ b/infra/README.md @@ -53,6 +53,12 @@ terraform apply - variables.tf - terraform.tfvars (update values) +## Resource Management + +Infrastructure resources (Cloud Run instances and Cloud SQL databases) are managed through environment-specific variables, allowing flexible resource allocation based on environment needs. This enables appropriate scaling from development to production workloads. + +For detailed information about environment-specific hardware specifications and the rationale behind our resource allocation strategy, see the [Infrastructure Scaling Strategy](../docs/architecture.md#infrastructure-scaling-strategy) section in our architecture documentation. 
+ ## Module Updates When making changes to the shared module: diff --git a/infra/environments/development/terraform.tfvars b/infra/environments/development/terraform.tfvars index bd16ab4..198fcc6 100644 --- a/infra/environments/development/terraform.tfvars +++ b/infra/environments/development/terraform.tfvars @@ -1,16 +1,20 @@ environment = "development" project_id = "sde-consent-api" region = "europe-west2" -domain_name = "gds-single-consent-dev.app" +domain_name = "dev.gds-single-consent.app" db_name = "consent-api" # Development settings (commented for production testing) db_tier = "db-custom-2-4096" # 2 vCPU, 4GB RAM for development db_version = "POSTGRES_14" -db_deletion_protected = false -min_instances = 2 -max_instances = 5 -container_concurrency = 80 +db_deletion_protected = false # Allow deletion in development + +# Cloud Run configuration for development +min_instances = 1 # Minimum instances for development +max_instances = 2 # Maximum 2 instances for development +container_cpu = "1000m" # 1 CPU core per container +container_memory = "512Mi" # 512MB RAM per container +container_concurrency = 80 # Same concurrency settings # Production settings for load testing on the development instance # db_tier = "db-custom-8-16384" # 8 vCPU, 16GB RAM as in production @@ -21,4 +25,4 @@ container_concurrency = 80 # container_concurrency = 80 # Production concurrency # Load testing configuration -load_test_ip = "35.246.19.18" +load_test_ip = "35.246.19.18" # IP for load testing in development diff --git a/infra/environments/production/terraform.tfvars b/infra/environments/production/terraform.tfvars index aaff140..f7ef183 100644 --- a/infra/environments/production/terraform.tfvars +++ b/infra/environments/production/terraform.tfvars @@ -8,8 +8,10 @@ db_version = "POSTGRES_14" db_deletion_protected = true # Cloud Run configuration for high throughput -min_instances = 3 # Keep more warm instances ready -max_instances = 10 # Scale up to 10 instances +min_instances = 3 # Minimum instances for production load +max_instances = 20 # Scale up to 20 instances for high load +container_cpu = "4000m" # 4 CPU cores per container +container_memory = "2048Mi" # 2GB RAM per container container_concurrency = 80 # Optimize for throughput # Production requires no load test IP diff --git a/infra/environments/staging/terraform.tfvars b/infra/environments/staging/terraform.tfvars index ae333b9..6a1005b 100644 --- a/infra/environments/staging/terraform.tfvars +++ b/infra/environments/staging/terraform.tfvars @@ -1,16 +1,18 @@ environment = "staging" project_id = "sde-consent-api" region = "europe-west2" -domain_name = "gds-single-consent-staging.app" +domain_name = "staging.gds-single-consent.app" db_name = "consent-api" -db_tier = "db-custom-4-8192" # 4 vCPU, 8GB RAM for staging +db_tier = "db-custom-2-4096" # 2 vCPU, 4GB RAM for staging db_version = "POSTGRES_14" db_deletion_protected = true -# Cloud Run configuration -min_instances = 2 -max_instances = 8 -container_concurrency = 80 +# Cloud Run configuration for staging +min_instances = 1 # Minimum instances for staging +max_instances = 2 # Maximum 2 instances for staging +container_cpu = "1000m" # 1 CPU core per container +container_memory = "512Mi" # 512MB RAM per container +container_concurrency = 80 # Same concurrency settings # Load testing configuration load_test_ip = "35.246.19.18" diff --git a/infra/modules/consent-api/cloud_run.tf b/infra/modules/consent-api/cloud_run.tf index 08424c3..0a84a47 100644 --- a/infra/modules/consent-api/cloud_run.tf +++ 
b/infra/modules/consent-api/cloud_run.tf
@@ -29,6 +29,13 @@ resource "google_cloud_run_service" "this" {
       containers {
         image = local.container_image
 
+        resources {
+          limits = {
+            cpu    = var.container_cpu
+            memory = var.container_memory
+          }
+        }
+
         # Mount secrets as environment variables
         env {
           name = "DB_USER"
diff --git a/infra/modules/consent-api/variables.tf
index 89f450b..f861b8c 100644
--- a/infra/modules/consent-api/variables.tf
+++ b/infra/modules/consent-api/variables.tf
@@ -50,6 +50,18 @@ variable "bigquery_dataset_id" {
   default     = "consent_audit_logs"
 }
 
+variable "container_cpu" {
+  description = "CPU cores for Cloud Run container (e.g., '1000m' for 1 core, '4000m' for 4 cores)"
+  type        = string
+  default     = "1000m"
+}
+
+variable "container_memory" {
+  description = "Memory for Cloud Run container (e.g., '512Mi' for 512MB, '2048Mi' for 2GB)"
+  type        = string
+  default     = "512Mi"
+}
+
 variable "alert_email" {
   description = "Email address for monitoring alerts"
   type        = string
diff --git a/todos.txt b/todos.txt
new file mode 100644
index 0000000..4cf61bc
--- /dev/null
+++ b/todos.txt
@@ -0,0 +1,13 @@
+Be careful: at the moment, every time a deploy through the infra Terraform layer changes the Cloud Run instance, it deploys the default hello-world image, which means the service is down until the actual latest-tagged image is deployed. You might want to change that in Terraform: use the latest-tagged image, or an image specified in a variable, and fall back to the default Cloud Run hello-world image only if the image doesn't exist in GCR.
+
+The Single Consent service is not yet deployed to production with departments using it. The production Cloud SQL instance currently has low specs; when it is ready to move to production, I recommend upgrading it to 16 GB of RAM and 8 CPU cores in order to withstand the traffic of millions of users in the United Kingdom.
+
+Make sure to have different instance specs for the database and the Cloud Run pods in development, staging and production. Development and staging shouldn't need more than two pods, and they can have the minimum specs. Production should have a minimum of 3 pods and a maximum of 20 pods with good specs: 4 CPU cores and at least 2 GB of RAM each.
+
+A good idea is to scale up aggressively, to avoid dropping or delaying traffic because of instance startup latency. For example, if two instances are fully utilized and traffic keeps growing, a third instance is spawned, but that takes time (perhaps 30 seconds) during which requests could fail or be delayed. The idea is to always have one more instance than needed in production so we never hit those delays, and the way to do that is to set a low CPU utilization threshold for scaling up, such as 50%. Of course, that is a setting for production only, not staging or development.
+
+
+There is no Gunicorn/Uvicorn tuning at the moment, apparently; it is just a simple command. Optimizing the server configuration might help spread the load and make the server faster, with fewer delayed or timed-out requests and lower latency. Investigate that, because we still see latency even with lots of pods and strong specs such as a 4-core CPU and 4 GB of RAM.
+
+Also, at the moment it is 1 GB of RAM and 1 vCPU per instance, but you might want to look into having 4 vCPUs and 4 GB per instance (per pod), so that we don't have to scale out as much and there is less request latency, because much of the load is distributed across Gunicorn/Uvicorn workers within each instance. We can do that just by modifying the server command in the Makefile. That software change, together with the resource change in Cloud Run, is I think the best way to make it cheaper and faster, rather than relying on scaling out the pods. We might still need to scale out in Cloud Run, but not as much, and the slow instance start-ups matter less when each instance has multiple vCPUs.
+
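The server-tuning notes above (and the matching items in TODO.md) suggest that how workers are configured is worth investigating. As a hedged sketch only — it assumes the service is served by Gunicorn with Uvicorn workers, which these notes do not confirm — the usual way to express that tuning in Python is a `gunicorn.conf.py` rather than a long inline command in the Makefile:

```python
# gunicorn.conf.py -- illustrative sketch, not the project's actual config.
# Assumes Gunicorn fronting Uvicorn workers; adjust if the service actually
# uses a different server command.
import multiprocessing
import os

bind = f"0.0.0.0:{os.environ.get('PORT', '8000')}"

# Scale workers with the vCPUs allocated to the Cloud Run container
# (e.g. 4 vCPUs -> 9 workers with the classic 2n+1 rule); WEB_CONCURRENCY
# still wins when it is set.
workers = int(os.environ.get("WEB_CONCURRENCY", multiprocessing.cpu_count() * 2 + 1))
worker_class = "uvicorn.workers.UvicornWorker"  # async workers for the ASGI app

# Keep slow requests from tying up a worker indefinitely.
timeout = 30
graceful_timeout = 30
keepalive = 5
```

If the Makefile currently launches the server with a single inline command, pointing that command at a config file like this keeps the worker, timeout and keep-alive tuning reviewable in one place as the instance sizes change.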