More documentation (#280)
* docs: more documentation

* infra: different resources depending on environment + docs

* docs: more documentation about infrastructure and TODOs
guifry authored Feb 7, 2025
1 parent b582ea0 commit 7db04d5
Showing 10 changed files with 168 additions and 158 deletions.
155 changes: 11 additions & 144 deletions README.md
@@ -39,6 +39,12 @@ remembering a user's preferences without repeatedly asking for consent.
clicking a link (eg via a bookmark, or typing in the URL to the address bar
in your browser), your consent preferences will be remembered.

9. **Audit Logging**: Following the CQRS (Command Query Responsibility Segregation) pattern,
whenever consent data is written to the PostgreSQL database, an event is also pushed
to a BigQuery dataset. This provides a complete audit trail of all consent changes,
enabling future analysis and compliance verification if needed.
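
This write path can be sketched as follows (a minimal illustration of the CQRS dual write, with the PostgreSQL table and BigQuery dataset stubbed as in-memory stores; all names here are illustrative, not the service's actual code):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConsentStore:
    """Stand-ins for the real PostgreSQL table and BigQuery dataset."""
    postgres: dict = field(default_factory=dict)          # current application state
    bigquery_events: list = field(default_factory=list)   # append-only audit log


def set_consent(store: ConsentStore, uid: str, preferences: dict) -> None:
    # Command side: update application state (PostgreSQL in the real service)...
    store.postgres[uid] = preferences
    # ...and push an immutable audit event (BigQuery in the real service).
    store.bigquery_events.append({
        "uid": uid,
        "preferences": preferences,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })


store = ConsentStore()
set_consent(store, "abc123", {"analytics": True, "settings": False})
set_consent(store, "abc123", {"analytics": False, "settings": False})
```

The key property: state holds only the latest value per user, while the audit log keeps one event for every change.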


## System Architecture

![System Architecture Diagram](docs/diagram.png)
@@ -58,165 +64,26 @@ docker compose up

## Installation

To make use of the Single Consent service on your website, please see the
[Single Consent client Quick Start documentation](client/README.md).

#### Environment variables

- `DATABASE_URL` (default is `postgresql+asyncpg://localhost:5432/consent_api`)
- `ENV` (default is `development`)
- `PORT` (default is `8000`)
- `SECRET_KEY` (default is randomly generated)
- `WEB_CONCURRENCY`: the number of web server worker processes (default is `1`)
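
A sketch of how these settings might be loaded in one place (the variable names and defaults are from the list above; the `load_settings` helper itself is illustrative, not the service's actual code):

```python
import os
import secrets


def load_settings() -> dict:
    """Collect configuration from the environment, using the documented defaults."""
    return {
        "DATABASE_URL": os.environ.get(
            "DATABASE_URL", "postgresql+asyncpg://localhost:5432/consent_api"
        ),
        "ENV": os.environ.get("ENV", "development"),
        "PORT": int(os.environ.get("PORT", "8000")),
        # SECRET_KEY is randomly generated when not provided
        "SECRET_KEY": os.environ.get("SECRET_KEY") or secrets.token_hex(32),
        "WEB_CONCURRENCY": int(os.environ.get("WEB_CONCURRENCY", "1")),
    }


settings = load_settings()
```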

## Development

You can run all the services with no setup required:

```shell
make docker-build
docker compose up
```

### Loading the environment with direnv

When running docker commands, you will need a few extra environment variables.

It's easiest to use [Direnv](https://direnv.net/) to load the environment.

Copy `.envrc.template` to `.envrc` and load it with direnv:

```shell
direnv allow
```

Those variables will be used by both docker-compose and the Makefile.

Additionally, we recommend [hooking direnv with your shell](https://direnv.net/docs/hook.html), for automatic environment loading.

### Run Locally

To run the API locally:

```shell
make install
make run
```

This installs Poetry, our Python dependency manager, along with the project dependencies.

### Testing

Each time a file in the application is modified, the running container restarts
automatically, so you can leave the services running while developing.

#### Unit tests

Run unit tests with the following command:

```sh
make test
```

#### End-to-end tests

##### Running in Docker Compose

You will need to build a Docker image to run the tests against, using the
following command:

```sh
make docker-build
```

You also need the Chrome Docker image on your system, which you can pull with the
following command:

```sh
docker pull selenoid/chrome:110.0
```

> **Note**
> Currently, Selenoid does not provide a Chrome image that works on Apple M1 hosts. As a
> workaround, you can use a third-party Chromium image:
>
> ```sh
> docker pull sskorol/selenoid_chromium_vnc:100.0
> ```
>
> Then set the following environment variable:
>
> ```sh
> export SPLINTER_REMOTE_BROWSER_VERSION=sskorol/selenoid_chromium_vnc:100.0
> ```

The easiest way to run the end-to-end tests is in Docker Compose, using the following
command:

```sh
make test-end-to-end-docker
```

Alternatively, run the integration tests directly against a locally running instance:

```sh
cd apps/consent-api/tests
BASE_URL=http://localhost:8000 poetry run pytest .
```

##### Running locally

To run end-to-end tests you will need Chrome or Firefox installed. Specify which you
want to use for running tests by setting the `SELENIUM_DRIVER` environment variable
(defaults to `chrome`), eg:

```sh
export SELENIUM_DRIVER=firefox
```
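
The driver selection can be sketched as follows (a hypothetical helper mirroring the documented behaviour, not the actual test code):

```python
import os

SUPPORTED = {"chrome", "firefox"}


def choose_driver() -> str:
    """Pick the browser for end-to-end tests from SELENIUM_DRIVER (default: chrome)."""
    driver = os.environ.get("SELENIUM_DRIVER", "chrome").lower()
    if driver not in SUPPORTED:
        raise ValueError(f"unsupported SELENIUM_DRIVER: {driver!r}")
    return driver
```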

You also need a running instance of the Consent API and two instances of webapps
which have the Single Consent client installed.

> **Note**
> For convenience, a dummy service is included in the API.
> You can run two more instances of the Consent API on different port numbers to
> act as dummy services:
>
> ```sh
> CONSENT_API_ORIGIN=http://localho.st:8000 OTHER_SERVICE_ORIGIN=http://localho.st:8082 PORT=8081 make run
> ```
>
> and
>
> ```sh
> CONSENT_API_ORIGIN=http://localho.st:8000 PORT=8082 make run
> ```

The tests expect to find these available at the following URLs:

| Name            | Env var                      | Default                |
| --------------- | ---------------------------- | ---------------------- |
| Consent API     | E2E_TEST_CONSENT_API_URL     | http://localho.st:8000 |
| Dummy service 1 | E2E_TEST_DUMMY_SERVICE_1_URL | http://localho.st:8080 |
| Dummy service 2 | E2E_TEST_DUMMY_SERVICE_2_URL | http://localho.st:8081 |

Due to CORS restrictions, the tests will fail if the URL domain is `localhost` or
`127.0.0.1`, so a workaround is to use `localho.st`, which resolves to `127.0.0.1`.

Run the tests with the following command:

```sh
make test-end-to-end
```
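
The URL resolution described above can be sketched as follows (the environment variable names and defaults come from the table; the `test_urls` helper itself is illustrative):

```python
import os

# Defaults as documented; each can be overridden by its environment variable.
DEFAULTS = {
    "E2E_TEST_CONSENT_API_URL": "http://localho.st:8000",
    "E2E_TEST_DUMMY_SERVICE_1_URL": "http://localho.st:8080",
    "E2E_TEST_DUMMY_SERVICE_2_URL": "http://localho.st:8081",
}


def test_urls() -> dict:
    """Resolve each test URL from the environment, falling back to the default."""
    return {name: os.environ.get(name, default) for name, default in DEFAULTS.items()}
```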

### Branching

This project uses [GitHub Flow](https://githubflow.github.io/):

- The `main` branch is always deployable.
- To work on something new, create a descriptively named branch off `main`.
- Commit to that branch locally and regularly push to the same-named branch on the
  server (GitHub).
- When you need feedback or help, or you think the branch is ready to merge, rebase off
  `main` and open a pull request.
- After the pull request has been reviewed and automated checks have passed, you can
  merge to `main`.
- Commits to `main` are automatically built, deployed and tested in the Integration
  environment. You can also point the integration tests at the cloud instances by
  specifying the URL.

New features are developed on feature branches, which must be rebased on the main branch
and squashed before merging to main.

## Documentation

62 changes: 62 additions & 0 deletions TODO.md
@@ -0,0 +1,62 @@
# Project TODOs and Production Readiness Checklist

## Infrastructure Improvements

### Cloud Run Configuration
- [ ] Fix default image deployment issue
- Current: Terraform deploys hello-world image during updates
- Need: Use latest tag or specified image variable
- Fallback: Use hello-world only if GCR image doesn't exist

### Performance Optimization
- [ ] Implement aggressive scaling strategy
- [ ] Set lower CPU utilization threshold (around 50%) for production
- [ ] Goal: Maintain one spare instance to prevent startup delays
- Note: Only apply to production, not staging/development

- [ ] Optimize instance resources
- Current: 1 vCPU, 1GB RAM per instance
- Proposed: 4 vCPU, 4GB RAM per instance
- Benefits:
- Reduced need for frequent scaling
- Better request latency handling
    - More efficient Uvicorn worker distribution

### Server Optimization
- [ ] Investigate Uvicorn optimization opportunities
- Current: Basic configuration
- Goal: Improve load distribution and reduce latency
- Areas to explore:
- Worker process configuration
- Connection pooling
- Request timeout settings
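
As a starting point for the worker-process investigation, a common rule of thumb (an assumption here, not the service's current configuration) sizes the worker count, and hence `WEB_CONCURRENCY`, from the CPU count:

```python
def suggested_workers(cpu_cores: int) -> int:
    """Common (2 * cores) + 1 heuristic for sizing ASGI/WSGI worker processes.

    This is a rule of thumb only; the right WEB_CONCURRENCY value should be
    confirmed by load testing against real request latency.
    """
    return 2 * cpu_cores + 1


# e.g. for the proposed 4 vCPU instances:
workers = suggested_workers(4)  # 9
```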

## Cost-Performance Balance
- [ ] Evaluate resource allocation strategy
  - Consider the trade-off: fewer, more powerful instances vs. many smaller instances
  - Focus on optimizing Uvicorn configuration for better resource utilization
- Balance between scaling speed and resource efficiency

## Notes for Future Development
- Service not yet in production with departments
- All scaling and performance configurations should be thoroughly tested before production deployment
- Monitor startup times and request latency during peak loads


## CI/CD and Testing Pipeline
- [ ] Migrate deployment scripts to GitHub Actions
- [ ] Set up deployment workflows for each environment
- [ ] Implement proper environment variable handling
- [ ] Add deployment approval gates for production

- [ ] Implement automated testing in CI
- [ ] Run integration tests in GitHub Actions
- [ ] Configure Playwright end-to-end tests
- [ ] Set up test reporting and notifications

## Security and Monitoring
- [ ] Enhance Cloud Armor configuration
- [ ] Test and monitor WAF rules
- [ ] Verify alert configurations
- [ ] Document incident response procedures
- [ ] Set up alert notifications for security events
35 changes: 35 additions & 0 deletions docs/architecture.md
@@ -37,3 +37,38 @@ The Consent API follows Domain-Driven Design (DDD) and Hexagonal Architecture principles.
3. API processes requests through its layered architecture
4. Data is stored in PostgreSQL for application state
5. Consent events are logged to BigQuery for audit purposes

## Infrastructure Scaling Strategy

### Resource Allocation by Environment

The Single Consent service uses environment-specific resource allocation to ensure optimal performance while maintaining cost efficiency. Here's how resources are provisioned across environments:

#### Production Environment
- **Cloud SQL**: 8 vCPU, 16GB RAM
- **Cloud Run**: 3-20 instances, 4 CPU cores and 2GB RAM per container
- **Rationale**:
- High-traffic public service serving millions of UK users
- Critical for maintaining low latency across multiple government domains
- Aggressive scaling strategy (min 3 instances) to handle traffic spikes without cold starts
- Higher resource allocation per instance reduces request latency and improves user experience

#### Staging Environment
- **Cloud SQL**: 2 vCPU, 4GB RAM
- **Cloud Run**: 1-2 instances, 1 CPU core and 512MB RAM per container
- **Purpose**: Testing environment that mirrors production configuration but with reduced resources

#### Development Environment
- **Cloud SQL**: 2 vCPU, 4GB RAM
- **Cloud Run**: 1-2 instances, 1 CPU core and 512MB RAM per container
- **Purpose**: Local development and testing with minimal resource allocation
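
The allocation above can be summarised in code (the values come from the environment tfvars files; the lookup helper itself is illustrative):

```python
# Per-environment Cloud Run allocation, as documented above.
CLOUD_RUN_RESOURCES = {
    "production":  {"min_instances": 3, "max_instances": 20, "cpu": "4000m", "memory": "2048Mi"},
    "staging":     {"min_instances": 1, "max_instances": 2,  "cpu": "1000m", "memory": "512Mi"},
    "development": {"min_instances": 1, "max_instances": 2,  "cpu": "1000m", "memory": "512Mi"},
}


def resources_for(environment: str) -> dict:
    """Look up the Cloud Run resource allocation for an environment."""
    return CLOUD_RUN_RESOURCES[environment]
```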

### Scaling Strategy

The production environment employs an aggressive scaling strategy with a lower CPU utilization threshold (50%) for scaling up. This ensures:
1. Minimal cold starts by maintaining warm instances
2. Faster response to traffic spikes
3. Consistent performance across all government domains
4. Reduced latency for consent checks and updates

This strategy is particularly important for the Single Consent service as it acts as a central point for cookie consent across multiple government domains, where any performance degradation could impact user experience across the entire gov.uk estate.
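
The effect of the 50% utilization target can be sketched as simple capacity arithmetic (illustrative only; Cloud Run's real autoscaler considers more signals than this):

```python
import math


def desired_instances(in_flight: int, concurrency: int = 80,
                      target_utilization: float = 0.5,
                      min_instances: int = 3, max_instances: int = 20) -> int:
    """Instances needed so each one runs at or below the target utilization."""
    needed = math.ceil(in_flight / (concurrency * target_utilization))
    return max(min_instances, min(max_instances, needed))


# 200 concurrent requests at concurrency 80 and a 50% target need 5 instances,
# leaving headroom so a spike does not immediately cause cold starts.
```

Lowering the target from 100% to 50% roughly doubles the instance count for a given load, which is the intended trade of cost for latency headroom.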
6 changes: 6 additions & 0 deletions infra/README.md
@@ -53,6 +53,12 @@ terraform apply
- variables.tf
- terraform.tfvars (update values)

## Resource Management

Infrastructure resources (Cloud Run instances and Cloud SQL databases) are managed through environment-specific variables, allowing flexible resource allocation based on environment needs. This enables appropriate scaling from development to production workloads.

For detailed information about environment-specific hardware specifications and the rationale behind our resource allocation strategy, see the [Infrastructure Scaling Strategy](../docs/architecture.md#infrastructure-scaling-strategy) section in our architecture documentation.

## Module Updates

When making changes to the shared module:
16 changes: 10 additions & 6 deletions infra/environments/development/terraform.tfvars
@@ -1,16 +1,20 @@
environment = "development"
project_id = "sde-consent-api"
region = "europe-west2"
domain_name = "dev.gds-single-consent.app"
db_name = "consent-api"

# Development settings (commented for production testing)
db_tier = "db-custom-2-4096" # 2 vCPU, 4GB RAM for development
db_version = "POSTGRES_14"
db_deletion_protected = false # Allow deletion in development

# Cloud Run configuration for development
min_instances = 1 # Minimum instances for development
max_instances = 2 # Maximum 2 instances for development
container_cpu = "1000m" # 1 CPU core per container
container_memory = "512Mi" # 512MB RAM per container
container_concurrency = 80 # Same concurrency settings

# Production settings for load testing on the development instance
# db_tier = "db-custom-8-16384" # 8 vCPU, 16GB RAM as in production
Expand All @@ -21,4 +25,4 @@ container_concurrency = 80
# container_concurrency = 80 # Production concurrency

# Load testing configuration
load_test_ip = "35.246.19.18" # IP for load testing in development
6 changes: 4 additions & 2 deletions infra/environments/production/terraform.tfvars
@@ -8,8 +8,10 @@ db_version = "POSTGRES_14"
db_deletion_protected = true

# Cloud Run configuration for high throughput
min_instances = 3 # Minimum instances for production load
max_instances = 20 # Scale up to 20 instances for high load
container_cpu = "4000m" # 4 CPU cores per container
container_memory = "2048Mi" # 2GB RAM per container
container_concurrency = 80 # Optimize for throughput

# Production requires no load test IP
Expand Down
14 changes: 8 additions & 6 deletions infra/environments/staging/terraform.tfvars
@@ -1,16 +1,18 @@
environment = "staging"
project_id = "sde-consent-api"
region = "europe-west2"
domain_name = "staging.gds-single-consent.app"
db_name = "consent-api"
db_tier = "db-custom-2-4096" # 2 vCPU, 4GB RAM for staging
db_version = "POSTGRES_14"
db_deletion_protected = true

# Cloud Run configuration for staging
min_instances = 1 # Minimum instances for staging
max_instances = 2 # Maximum 2 instances for staging
container_cpu = "1000m" # 1 CPU core per container
container_memory = "512Mi" # 512MB RAM per container
container_concurrency = 80 # Same concurrency settings

# Load testing configuration
load_test_ip = "35.246.19.18"
7 changes: 7 additions & 0 deletions infra/modules/consent-api/cloud_run.tf
@@ -29,6 +29,13 @@ resource "google_cloud_run_service" "this" {
containers {
image = local.container_image

resources {
limits = {
cpu = var.container_cpu
memory = var.container_memory
}
}

# Mount secrets as environment variables
env {
name = "DB_USER"