Merge branch 'main' into 697-reduce-e2e-test-runtime

defenseunicorns · Jul 29, 2024 · 28de723
2 parents 3a5d7d7 + 89ff0a6

Showing 26 changed files with 484 additions and 32 deletions.
diff --git a/.github/workflows/e2e-playright.yaml b/.github/workflows/e2e-playright.yaml
@@ -63,7 +63,7 @@ jobs:
             python-version-file: 'pyproject.toml'
 
         - name: Install Python Deps
-          run: python -m pip install "."
+          run: python -m pip install ".[dev]"
 
         - name: Setup Node
           uses: actions/setup-node@60edb5dd545a775178f52524783378180af0d1f8 # v4.0.2
@@ -106,16 +106,6 @@ jobs:
             python -m pip install requests
             python -m pytest ./tests/e2e/test_supabase.py -v
 
-        ##########
-        # UI
-        ##########
-        - name: Deploy LFAI-UI
-          run: |
-            make build-ui LOCAL_VERSION=e2e-test
-            docker image prune -af
-            uds zarf package deploy packages/ui/zarf-package-leapfrogai-ui-amd64-e2e-test.tar.zst --confirm
-            rm packages/ui/zarf-package-leapfrogai-ui-amd64-e2e-test.tar.zst
-
         ##########
         # API
         ##########
@@ -131,12 +121,31 @@ jobs:
             python -m pip install requests
             python -m pytest ./tests/e2e/test_api.py -v
 
-        # Run the playwright UI tests using the deployed Supabase endpoint
+        ##########
+        # UI
+        ##########
+        - name: Deploy LFAI-UI
+          run: |
+            make build-ui LOCAL_VERSION=e2e-test
+            docker image prune -af
+            uds zarf package deploy packages/ui/zarf-package-leapfrogai-ui-amd64-e2e-test.tar.zst --confirm
+            rm packages/ui/zarf-package-leapfrogai-ui-amd64-e2e-test.tar.zst
+
+        # Run the playwright UI tests using the deployed Supabase endpoint and upload report as an artifact
         - name: UI/API/Supabase E2E Playwright Tests
           run: |
             cp src/leapfrogai_ui/.env.example src/leapfrogai_ui/.env
             TEST_ENV=CI PUBLIC_DISABLE_KEYCLOAK=true PUBLIC_SUPABASE_ANON_KEY=$ANON_KEY npm --prefix src/leapfrogai_ui run test:integration:ci
 
+        # Upload the Playwright report as an artifact
+        - name: Archive Playwright Report
+          uses: actions/upload-artifact@v4
+          if: ${{ !cancelled() }}
+          with:
+            name: playwright-report
+            path: src/leapfrogai_ui/e2e-report/
+            retention-days: 30
+
         # The UI can be removed after the Playwright tests are finished
         - name: Cleanup UI
           run: |

diff --git a/.github/workflows/pytest.yaml b/.github/workflows/pytest.yaml
@@ -56,7 +56,7 @@ jobs:
         run: docker run -p 50051:50051 -d --name=repeater ghcr.io/defenseunicorns/leapfrogai/repeater:dev
 
       - name: Install Python Deps
-        run: pip install "." "src/leapfrogai_api" "src/leapfrogai_sdk"
+        run: pip install ".[dev]" "src/leapfrogai_api" "src/leapfrogai_sdk"
 
       - name: Run Pytest
         run: python -m pytest tests/pytest -v

diff --git a/adr/0003-database.md b/adr/0003-database.md
@@ -14,7 +14,7 @@
 
 ## Status
 
-PROPOSED
+ACCEPTED
 
 ## Context
 

diff --git a/adr/0006-queueing-high-traffic.md b/adr/0006-queueing-high-traffic.md
@@ -0,0 +1,100 @@
+# Queueing and High Traffic
+
+## Table of Contents
+
+- [Queueing and High Traffic](#queueing-and-high-traffic)
+  - [Table of Contents](#table-of-contents)
+  - [Status](#status)
+  - [Context](#context)
+  - [Decision](#decision)
+  - [Rationale](#rationale)
+  - [Alternatives](#alternatives)
+  - [Related ADRs](#related-adrs)
+  - [References](#references)
+
+## Status
+
+PROPOSED
+
+## Context
+
+LeapfrogAI needs to handle a large volume of inference, file upload, and embeddings requests. To manage this level of activity without significant performance degradation, we need systems that prevent the platform from being overwhelmed or blocked, whether by a high volume of requests or by a single long-running task.
+
+Adding a queue management component can make request handling more efficient under high volumes of long-running tasks. However, it may introduce significant complexity to the system, so we must weigh the options carefully.
+
+Benefits of a queue system for request processing:
+- Allows the API to quickly respond even when the system is very busy.
+- Prevents requests from being dropped or timing out.
+- Allows resuming failed requests.
+- Enables throttling of message processing rate.
+
+## Decision
+
+We have decided to implement a multi-tiered approach to address the queueing and high traffic challenges:
+
+1. Address underlying bottlenecks in the system:
+   - Optimize endpoint implementations, processing of long-running tasks, and indexing of files.
+   - Reduce duplication of indexing efforts.
+   - Scale horizontal/vertical resources as needed.
+
+2. Implement a lightweight queueing solution using Supabase Realtime and FastAPI background tasks:
+   - Utilize Supabase Realtime for task status updates (in-progress, complete, etc.) and basic queueing.
+     - If issues arise with Supabase Realtime, fall back to RedPanda.
+   - Leverage FastAPI's background tasks to handle long-running operations asynchronously.
+
+3. Prepare for future scaling by designing the system to easily integrate with a more robust queueing solution:
+   - Design interfaces that can work with both our current lightweight solution and future, more robust options.
+   - Do not attempt to push Supabase Realtime beyond its designed limits; instead, plan to use RedPanda or RabbitMQ if those needs surface.
+
+## Rationale
+
+1. Addressing underlying bottlenecks:
+   - This approach ensures we're not masking performance issues with a queueing system.
+   - Optimizations can significantly improve system performance without adding complexity.
+
+2. Lightweight solution (Supabase Realtime and FastAPI background tasks):
+   - Leverages existing infrastructure (Supabase) reducing additional operational overhead.
+   - FastAPI background tasks provide a simple way to handle asynchronous operations without introducing new dependencies.
+   - This solution meets our current needs without over-engineering.
+
+3. Preparation for future scaling:
+   - Allows for easy transition to more robust solutions as the system grows.
+   - Prevents lock-in to a solution that may not meet future needs.
+
+We chose this approach over alternatives for a few reasons:
+- This tiered approach allows us to start with a simple solution while preparing for future growth.
+- Some alternatives are viable but would likely require significant additional setup and maintenance work to bring into the current environment.
+  - The additional setup includes, but is not limited to: new Zarf packages, updates to UDS bundles, spikes to integrate with the current app, resolving any permissions/hardening issues, and more containers to add to Iron Bank/Chainguard.
+- When load testing the system, the primary bottlenecks appeared to be in vector DB file indexing.
+  - These issues should be resolvable through optimizations, a light amount of queueing, and background tasks.
+  - Issues not related to indexing were primarily scalability issues, which can be resolved via resource limits, throttling, and improved horizontal and vertical scaling within the cluster.
+- Authentication will be an issue for every solution except Supabase Realtime.
+
+## Alternatives
+
+Queueing Solutions Considered:
+* RabbitMQ: Meets current and future needs.
+  * Well maintained JS and Python libraries.
+  * Requires additional, potentially significant integration work to bring into the k8s cluster.
+* Supabase Realtime: Lightweight and already integrated, but may not meet all future queueing needs.
+  * Well maintained JS and Python libraries.
+  * Can listen directly to db transactions.
+  * Already integrated with Supabase auth.
+* Kafka: Powerful but too heavy for our current requirements.
+  * Well maintained JS and Python libraries.
+  * Requires additional, potentially significant integration work to bring into the k8s cluster.
+* Celery: Good option for Python-based systems, but introduces additional dependencies.
+  * Python library is well maintained; JS library is not.
+* RedPanda: Accessible internally and provides a scalable solution.
+  * Well maintained JS and Python libraries as it supports the same tooling as Kafka.
+  * Zarf/UDS bundle already available.
+* Custom Python solution: Flexible but requires significant unnecessary development effort given the tools already available.
+
+## Related ADRs
+* [0003-database](0003-database.md)
+
+## References
+1. Supabase Realtime Documentation: https://supabase.com/docs/guides/realtime
+2. FastAPI Background Tasks: https://fastapi.tiangolo.com/tutorial/background-tasks/
+3. Celery Documentation: https://docs.celeryq.dev/en/stable/
+4. Kafka Documentation: https://kafka.apache.org/
+5. RabbitMQ Documentation: https://www.rabbitmq.com/docs
+6. RedPanda: https://docs.redpanda.com/docs/
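Decision item 3 in the ADR calls for interfaces that work with both the current lightweight solution and a future, more robust one. The sketch below shows one way such an interface could look. All names (`TaskQueue`, `InMemoryTaskQueue`, `TaskStatus`) are hypothetical and not part of LeapfrogAI. The in-memory, asyncio-based implementation is only a stand-in for what Supabase Realtime plus FastAPI background tasks (or later RedPanda/RabbitMQ) might provide behind the same interface. A semaphore illustrates the "throttling of message processing rate" benefit listed in the Context section.

```python
import asyncio
from abc import ABC, abstractmethod
from enum import Enum
from typing import Awaitable, Callable, Dict

Job = Callable[[], Awaitable[None]]


class TaskStatus(str, Enum):
    QUEUED = "queued"
    IN_PROGRESS = "in_progress"
    COMPLETE = "complete"
    FAILED = "failed"


class TaskQueue(ABC):
    """Hypothetical backend-agnostic interface; a RedPanda- or RabbitMQ-backed class could implement it later."""

    @abstractmethod
    async def enqueue(self, task_id: str, job: Job) -> None: ...

    @abstractmethod
    def status(self, task_id: str) -> TaskStatus: ...


class InMemoryTaskQueue(TaskQueue):
    """Lightweight stand-in: runs jobs as asyncio tasks, throttled by a concurrency limit."""

    def __init__(self, max_concurrency: int = 2) -> None:
        self._statuses: Dict[str, TaskStatus] = {}
        self._semaphore = asyncio.Semaphore(max_concurrency)
        self._tasks: set[asyncio.Task] = set()

    async def enqueue(self, task_id: str, job: Job) -> None:
        # Return immediately so the API can respond quickly; the job runs in the background.
        self._statuses[task_id] = TaskStatus.QUEUED
        task = asyncio.create_task(self._run(task_id, job))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def _run(self, task_id: str, job: Job) -> None:
        async with self._semaphore:  # at most max_concurrency jobs in progress at once
            self._statuses[task_id] = TaskStatus.IN_PROGRESS
            try:
                await job()
            except Exception:
                self._statuses[task_id] = TaskStatus.FAILED  # recorded so the task could be retried
            else:
                self._statuses[task_id] = TaskStatus.COMPLETE

    def status(self, task_id: str) -> TaskStatus:
        return self._statuses[task_id]

    async def join(self) -> None:
        """Wait for all outstanding jobs (useful in tests and on shutdown)."""
        if self._tasks:
            await asyncio.gather(*list(self._tasks))


async def demo() -> None:
    queue = InMemoryTaskQueue(max_concurrency=1)

    async def index_file() -> None:
        await asyncio.sleep(0.01)  # placeholder for a long-running indexing step

    await queue.enqueue("file-1", index_file)
    print(queue.status("file-1").value)  # queued
    await queue.join()
    print(queue.status("file-1").value)  # complete


if __name__ == "__main__":
    asyncio.run(demo())
```

Callers depend only on `TaskQueue`. Swapping the in-memory stand-in for a broker-backed class would therefore not change the API layer, which is the lock-in concern the Rationale section raises.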
diff --git a/packages/k3d-gpu/plugin/device-plugin-daemonset.yaml b/packages/k3d-gpu/plugin/device-plugin-daemonset.yaml
@@ -45,7 +45,7 @@ spec:
           - name: NVIDIA_VISIBLE_DEVICES
             value: all
           - name: NVIDIA_DRIVER_CAPABILITIES
-            value: all
+            value: compute,utility
           - name: MPS_ROOT
             value: /run/nvidia/mps
         securityContext:

diff --git a/packages/text-embeddings/chart/templates/deployment.yaml b/packages/text-embeddings/chart/templates/deployment.yaml
@@ -25,6 +25,11 @@ spec:
       labels:
         {{- include "chart.selectorLabels" . | nindent 8 }}
     spec:
+      {{- if gt (index .Values.resources.limits "nvidia.com/gpu") 0.0 }}
+      runtimeClassName: nvidia
+      {{- else if .Values.gpu.runtimeClassName }}
+      runtimeClassName: {{ .Values.gpu.runtimeClassName }}
+      {{- end }}
       securityContext:
         {{- toYaml .Values.podSecurityContext | nindent 8 }}
       containers:
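The new `runtimeClassName` conditional, which also appears in the whisper chart, picks a runtime class in three steps. If a GPU limit greater than zero is requested, it uses `nvidia`. Otherwise it uses `gpu.runtimeClassName` when that value is non-empty. Otherwise it omits the field and the pod stays CPU-only. The Python paraphrase below (hypothetical function name, not code from the repository) spells out that precedence. Dropping the quotes around `###ZARF_VAR_GPU_LIMIT###` in the values files presumably lets the substituted limit parse as a YAML number, so the `gt ... 0.0` comparison works on a number rather than a string.

```python
from typing import Optional


def select_runtime_class(gpu_limit: float, runtime_class_name: str = "") -> Optional[str]:
    """Return the runtimeClassName the chart would render, or None when the field is omitted.

    Hypothetical paraphrase of the Helm template logic, for illustration only.
    """
    # {{- if gt (index .Values.resources.limits "nvidia.com/gpu") 0.0 }}
    if gpu_limit > 0.0:
        return "nvidia"
    # {{- else if .Values.gpu.runtimeClassName }}: an empty string is falsy in Go templates too
    if runtime_class_name:
        return runtime_class_name
    # {{- end }}: no runtimeClassName is rendered, so the default container runtime is used
    return None


if __name__ == "__main__":
    print(select_runtime_class(1))            # nvidia
    print(select_runtime_class(0, "nvidia"))  # nvidia
    print(select_runtime_class(0))            # None
```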

diff --git a/packages/text-embeddings/embedding-values.yaml b/packages/text-embeddings/embedding-values.yaml
@@ -1,6 +1,9 @@
 image:
   tag: "###ZARF_CONST_IMAGE_VERSION###"
 
+gpu:
+  runtimeClassName: "###ZARF_VAR_GPU_CLASS_NAME###"
+
 resources:
   limits:
-    nvidia.com/gpu: "###ZARF_VAR_GPU_LIMIT###"
+    nvidia.com/gpu: ###ZARF_VAR_GPU_LIMIT###
diff --git a/packages/text-embeddings/zarf.yaml b/packages/text-embeddings/zarf.yaml
@@ -16,6 +16,10 @@ variables:
     description: The GPU limit for the model inferencing.
     default: "0"
     pattern: "^[0-9]+$"
+  - name: GPU_CLASS_NAME
+    description: The GPU class name for the model inferencing. Leave blank for CPU-only.
+    default: ""
+    pattern: "^(nvidia)?$"
 
 components:
   - name: text-embeddings-model
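The `GPU_LIMIT` and `GPU_CLASS_NAME` variables are constrained by regular expressions in `zarf.yaml`. `GPU_CLASS_NAME` accepts only an empty string (CPU-only) or exactly `nvidia`, and `GPU_LIMIT` accepts only non-negative integers. The Python check below is a sketch for reasoning about which values pass. It assumes Zarf applies each pattern to the whole raw string. For patterns this simple, Python's `re` and Go's `regexp` should agree. `fullmatch` is used because Python's `$` would otherwise also match before a trailing newline.

```python
import re

# Patterns copied from the zarf.yaml variable definitions in this commit.
GPU_LIMIT_PATTERN = re.compile(r"^[0-9]+$")
GPU_CLASS_NAME_PATTERN = re.compile(r"^(nvidia)?$")


def is_valid(pattern: re.Pattern, value: str) -> bool:
    """True if the whole value satisfies the pattern."""
    return pattern.fullmatch(value) is not None


if __name__ == "__main__":
    for value in ["", "nvidia", "NVIDIA", "nvidia-gpu"]:
        print(f"GPU_CLASS_NAME={value!r}: {is_valid(GPU_CLASS_NAME_PATTERN, value)}")
    for value in ["0", "2", "-1", "1.5", ""]:
        print(f"GPU_LIMIT={value!r}: {is_valid(GPU_LIMIT_PATTERN, value)}")
```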

diff --git a/packages/vllm/chart/templates/deployment.yaml b/packages/vllm/chart/templates/deployment.yaml
@@ -25,6 +25,7 @@ spec:
       labels:
         {{- include "chart.selectorLabels" . | nindent 8 }}
     spec:
+      runtimeClassName: {{ .Values.gpu.runtimeClassName }}
       securityContext:
         {{- toYaml .Values.podSecurityContext | nindent 8 }}
       containers:

diff --git a/packages/vllm/vllm-values.yaml b/packages/vllm/vllm-values.yaml
@@ -1,2 +1,5 @@
 image:
   tag: "###ZARF_CONST_IMAGE_VERSION###"
+
+gpu:
+  runtimeClassName: nvidia
diff --git a/packages/whisper/Dockerfile b/packages/whisper/Dockerfile
@@ -26,7 +26,7 @@ RUN pip uninstall -y ctranslate2 transformers[torch]
 RUN pip install packages/whisper/build/lfai_whisper*.whl --no-index --find-links=packages/whisper/build/
 
 # Use hardened ffmpeg image to get compiled binaries
-FROM cgr.dev/chainguard/ffmpeg:latest as ffmpeg
+FROM cgr.dev/chainguard/ffmpeg:latest AS ffmpeg
 
 # hardened and slim python image
 FROM ghcr.io/defenseunicorns/leapfrogai/python:3.11

diff --git a/packages/whisper/chart/templates/deployment.yaml b/packages/whisper/chart/templates/deployment.yaml
@@ -25,6 +25,11 @@ spec:
       labels:
         {{- include "chart.selectorLabels" . | nindent 8 }}
     spec:
+      {{- if gt (index .Values.resources.limits "nvidia.com/gpu") 0.0 }}
+      runtimeClassName: nvidia
+      {{- else if .Values.gpu.runtimeClassName }}
+      runtimeClassName: {{ .Values.gpu.runtimeClassName }}
+      {{- end }}
       securityContext:
         {{- toYaml .Values.podSecurityContext | nindent 8 }}
       containers:

diff --git a/packages/whisper/whisper-values.yaml b/packages/whisper/whisper-values.yaml
@@ -1,6 +1,9 @@
 image:
   tag: "###ZARF_CONST_IMAGE_VERSION###"
 
+gpu:
+  runtimeClassName: "###ZARF_VAR_GPU_CLASS_NAME###"
+
 resources:
   limits:
-    nvidia.com/gpu: "###ZARF_VAR_GPU_LIMIT###"
+    nvidia.com/gpu: ###ZARF_VAR_GPU_LIMIT###
diff --git a/packages/whisper/zarf.yaml b/packages/whisper/zarf.yaml
@@ -16,6 +16,10 @@ variables:
     description: The GPU limit for the model inferencing.
     default: "0"
     pattern: "^[0-9]+$"
+  - name: GPU_CLASS_NAME
+    description: The GPU class name for the model inferencing. Leave blank for CPU-only.
+    default: ""
+    pattern: "^(nvidia)?$"
 
 components:
   - name: whisper-model

diff --git a/pyproject.toml b/pyproject.toml
@@ -13,16 +13,15 @@ license = {file = "LICENSE"}
 dependencies = [  # Dev dependencies needed for all of lfai
     "openai",
     "pip-tools == 7.3.0",
-    "pytest",
-    "pytest-asyncio",
     "httpx",
     "ruff",
-    "python-dotenv",
-    "pytest-asyncio",
-    "requests"
+    "python-dotenv"
 ]
 requires-python = "~=3.11"
 
+[project.optional-dependencies]
+dev = ["locust", "pytest-asyncio", "requests", "requests-toolbelt", "pytest"]
+
 [tool.pip-tools]
 generate-hashes = true
 

diff --git a/src/leapfrogai_api/Makefile b/src/leapfrogai_api/Makefile
@@ -10,6 +10,7 @@ install-api:
 	python -m pip install ../../src/leapfrogai_sdk
 	@cd ${MAKEFILE_DIR} && \
 	python -m pip install -e .
+	python -m pip install "../../.[dev]"
 
 dev-run-api:
 	@cd ${MAKEFILE_DIR} && \

diff --git a/src/leapfrogai_ui/playwright.config.ts b/src/leapfrogai_ui/playwright.config.ts
@@ -72,8 +72,11 @@ const devConfig: PlaywrightTestConfig = {
 // when e2e testing, use the deployed instance
 const CI_Config: PlaywrightTestConfig = {
   use: {
-    baseURL: 'https://ai.uds.dev'
-  }
+    baseURL: 'https://ai.uds.dev',
+    screenshot: 'only-on-failure',
+    video: 'retain-on-failure'
+  },
+  reporter: [['html', { outputFolder: 'e2e-report' }]]
 };
 
 // get the environment type from command line. If none, set it to dev

diff --git a/tests/data/russian.mp3 b/tests/data/russian.mp3
diff --git a/tests/e2e/README.md b/tests/e2e/README.md
@@ -30,7 +30,7 @@ make build-llama-cpp-python
 uds zarf package deploy zarf-package-llama-cpp-python-*.tar.zst
 
 # Install the python dependencies
-python -m pip install "."
+python -m pip install ".[dev]"
 
 # Run the tests!
 # NOTE: Each model backend has its own e2e test files

diff --git a/tests/load/README.md b/tests/load/README.md
@@ -0,0 +1,52 @@
+# LeapfrogAI Load Tests
+
+## Overview
+
+These tests check the API's ability to handle different amounts of load. The tests simulate a specified number of users hitting the endpoints with some number of requests per second.
+
+## Requirements
+
+### Environment Setup
+
+Before running the tests, ensure that your API URL and bearer token are properly configured in your environment variables. Follow these steps:
+
+1. Set the API URL:
+   ```bash
+   export API_URL="https://leapfrogai-api.uds.dev"
+   ```
+
+2. Set the API token:
+   ```bash
+   export BEARER_TOKEN="<your-supabase-jwt-here>"
+   ```
+
+   **Note:** The bearer token should be your Supabase user JWT. For information on generating a JWT, please refer to the [Supabase README.md](../../packages/supabase/README.md). While an API key generated from the LeapfrogAI API endpoint can be used, it will cause the token generation load tests to fail.
+
+3. (Optional) Set the model backend; this defaults to `vllm` if unset:
+   ```bash
+   export DEFAULT_MODEL="llama-cpp-python"
+   ```
+
+## Running the Tests
+
+To start the Locust web interface and run the tests:
+
+1. Install dependencies from the project root.
+   ```bash
+   pip install ".[dev]"
+   ```
+
+2. Navigate to the directory containing `loadtest.py`.
+
+3. Execute the following command:
+   ```bash
+   locust -f loadtest.py --web-port 8089
+   ```
+
+4. Open your web browser and go to `http://0.0.0.0:8089`.
+
+5. Use the Locust web interface to configure and run your tests:
+   - Set the number of users to simulate
+   - Set the spawn rate (users per second)
+   - Choose the host to test against (should match your `API_URL`)
+   - Start the test and monitor results in real-time
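The load-test README configures the tests through three environment variables: `API_URL`, `BEARER_TOKEN`, and an optional `DEFAULT_MODEL` that falls back to `vllm`. The `loadtest.py` source is not part of this diff. The helper below (`LoadTestConfig` and `load_config` are hypothetical names) is a sketch of how those variables might be read and validated before Locust starts. It assumes the token is sent as a standard `Authorization: Bearer` header, and it strips a trailing slash from the URL. Both are illustrative choices, not confirmed behavior of `loadtest.py`.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class LoadTestConfig:
    api_url: str
    bearer_token: str
    default_model: str = "vllm"  # README: defaults to vllm if DEFAULT_MODEL is unset

    @property
    def headers(self) -> Dict[str, str]:
        # Assumption: the Supabase JWT is sent as a standard bearer token.
        return {"Authorization": f"Bearer {self.bearer_token}"}


def load_config(env: Optional[Mapping[str, str]] = None) -> LoadTestConfig:
    """Read the load-test settings, failing fast if a required variable is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ("API_URL", "BEARER_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return LoadTestConfig(
        api_url=env["API_URL"].rstrip("/"),  # illustrative: normalize a trailing slash
        bearer_token=env["BEARER_TOKEN"],
        default_model=env.get("DEFAULT_MODEL") or "vllm",
    )
```

Failing fast on a missing `BEARER_TOKEN` surfaces a configuration error once, rather than as a wall of 401 responses in the Locust dashboard.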