diff --git a/HowToRun_vLLM_Models.md b/HowToRun_vLLM_Models.md
index fa30371d..8b19c6a5 100644
--- a/HowToRun_vLLM_Models.md
+++ b/HowToRun_vLLM_Models.md
@@ -1,6 +1,10 @@
-# Running Llama3.1-70B and Mock vLLM Models in TT-Studio
+# Running Llama and Mock vLLM Models in TT-Studio

-This guide provides step-by-step instructions on setting up and deploying vLLM Llama3.1-70B and vLLM Mock models using TT-Studio.
+This guide walks you through setting up vLLM Llama models and vLLM Mock models via the TT-Inference-Server, and then deploying them via TT-Studio.
+
+## Supported Models
+
+For the complete and up-to-date list of models supported by TT-Studio via TT-Inference-Server, please refer to the [TT-Inference-Server GitHub README](https://github.com/tenstorrent/tt-inference-server/blob/main/README.md).

 ---

@@ -8,9 +12,8 @@ This guide provides step-by-step instructions on setting up and deploying vLLM L

 1. **Docker**: Make sure Docker is installed on your system. Follow the [Docker installation guide](https://docs.docker.com/engine/install/).

-2. **Hugging Face Token**: Both models require authentication to Hugging Face repositories. To obtain a token, go to [Hugging Face Account](https://huggingface.co/settings/tokens) and generate a token. Additionally; make sure to accept the terms and conditions on Hugging Face for the Llama3.1 models by visiting [Hugging Face Meta-Llama Page](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct).
+2. **Hugging Face Token**: Both models require authentication to Hugging Face repositories. To obtain a token, go to [Hugging Face Account](https://huggingface.co/settings/tokens) and generate a token. Additionally, make sure to accept the terms and conditions on Hugging Face for the desired model(s).

-3. **Model Access Weight**: To access specific models like Llama3.1, you may need to register with Meta to obtain download links for model weights. Visit [Llama Downloads](https://www.llama.com/llama-downloads/) for more information.

 ---

 ## Instructions Overview

@@ -19,15 +22,14 @@ This guide provides step-by-step instructions on setting up and deploying vLLM L
 1. [Clone repositories](#1-clone-required-repositories)
 2. [Pull the mock model Docker image](#2-pull-the-desired-model-docker-images-using-docker-github-registry)
 3. [Set up the Hugging Face (HF) token](#3-set-up-environment-variables-and-hugging-face-token)
-4. [Run the mock vLLM model via the GUI](#7-deploy-and-run-the-model)
+4. [Deploy and run inference for the model via the GUI](#6-deploy-and-run-the-model)

-### **For vLLM Llama3.1-70B Model:**
-1. [Clone repositories](#1-clone-required-repositories)
-2. [Pull the model Docker image](#2-pull-the-desired-model-docker-images-using-docker-github-registry)
-3. [Set up the Hugging Face (HF) token in the TT-Studio `.env` file](#3-set-up-environment-variables-and-hugging-face-token)
-4. [Run the model setup script](#4-run-the-setup-script-vllm-llama31-70b-only)
-5. [Update the vLLM Environment Variable in Environment File](#6-add-the-vllm-environment-variable-in-environment-file--copy-the-file-over-to-tt-studio-persistent-volume)
-6. [Deploy and run inference for the Llama3.1-70B model via the GUI](#7-deploy-and-run-the-model)
+
+### **For vLLM Llama Model(s):**
+1. [Clone repositories](#1-clone-required-repositories)
+2. [Pull the model Docker image](#2-pull-the-desired-model-docker-images-using-docker-github-registry)
+3. [Run the model setup script](#4-run-the-setup-script)
+4.
[Deploy and run inference for the model via the GUI](#6-deploy-and-run-the-model) --- @@ -55,128 +57,106 @@ git clone https://github.com/tenstorrent/tt-inference-server 1. **Navigate to the Docker Images:** - Visit [TT-Inference-Server GitHub Packages](https://github.com/orgs/tenstorrent/packages?repo_name=tt-inference-server). -2. **Pull the Docker Image:** +2. **Pull the Desired Model Docker Image:** ```bash - docker pull ghcr.io/tenstorrent/tt-inference-server: + docker pull ghcr.io/tenstorrent/tt-inference-server/:: ``` -3. **Authenticate Your Terminal (Optional):** +3. **Authenticate Your Terminal (Optional - If Pull Command Fails)):** ```bash echo YOUR_PAT | docker login ghcr.io -u YOUR_USERNAME --password-stdin ``` - + --- + ## 3. Set Up Environment Variables and Hugging Face Token -## 3. Set Up Environment Variables and Hugging Face Token - -Add the Hugging Face Token within the `.env` file in the `tt-studio/app/` directory. - -```bash -HF_TOKEN=hf_******** -``` + Add the Hugging Face Token within the `.env` file in the `tt-studio/app/` directory. + ```bash + HF_TOKEN=hf_******** + ``` --- -## 4. Run the Setup Script (vLLM Llama3.1-70B only) +## 4. Run the Setup Script -Follow these step-by-step instructions for a smooth automated process of model weights setup. +Follow these step-by-step instructions to smoothly automate the process of setting up model weights. -1. **Navigate to the `vllm-tt-metal-llama3-70b/` folder** within the `tt-inference-server`. This folder contains the necessary files and scripts for model setup. +1. **Create the `tt_studio_persistent_volume` folder** + - Either create this folder manually inside `tt-studio/`, or run `./startup.sh` from within `tt-studio` to have it created automatically. -2. **Run the automated setup script** as outlined in the [official documentation](https://github.com/tenstorrent/tt-inference-server/tree/main/vllm-tt-metal-llama3-70b#5-automated-setup-environment-variables-and-weights-files:~:text=70b/docs/development-,5.%20Automated%20Setup%3A%20environment%20variables%20and%20weights%20files,-The%20script%20vllm). This script handles key steps such as configuring environment variables, downloading weight files, repacking weights, and creating directories. +2. **Ensure folder permissions** + - Verify that you (the user) have permission to edit the newly created folder. If not, adjust ownership or permissions using commands like `chmod` or `chown`. -**Note** During the setup process, you will see the following prompt: +3. **Navigate to `tt-inference-server`** + - Consult the [README](https://github.com/tenstorrent/tt-inference-server?tab=readme-ov-file#model-implementations) to see which model servers are supported by TT-Studio. - ``` - Enter your PERSISTENT_VOLUME_ROOT [default: tt-inference-server/tt_inference_server_persistent_volume]: - ``` +4. **Run the automated setup script** - **Do not accept the default path.** Instead, set the persistent volume path to `tt-studio/tt_studio_persistent_volume`. This ensures the configuration matches TT-Studio’s directory structure. Using the default path may result in incorrect configuration. + - **Execute the script** + Navigate to `tt-inference-server`, run: + ```bash + ./setup.sh **Model** + ``` + + - **Choose how to provide the model** + You will see: + ``` + How do you want to provide a model? + 1) Download from πŸ€— Hugging Face (default) + 2) Download from Meta + 3) Local folder + Enter your choice: + ``` + For first-time users, we recommend **option 1** (Hugging Face). 
-By following these instructions, you will have a properly configured model infrastructure, ready for inference and further development. + - **Next Set `PERSISTENT_VOLUME_ROOT`** + The script will prompt you for a `PERSISTENT_VOLUME_ROOT` path. A default path will be suggested, but **do not accept the default**. Instead, specify the **absolute path** to your `tt-studio/tt_studio_persistent_volume` directory to maintain the correct structure. + Using the default path can lead to incorrect configurations. + - **Validate token and set environment variables** + The script will: + 1. Validate your Hugging Face token (`HF_TOKEN`). + 2. Prompt you for an `HF_HOME` location (default is often `~/.cache/huggingface`). + 3. Ask for a JWT secret, which should match the one in `tt-studio/app/.env` (commonly `test-secret-456`). +By following these steps, your tt-inference-server model infrastructure will be correctly configured and ready for inference via the TT-Studio GUI. --- ## 5. Folder Structure for Model Weights -Verify that the weights are correctly stored in the following structure: - -```bash -/path/to/tt-studio/tt_studio_persistent_volume/ -└── volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/ - β”œβ”€β”€ layers_0-4.pth - β”œβ”€β”€ layers_5-9.pth - β”œβ”€β”€ params.json - └── tokenizer.model -``` - -**What to Look For:** -- Ensure all expected weight files (e.g., `layers_0-4.pth`, `params.json`, `tokenizer.model`) are present. -- If any files are missing, re-run the `setup.sh` script to complete the download. - -This folder structure allows TT Studio to automatically recognize and access models without further configuration adjustments. For each model, verify that the weights are correctly copied to this directory to ensure proper access by TT Studio. - - -## 6. Copy the Environment File and Point to it in TT-Studio - -### Step 1: Copy the Environment File -During the model weights download process, an `.env` file will be automatically created. The path to the `.env` file might resemble the following example: - -``` -/path/to/tt-inference-server/vllm-tt-metal-llama3-70b/.env -``` - -To ensure the model can be deployed via the TT-Studio GUI, this `.env` file must be copied to the model's persistent storage location. For example: - -```bash -/path/to/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/copied_env -``` - -The following command can be used as a reference (*replace paths as necessary*): - -```bash -sudo cp /$USR/tt-inference-server/vllm-tt-metal-llama3-70b/.env /$USR/tt_studio/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/.env -``` - -### Step 2: Point to the Copied Environment File -The `VLLM_LLAMA31_ENV_FILE` variable within the TT-Studio `$USR/tt-studio/app/.env` file must point to *this* copied `.env` file. This should be a **relative path**, for example it can be set as follows: - -``` -VLLM_LLAMA31_ENV_FILE="/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/.env" -``` ---- +When using the setup script it creates (or updates) specific directories and files within your `tt_studio_persistent_volume` folder. Here’s what to look for: -### Step 2: Update the TT-Studio Environment File -After copying the `.env` file, update the `VLLM_LLAMA31_ENV_FILE` variable in the `tt-studio/app/.env` file to point to the **copied file path**. This ensures TT-Studio uses the correct environment configuration for the model. +1. 
**Model Weights Directories** + Verify that the weights are correctly stored in a directory similar to: + ```bash + /path/to/tt-studio/tt_studio_persistent_volume/ + β”œβ”€β”€ model_envs + β”‚ └── Llama-3.1-70B-Instruct.env + └── volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/ + β”œβ”€β”€ layers_0-4.pth + β”œβ”€β”€ layers_5-9.pth + β”œβ”€β”€ params.json + └── tokenizer.model -```bash -VLLM_LLAMA31_ENV_FILE="/path/to/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/copied_env" -``` + ``` + - Ensure all expected weight files (e.g., `layers_0-4.pth`, `params.json`, `tokenizer.model`) are present. + - If any files are missing, re-run the `setup.sh` script to complete the download. ---- -Here is an example of a complete `.env` file configuration for reference: +2. **`model_envs` Folder** + Within your `tt_studio_persistent_volume`, you will also find a `model_envs` folder (e.g., `model_envs/Llama-3.1-8B-Instruct.env`). + - Each `.env` file contains the values you input during the setup script run (e.g., `HF_TOKEN`, `HF_HOME`, `JWT_SECRET`). + - Verify that these environment variables match what you entered; if you need to adjust them, re-run the setup process. -```bash -TT_STUDIO_ROOT=/Users/**username**/tt-studio -HOST_PERSISTENT_STORAGE_VOLUME=${TT_STUDIO_ROOT}/tt_studio_persistent_volume -INTERNAL_PERSISTENT_STORAGE_VOLUME=/tt_studio_persistent_volume -BACKEND_API_HOSTNAME="tt-studio-backend-api" -VLLM_LLAMA31_ENV_FILE="/path/to/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/**copied_env -# SECURITY WARNING: keep these secret in production! -JWT_SECRET=test-secret-456 -DJANGO_SECRET_KEY=django-insecure-default -HF_TOKEN=hf_**** -``` +This folder and file structure allows TT-Studio to automatically recognize and access models without any additional configuration steps. --- -## 7. Deploy and Run the Model +## 6. Deploy and Run the Model 1. **Start TT-Studio:** Run TT-Studio using the startup command. 2. **Access Model Weights:** In the TT-Studio interface, navigate to the model weights section. -3. **Select Custom Weights:** Use the custom weights option to select the weights for Llama3.1-70B. +3. **Select Weights:** Select the model weights. 4. **Run the Model:** Start the model and wait for it to initialize. --- @@ -242,6 +222,30 @@ curl -s --no-buffer -X POST "http://localhost:7000/v1/chat/completions" -H "Cont If successful, you will receive a response from the model. +#### iv. Sample Command for Changing Ownership (chown) + +If you need to adjust permissions for the `tt_studio_persistent_volume` folder, first determine your user and group IDs by running: (*replace paths as necessary*) + +```bash +id +``` + +You will see an output similar to: + +``` +uid=1001(youruser) gid=1001(yourgroup) groups=... +``` + +Use these numeric IDs to set the correct ownership. For example: + +```bash +sudo chown -R 1001:1001 /home/youruser/tt-studio/tt_studio_persistent_volume/ +``` + +Replace `1001:1001` with your actual UID:GID and `/home/youruser/tt-studio/tt_studio_persistent_volume/` with the path to your persistent volume folder. + + + ## You're All Set πŸŽ‰ With the setup complete, you’re ready to run inference on the vLLM models (or any other supported model(s)) within TT-Studio. Refer to the documentation and setup instructions in the repositories for further guidance. 
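As a quick pre-deployment sanity check, the following sketch pulls together the verification steps described above (persistent volume layout, `model_envs` file, folder permissions, and the Tenstorrent device node). It is a minimal, hypothetical helper, not part of this repository: the `tt-studio` location under the home directory and the example model name `Llama-3.1-70B-Instruct` are assumptions, so adjust both to match your setup.

```bash
#!/usr/bin/env bash
# Hypothetical pre-deployment check for TT-Studio; paths and model name are assumptions.
set -euo pipefail

TT_STUDIO_ROOT="${TT_STUDIO_ROOT:-$HOME/tt-studio}"      # adjust if cloned elsewhere
PV_DIR="$TT_STUDIO_ROOT/tt_studio_persistent_volume"
MODEL_NAME="${1:-Llama-3.1-70B-Instruct}"                 # pass the model you set up

# 1. Persistent volume must exist and be writable by the current user.
[ -d "$PV_DIR" ] || { echo "Missing $PV_DIR - run ./startup.sh or create it"; exit 1; }
[ -w "$PV_DIR" ] || echo "Warning: $PV_DIR is not writable - see the chown section above"

# 2. The setup script should have written a per-model env file.
ENV_FILE="$PV_DIR/model_envs/$MODEL_NAME.env"
[ -f "$ENV_FILE" ] && echo "Found $ENV_FILE" || echo "Missing $ENV_FILE - re-run setup.sh"

# 3. Weights directories created by setup.sh (exact names vary by model and version).
ls -d "$PV_DIR"/volume_id_* 2>/dev/null || echo "No volume_id_* weights directories found"

# 4. Tenstorrent device node must be visible so it can be mounted into the model container.
ls /dev/tenstorrent* 2>/dev/null || echo "No /dev/tenstorrent device found"
```

If all checks pass, start TT-Studio and select the model weights in the GUI as described above.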
\ No newline at end of file diff --git a/README.md b/README.md index e4a2cf4d..ae1ec056 100644 --- a/README.md +++ b/README.md @@ -6,8 +6,11 @@ TT-Studio enables rapid deployment of TT Inference servers locally and is optimi 1. [Prerequisites](#prerequisites) 2. [Overview](#overview) -3. [Quick Start](#quick-start) - - [For General Users](#for-general-users) +3. [Quick Start](#quick-start) + - [For General Users](#for-general-users) + - Clone the Repository + - Set Up the Model Weights. + - Run the App via `startup.sh` - [For Developers](#for-developers) 4. [Using `startup.sh`](#using-startupsh) - [Basic Usage](#basic-usage) @@ -16,7 +19,7 @@ TT-Studio enables rapid deployment of TT Inference servers locally and is optimi 5. [Documentation](#documentation) - [Frontend Documentation](#frontend-documentation) - [Backend API Documentation](#backend-api-documentation) - - [Running Llama3.1-70B in TT-Studio](#running-llama31-70b-in-tt-studio) + - [Running vLLM Models in TT-Studio]) --- @@ -36,8 +39,11 @@ To set up TT-Studio: git clone https://github.com/tenstorrent/tt-studio.git cd tt-studio ``` +2. **Choose and Set Up the Model**: -2. **Run the Startup Script**: + Select your desired model and configure its corresponding weights by following the instructions in [HowToRun_vLLM_Models.md](./HowToRun_vLLM_Models.md). + +3. **Run the Startup Script**: Run the `startup.sh` script: @@ -47,16 +53,16 @@ To set up TT-Studio: #### See this [section](#command-line-options) for more information on command-line arguments available within the startup script. -3. **Access the Application**: +4. **Access the Application**: The app will be available at [http://localhost:3000](http://localhost:3000). -4. **Cleanup**: +5. **Cleanup**: - To stop and remove Docker services, run: ```bash ./startup.sh --cleanup ``` -5. Running on a Remote Machine +6. Running on a Remote Machine To forward traffic between your local machine and a remote server, enabling you to access the frontend application in your local browser, follow these steps: @@ -70,28 +76,60 @@ To set up TT-Studio: > ⚠️ **Note**: To use Tenstorrent hardware, during the run of `startup.sh` script, select "yes" when prompted to mount hardware. This will automatically configure the necessary settings, eliminating manual edits to docker compose.yml. --- -### For Developers +## Running in Development Mode + +Developers can control and run the app directly via `docker compose`, keeping this running in a terminal allows for hot reload of the frontend app. + +1. **Start the Application**: -Developers can control and run the app directly via `docker compose`, keeping this running in a terminal allows for hot reload of the frontend app. For any backend changes its advisable to re restart the services. + Navigate to the project directory and start the application: -1. **Run in Development Mode**: + ```bash + cd tt-studio/app + docker compose up --build + ``` - ```bash - cd tt-studio/app - docker compose up --build - ``` + Alternatively, run the backend and frontend servers interactively: -2. **Stop the Services**: + ```bash + docker compose up + ``` + + To force a rebuild of Docker images: + + ```bash + docker compose up --build + ``` - ```bash - docker compose down - ``` +2. **Hot Reload & Debugging**: + + #### Frontend + - The frontend supports hot reloading when running inside the `docker compose` environment. + - Ensure that the required lines (**71-73**) in `docker-compose.yml` are uncommented. 
+ + #### Backend + - Local files in `./api` are mounted to `/api` within the container for development. + - Code changes trigger an automatic rebuild and redeployment of the Django server. + - To manually start the Django development server: + + ```bash + ./manage.py runserver 0.0.0.0:8000 + ``` + +3. **Stopping the Services**: + + To shut down the application and remove running containers: + + ```bash + docker compose down + ``` -3. **Using the Mock vLLM Model**: - - For local testing, you can use the `Mock vLLM` model, which spits out random set of characters back . Instructions to run it are [here](HowToRun_vLLM_Models.md) +4. **Using the Mock vLLM Model**: + - For local testing, you can use the `Mock vLLM` model, which generates a random set of characters as output. + - Instructions to run it are available [here](./HowToRun_vLLM_Models.md). -4. **Running on a Machine with Tenstorrent Hardware**: +5. **Running on a Machine with Tenstorrent Hardware**: To run TT-Studio on a device with Tenstorrent hardware, you need to uncomment specific lines in the `app/docker-compose.yml` file. Follow these steps: @@ -159,7 +197,7 @@ If a Tenstorrent device (`/dev/tenstorrent`) is detected, the script will prompt - **Backend API Documentation**: [app/api/README.md](app/api/README.md) Information on the backend API, powered by Django Rest Framework, including available endpoints and integration details. -- **Running vLLM Llama3.1-70B and vLLM Mock Model(s) in TT-Studio**: [HowToRun_vLLM_Models.md](HowToRun_vLLM_Models.md) +- **Running vLLM Model(s) and Mock vLLM Model in TT-Studio**: [HowToRun_vLLM_Models.md](HowToRun_vLLM_Models.md) Step-by-step instructions on how to configure and run the vLLM model(s) using TT-Studio. - **Contribution Guide**: [CONTRIBUTING.md](CONTRIBUTING.md) diff --git a/app/.env.default b/app/.env.default index 6b6f24ba..db81d8fe 100644 --- a/app/.env.default +++ b/app/.env.default @@ -6,4 +6,3 @@ VLLM_LLAMA31_ENV_FILE="" # SECURITY WARNING: keep these secret in production! JWT_SECRET=test-secret-456 DJANGO_SECRET_KEY=django-insecure-default -HF_TOKEN= # Get this from Hugging Face diff --git a/app/api/docker_control/docker_utils.py b/app/api/docker_control/docker_utils.py index c9d1cc56..163f4ce5 100644 --- a/app/api/docker_control/docker_utils.py +++ b/app/api/docker_control/docker_utils.py @@ -35,7 +35,9 @@ def run_container(impl, weights_id): logger.info(f"run_container called for {impl.model_name}") run_kwargs = copy.deepcopy(impl.docker_config) # handle runtime configuration changes to docker kwargs - run_kwargs.update({"devices": get_devices_mounts(impl)}) + device_mounts = get_devices_mounts(impl) + if device_mounts: + run_kwargs.update({"devices": device_mounts}) run_kwargs.update({"ports": get_port_mounts(impl)}) # add bridge inter-container network run_kwargs.update({"network": backend_config.docker_bridge_network_name}) @@ -87,14 +89,18 @@ def get_devices_mounts(impl): device_config = get_runtime_device_configuration(impl.device_configurations) assert isinstance(device_config, DeviceConfigurations) # TODO: add logic to handle multiple devices and multiple containers - # e.g. 
running falcon-7B and mistral-7B on 2x n150 machine - if device_config in {DeviceConfigurations.N150, DeviceConfigurations.E150}: - devices = ["/dev/tenstorrent/0:/dev/tenstorrent/0"] - elif device_config == DeviceConfigurations.N300x4: - devices = ["/dev/tenstorrent:/dev/tenstorrent"] - elif device_config == DeviceConfigurations.CPU: - devices = None - return devices + single_device_mounts = ["/dev/tenstorrent/0:/dev/tenstorrent/0"] + all_device_mounts = ["/dev/tenstorrent:/dev/tenstorrent"] + device_map = { + DeviceConfigurations.E150: single_device_mounts, + DeviceConfigurations.N150: single_device_mounts, + DeviceConfigurations.N150_WH_ARCH_YAML: single_device_mounts, + DeviceConfigurations.N300: single_device_mounts, + DeviceConfigurations.N300x4_WH_ARCH_YAML: all_device_mounts, + DeviceConfigurations.N300x4: all_device_mounts, + } + device_mounts = device_map.get(device_config) + return device_mounts def get_port_mounts(impl): @@ -187,15 +193,19 @@ def get_container_status(): def update_deploy_cache(): data = get_container_status() for con_id, con in data.items(): - model_impl = [ - v - for k, v in model_implmentations.items() - if v.image_version == con["image_name"] - ] - assert ( - len(model_impl) == 1 - ), f"Cannot find model_impl={model_impl} for {con['image_name']}" - model_impl = model_impl[0] + con_model_id = con['env_vars'].get("MODEL_ID") + model_impl = model_implmentations.get(con_model_id) + if not model_impl: + # fallback to finding first impl that uses that container + model_impl = [ + v + for k, v in model_implmentations.items() + if v.image_version == con["image_name"] + ] + assert ( + len(model_impl) == 1 + ), f"Cannot find model_impl={model_impl} for {con['image_name']}" + model_impl = model_impl[0] con["model_id"] = model_impl.model_id con["weights_id"] = con["env_vars"].get("MODEL_WEIGHTS_ID") con["model_impl"] = model_impl diff --git a/app/api/model_control/apps.py b/app/api/model_control/apps.py index b310145a..e7a0b543 100644 --- a/app/api/model_control/apps.py +++ b/app/api/model_control/apps.py @@ -19,4 +19,4 @@ def ready(self): # run once logger.info("Initializing models API") for model_id, impl in model_implmentations.items(): - impl.init_volumes() + impl.setup() diff --git a/app/api/model_control/views.py b/app/api/model_control/views.py index b4b4e1af..7123819e 100644 --- a/app/api/model_control/views.py +++ b/app/api/model_control/views.py @@ -38,7 +38,7 @@ def post(self, request, *args, **kwargs): internal_url = "http://" + deploy["internal_url"] logger.info(f"internal_url:= {internal_url}") logger.info(f"using vllm model:= {deploy["model_impl"].model_name}") - data["model"] = deploy["model_impl"].hf_model_path + data["model"] = deploy["model_impl"].hf_model_id response_stream = stream_response_from_external_api(internal_url, data) return StreamingHttpResponse(response_stream, content_type="text/plain") else: diff --git a/app/api/shared_config/backend_config.py b/app/api/shared_config/backend_config.py index 88cd722c..555b205b 100644 --- a/app/api/shared_config/backend_config.py +++ b/app/api/shared_config/backend_config.py @@ -18,7 +18,6 @@ class BackendConfig: weights_dir: str model_container_cache_root: str jwt_secret: str - hf_token: str # environment variables are ideally terminated on import to fail-fast and provide obvious @@ -34,9 +33,8 @@ class BackendConfig: django_deploy_cache_name="deploy_cache", docker_bridge_network_name="tt_studio_network", weights_dir="model_weights", - model_container_cache_root="/home/user/cache_root", + 
model_container_cache_root="/home/container_app_user/cache_root", jwt_secret=os.environ["JWT_SECRET"], - hf_token=os.environ["HF_TOKEN"], ) # make backend volume if not existing diff --git a/app/api/shared_config/device_config.py b/app/api/shared_config/device_config.py index e4035204..151b1433 100644 --- a/app/api/shared_config/device_config.py +++ b/app/api/shared_config/device_config.py @@ -10,6 +10,9 @@ class DeviceConfigurations(Enum): CPU = auto() E150 = auto() N150 = auto() + N300 = auto() + T3K_RING = auto() + T3K_LINE = auto() N150_WH_ARCH_YAML = auto() N300x4 = auto() N300x4_WH_ARCH_YAML = auto() diff --git a/app/api/shared_config/model_config.py b/app/api/shared_config/model_config.py index e6fa4506..e33c3a33 100644 --- a/app/api/shared_config/model_config.py +++ b/app/api/shared_config/model_config.py @@ -9,6 +9,7 @@ from shared_config.device_config import DeviceConfigurations from shared_config.backend_config import backend_config +from shared_config.setup_config import SetupTypes from shared_config.logger_config import get_logger logger = get_logger(__name__) @@ -16,18 +17,24 @@ def load_dotenv_dict(env_path: Union[str, Path]) -> Dict[str, str]: + if not env_path: + return {} + + # instead, use tt-studio configured JWT_SECRET + exluded_keys = ["JWT_SECRET"] env_path = Path(env_path) if not env_path.exists(): logger.error(f"Env file not found: {env_path}") env_dict = {} + logger.info(f"Using env file: {env_path}") with open(env_path) as f: lines = f.readlines() for line in lines: if line.strip() and not line.startswith('#'): key, value = line.strip().split('=', 1) # expand any $VAR or ${VAR} and ~ - value = os.path.expandvars(value) - env_dict[key] = value + if key not in exluded_keys: + env_dict[key] = value return env_dict @@ -37,28 +44,36 @@ class ModelImpl: Model implementation configuration defines everything known about a model implementations before runtime, e.g. 
not handling ports, available devices""" - model_name: str - model_id: str image_name: str image_tag: str device_configurations: Set["DeviceConfigurations"] docker_config: Dict[str, Any] - user_uid: int # user inside docker container uid (for file permissions) - user_gid: int # user inside docker container gid (for file permissions) - shm_size: str - service_port: int service_route: str + setup_type: SetupTypes + hf_model_id: str = None + model_name: str = None # uses defaults based on hf_model_id + model_id: str = None # uses defaults based on hf_model_id + impl_id: str = "tt-metal" # implementation ID + version: str = "0.0.1" + shm_size: str = "32G" + service_port: int = 7000 env_file: str = "" health_route: str = "/health" - hf_model_path: str = "" def __post_init__(self): + # _init methods compute values that are dependent on other values + self._init_model_name() + self.docker_config.update({"volumes": self.get_volume_mounts()}) self.docker_config["shm_size"] = self.shm_size - self.docker_config["environment"]["HF_MODEL_PATH"] = self.hf_model_path + self.docker_config["environment"]["HF_MODEL_PATH"] = self.hf_model_id self.docker_config["environment"]["HF_HOME"] = Path( backend_config.model_container_cache_root ).joinpath("huggingface") + + # Set environment variable if N150 or N300x4 is in the device configurations + if DeviceConfigurations.N150 in self.device_configurations or DeviceConfigurations.N300x4 in self.device_configurations: + self.docker_config["environment"]["WH_ARCH_YAML"] = "wormhole_b0_80_arch_eth_dispatch.yaml" # Set environment variable if N150_WH_ARCH_YAML or N300x4_WH_ARCH_YAML is in the device configurations if ( @@ -69,12 +84,25 @@ def __post_init__(self): "wormhole_b0_80_arch_eth_dispatch.yaml" ) - if self.env_file: - logger.info(f"Using env file: {self.env_file}") - # env file should be in persistent volume mounted - env_dict = load_dotenv_dict(self.env_file) - # env file overrides any existing docker environment variables - self.docker_config["environment"].update(env_dict) + # model env file must be interpreted here + if not self.env_file: + _env_file = self.get_model_env_file() + else: + _env_file = self.env_file + + # env file should be in persistent volume mounted + env_dict = load_dotenv_dict(_env_file) + # env file overrides any existing docker environment variables + self.docker_config["environment"].update(env_dict) + + # Set environment variable if N150_WH_ARCH_YAML or N300x4_WH_ARCH_YAML is in the device configurations + if ( + DeviceConfigurations.N150_WH_ARCH_YAML in self.device_configurations + or DeviceConfigurations.N300x4_WH_ARCH_YAML in self.device_configurations + ): + self.docker_config["environment"]["WH_ARCH_YAML"] = ( + "wormhole_b0_80_arch_eth_dispatch.yaml" + ) @property def image_version(self) -> str: @@ -115,6 +143,36 @@ def model_container_weights_dir(self) -> Path: def backend_hf_home(self) -> Path: return self.backend_weights_dir.joinpath("huggingface") + def _init_model_name(self): + # Note: ONLY run this in __post_init__ + # need to use __setattr__ because instance is frozen + assert self.hf_model_id or self.model_name, "either hf_model_id or model_name must be set." 
+ if not self.model_name: + # use basename of HF model ID to use same format as tt-transformers + object.__setattr__(self, 'model_name', Path(self.hf_model_id).name) + if not self.model_id: + object.__setattr__(self, 'model_id', self.get_default_model_id()) + if not self.hf_model_id: + logger.info(f"model_name:={self.model_name} does not have a hf_model_id set") + + def get_default_model_id(self): + return f"id_{self.impl_id}-{self.model_name}-v{self.version}" + + def get_model_env_file(self): + ret_env_file = None + model_env_dir_name = "model_envs" + model_env_dir = Path(backend_config.persistent_storage_volume).joinpath(model_env_dir_name) + if model_env_dir.exists(): + env_fname = f"{self.model_name}.env" + model_env_fpath = model_env_dir.joinpath(env_fname) + if model_env_fpath.exists(): + ret_env_file = model_env_fpath + else: + logger.warning(f"for model {self.model_name} env file: {model_env_fpath} does not exist, have you run tt-inference-server setup.sh for the model?") + else: + logger.warning(f"{model_env_dir} does not exist, have you run tt-inference-server setup.sh?") + return ret_env_file + def get_volume_mounts(self): # use type=volume for persistent storage with a Docker managed named volume # target: this should be set to same location as the CACHE_ROOT environment var @@ -128,14 +186,24 @@ def get_volume_mounts(self): } return volume_mounts + def setup(self): + # verify model setup and runtime setup + self.init_volumes() + def init_volumes(self): - # need to make directory in app backend container to allow for correct perimission to be set - self.volume_path.mkdir(parents=True, exist_ok=True) - os.chown(self.volume_path, uid=self.user_uid, gid=self.user_gid) - self.backend_weights_dir.mkdir(parents=True, exist_ok=True) - os.chown(self.backend_weights_dir, uid=self.user_uid, gid=self.user_gid) - # self.backend_hf_home.mkdir(parents=True, exist_ok=True) - # os.chown(self.backend_hf_home, uid=self.user_uid, gid=self.user_gid) + # check volumes + if self.setup_type == SetupTypes.TT_INFERENCE_SERVER: + if self.volume_path.exists(): + logger.info(f"Found {self.volume_path}") + else: + logger.info(f"Model volume does not exist: {self.volume_path}") + logger.error(f"Initialize this model by running the tt-inference-server setup.sh script") + elif self.setup_type == SetupTypes.MAKE_VOLUMES: + if not self.volume_path.exists(): + # if not setup is required for the model, backend can make the volume + self.volume_path.mkdir(parents=True, exist_ok=True) + elif self.setup_type == SetupTypes.NO_SETUP: + logger.info(f"Model {self.model_id} does not require a volume") def asdict(self): return asdict(self) @@ -144,14 +212,12 @@ def asdict(self): def base_docker_config(): return { # Note: mounts and devices are determined in `docker_utils.py` - "user": "user", "auto_remove": True, "cap_add": "ALL", # TODO: add minimal permissions "detach": True, "environment": { "JWT_SECRET": backend_config.jwt_secret, "CACHE_ROOT": backend_config.model_container_cache_root, - "HF_TOKEN": backend_config.hf_token, }, } @@ -166,71 +232,91 @@ def base_docker_config(): image_tag="v0.0.1-tt-metal-65d246482b3f", device_configurations={DeviceConfigurations.N150}, docker_config=base_docker_config(), - user_uid=1000, - user_gid=1000, shm_size="32G", service_port=7000, service_route="/objdetection_v2", + setup_type=SetupTypes.NO_SETUP, ), ModelImpl( + hf_model_id="meta-llama/Llama-3.1-70B-Instruct", model_name="Mock-Llama-3.1-70B-Instruct", model_id="id_mock_vllm_modelv0.0.1", 
image_name="ghcr.io/tenstorrent/tt-inference-server/mock.vllm.openai.api", image_tag="v0.0.1-tt-metal-385904186f81-384f1790c3be", - hf_model_path="meta-llama/Llama-3.1-70B-Instruct", device_configurations={DeviceConfigurations.CPU}, docker_config=base_docker_config(), - user_uid=1000, - user_gid=1000, shm_size="1G", service_port=7000, service_route="/v1/chat/completions", + setup_type=SetupTypes.MAKE_VOLUMES, ), ModelImpl( - model_name="Falcon-7B-Instruct", - model_id="id_tt-metal-falcon-7bv0.0.13", - image_name="tt-metal-falcon-7b", - image_tag="v0.0.13", - device_configurations={DeviceConfigurations.N150_WH_ARCH_YAML}, - hf_model_path="tiiuae/falcon-7b-instruct", - docker_config=base_docker_config(), - user_uid=1000, - user_gid=1000, - shm_size="32G", - service_port=7000, - service_route="/inference/falcon7b", - ), - ModelImpl( - model_name="Llama-3.1-70B-Instruct", - model_id="id_tt-metal-llama-3.1-70b-instructv0.0.1", + hf_model_id="meta-llama/Llama-3.1-70B-Instruct", image_name="ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm", image_tag="v0.0.3-tt-metal-385904186f81-384f1790c3be", - hf_model_path="meta-llama/Llama-3.1-70B-Instruct", device_configurations={DeviceConfigurations.N300x4_WH_ARCH_YAML}, docker_config=base_docker_config(), - user_uid=1000, - user_gid=1000, shm_size="32G", service_port=7000, service_route="/v1/chat/completions", env_file=os.environ.get("VLLM_LLAMA31_ENV_FILE"), + setup_type=SetupTypes.TT_INFERENCE_SERVER, + ), + ModelImpl( + hf_model_id="meta-llama/Llama-3.2-1B-Instruct", + image_name="ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64", + image_tag="v0.0.1-47fb1a2fb6e0-2f33504bad49", + device_configurations={DeviceConfigurations.N300x4}, + docker_config=base_docker_config(), + service_route="/v1/chat/completions", + setup_type=SetupTypes.TT_INFERENCE_SERVER, + ), + ModelImpl( + hf_model_id="meta-llama/Llama-3.2-3B-Instruct", + image_name="ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64", + image_tag="v0.0.1-47fb1a2fb6e0-2f33504bad49", + device_configurations={DeviceConfigurations.N300x4}, + docker_config=base_docker_config(), + service_route="/v1/chat/completions", + setup_type=SetupTypes.TT_INFERENCE_SERVER, + ), + ModelImpl( + hf_model_id="meta-llama/Llama-3.1-8B-Instruct", + image_name="ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64", + image_tag="v0.0.1-47fb1a2fb6e0-2f33504bad49", + device_configurations={DeviceConfigurations.N300x4}, + docker_config=base_docker_config(), + service_route="/v1/chat/completions", + setup_type=SetupTypes.TT_INFERENCE_SERVER, + ), + ModelImpl( + hf_model_id="meta-llama/Llama-3.2-11B-Vision-Instruct", + image_name="ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64", + image_tag="v0.0.1-47fb1a2fb6e0-2f33504bad49", + device_configurations={DeviceConfigurations.N300x4}, + docker_config=base_docker_config(), + service_route="/v1/chat/completions", + setup_type=SetupTypes.TT_INFERENCE_SERVER, + ), + ModelImpl( + hf_model_id="meta-llama/Llama-3.1-70B-Instruct", + image_name="ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64", + image_tag="v0.0.1-47fb1a2fb6e0-2f33504bad49", + device_configurations={DeviceConfigurations.N300x4}, + docker_config=base_docker_config(), + service_route="/v1/chat/completions", + setup_type=SetupTypes.TT_INFERENCE_SERVER, + ), + ModelImpl( + hf_model_id="meta-llama/Llama-3.3-70B-Instruct", + 
image_name="ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64", + image_tag="v0.0.1-47fb1a2fb6e0-2f33504bad49", + device_configurations={DeviceConfigurations.N300x4}, + docker_config=base_docker_config(), + service_route="/v1/chat/completions", + setup_type=SetupTypes.TT_INFERENCE_SERVER, ), #! Add new model vLLM model implementations here - # ModelImpl( - # model_name="", #? Add the model name for the vLLM model based on persistent storage - # model_id="", #? Add the model id for the vLLM model based on persistent storage - # image_name="ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm", - # image_tag="v0.0.1-tt-metal-685ef1303b5a-54b9157d852b", - # hf_model_path="meta-llama/Llama-3.1-70B-Instruct", - # device_configurations={DeviceConfigurations.N300x4}, - # docker_config=base_docker_config(), - # user_uid=1000, - # user_gid=1000, - # shm_size="32G", - # service_port=7000, - # service_route="/inference/**", #? Add the correct route for the vLLM model - # env_file=os.environ.get("VLLM_LLAMA31_ENV_FILE"), - # ) ] def validate_model_implemenation_config(impl): diff --git a/app/api/shared_config/setup_config.py b/app/api/shared_config/setup_config.py new file mode 100644 index 00000000..8d5f272c --- /dev/null +++ b/app/api/shared_config/setup_config.py @@ -0,0 +1,9 @@ +# SPDX-License-Identifier: Apache-2.0 +# +# SPDX-FileCopyrightText: Β© 2025 Tenstorrent AI ULC +from enum import IntEnum, auto + +class SetupTypes(IntEnum): + NO_SETUP = auto() # 1 + MAKE_VOLUMES = auto() # 2 + TT_INFERENCE_SERVER = auto() # 3 diff --git a/app/docker-compose.yml b/app/docker-compose.yml index 76895274..e2ffb8ce 100644 --- a/app/docker-compose.yml +++ b/app/docker-compose.yml @@ -38,9 +38,7 @@ services: - HOST_PERSISTENT_STORAGE_VOLUME - INTERNAL_PERSISTENT_STORAGE_VOLUME - BACKEND_API_HOSTNAME - - VLLM_LLAMA31_ENV_FILE - JWT_SECRET - - HF_TOKEN volumes: # mounting docker unix socket allows for backend container to run docker cmds - /var/run/docker.sock:/var/run/docker.sock diff --git a/startup.sh b/startup.sh index 7eb6e04a..ab86be0a 100755 --- a/startup.sh +++ b/startup.sh @@ -4,6 +4,8 @@ # # SPDX-FileCopyrightText: Β© 2024 Tenstorrent AI ULC +set -euo pipefail # Exit on error, print commands, unset variables treated as errors, and exit on pipeline failure + # Define setup script path SETUP_SCRIPT="./setup.sh" @@ -156,8 +158,16 @@ else exit 1 fi -# Step 2: Source env vars +# Step 2: Source env vars, ensure directories source "${ENV_FILE_PATH}" +# make persistent volume on host user user permissions +if [ ! -d "$HOST_PERSISTENT_STORAGE_VOLUME" ]; then + mkdir "$HOST_PERSISTENT_STORAGE_VOLUME" + if [ $? -ne 0 ]; then + echo "β›” Error: Failed to create directory $HOST_PERSISTENT_STORAGE_VOLUME" + exit 1 + fi +fi # Step 3: Check if the Docker network already exists NETWORK_NAME="tt_studio_network"