This guide walks you through setting up vLLM Llama models and vLLM Mock models via the TT-Inference-Server, and then deploying them via TT-Studio.
For the complete and up-to-date list of models supported by TT-Studio via TT-Inference-Server, please refer to the TT-Inference-Server GitHub README.
- Docker: Make sure Docker is installed on your system. Follow the Docker installation guide.
- Hugging Face Token: Both models require authentication to Hugging Face repositories. To obtain a token, go to your Hugging Face account and generate a token. Additionally, make sure to accept the terms and conditions on Hugging Face for the desired model(s).
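To sanity-check both prerequisites before continuing, you can run something along these lines; `huggingface-cli` ships with the `huggingface_hub` Python package and is just one way to confirm that your token works after you have logged in with it:
# Confirm Docker is installed and on your PATH
docker --version
# Confirm your Hugging Face token is valid (requires `pip install huggingface_hub` and a prior `huggingface-cli login`)
huggingface-cli whoami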
To run the vLLM Mock model, the high-level steps are:
- Clone repositories
- Pull the mock model Docker image
- Set up the Hugging Face (HF) token
- Deploy and run inference for the model via the GUI
To run a vLLM Llama model, the high-level steps are:
- Clone repositories
- Pull the model Docker image
- Run the model setup script
- Deploy and run inference for the model via the GUI
Start by cloning both the `tt-studio` and `tt-inference-server` repositories.
# Clone tt-studio
git clone https://github.com/tenstorrent/tt-studio
cd tt-studio
# Make the setup script executable
chmod +x startup.sh
# Clone `tt-inference-server` into a separate directory
cd ..
git clone https://github.com/tenstorrent/tt-inference-server
- Navigate to the Docker Images: the available model images are listed in the TT-Inference-Server README.
- Pull the Desired Model Docker Image:
docker pull ghcr.io/tenstorrent/tt-inference-server/<model-image>:<image-tag>
- Authenticate Your Terminal (optional, if the pull command fails):
echo YOUR_PAT | docker login ghcr.io -u YOUR_USERNAME --password-stdin
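As a purely illustrative example, a pull for a Llama 3.1 70B image might look like the line below; the image name and tag here are placeholders, so take the exact values from the TT-Inference-Server README rather than copying this verbatim:
# Hypothetical image name and tag - substitute the real ones from the TT-Inference-Server README
docker pull ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama-3.1-70b-instruct:v0.0.1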
Add the Hugging Face token to the `.env` file in the `tt-studio/app/` directory:
HF_TOKEN=hf_********
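If you prefer doing this from a terminal, a minimal sketch (assuming the `.env` file already exists and you substitute your real token) is:
# Append the Hugging Face token to TT-Studio's app .env file (adjust the path to your checkout)
echo "HF_TOKEN=hf_********" >> tt-studio/app/.env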
Follow these step-by-step instructions to automate the process of setting up model weights.
- Create the `tt_studio_persistent_volume` folder: either create this folder manually inside `tt-studio/`, or run `./startup.sh` from within `tt-studio` to have it created automatically.
- Ensure folder permissions: verify that you (the user) have permission to edit the newly created folder. If not, adjust ownership or permissions using commands like `chmod` or `chown`.
- Navigate to `tt-inference-server`: consult the README to see which model servers are supported by TT-Studio.
- Run the automated setup script:
  - Execute the script: navigate to `tt-inference-server` and run:
    ./setup.sh <Model>
  - Choose how to provide the model. You will see:
    How do you want to provide a model?
    1) Download from 🤗 Hugging Face (default)
    2) Download from Meta
    3) Local folder
    Enter your choice:
    For first-time users, we recommend option 1 (Hugging Face).
  - Set `PERSISTENT_VOLUME_ROOT`: the script will prompt you for a `PERSISTENT_VOLUME_ROOT` path. A default path will be suggested, but do not accept the default. Instead, specify the absolute path to your `tt-studio/tt_studio_persistent_volume` directory to maintain the correct structure; using the default path can lead to incorrect configurations.
  - Validate token and set environment variables. The script will:
    - Validate your Hugging Face token (`HF_TOKEN`).
    - Prompt you for an `HF_HOME` location (default is often `~/.cache/huggingface`).
    - Ask for a JWT secret, which should match the one in `tt-studio/app/.env` (commonly `test-secret-456`).
By following these steps, your tt-inference-server model infrastructure will be correctly configured and ready for inference via the TT-Studio GUI.
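Taken together, a typical first-time run of these steps from the directory that contains both checkouts might look like the sketch below; the paths and model name are illustrative, and the interactive prompts described above (model source, PERSISTENT_VOLUME_ROOT, HF_TOKEN, JWT secret) still have to be answered when the script runs:
# Create the persistent volume folder inside tt-studio (./startup.sh can also create it for you)
mkdir -p tt-studio/tt_studio_persistent_volume
# Make sure your user owns it so the setup script can write the weights
sudo chown -R "$(id -u):$(id -g)" tt-studio/tt_studio_persistent_volume
# Run the automated setup script for the model you want (<Model> is a placeholder)
cd tt-inference-server
./setup.sh <Model>
When prompted for PERSISTENT_VOLUME_ROOT, supply the absolute path to `tt-studio/tt_studio_persistent_volume` rather than the suggested default.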
The setup script creates (or updates) specific directories and files within your `tt_studio_persistent_volume` folder. Here’s what to look for:
- Model Weights Directories: verify that the weights are correctly stored in a directory similar to:
  /path/to/tt-studio/tt_studio_persistent_volume/
  ├── model_envs
  │   └── Llama-3.1-70B-Instruct.env
  └── volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/
      ├── layers_0-4.pth
      ├── layers_5-9.pth
      ├── params.json
      └── tokenizer.model
  - Ensure all expected weight files (e.g., `layers_0-4.pth`, `params.json`, `tokenizer.model`) are present.
  - If any files are missing, re-run the `setup.sh` script to complete the download.
- model_envs Folder: within your `tt_studio_persistent_volume`, you will also find a `model_envs` folder (e.g., `model_envs/Llama-3.1-8B-Instruct.env`).
  - Each `.env` file contains the values you entered during the setup script run (e.g., `HF_TOKEN`, `HF_HOME`, `JWT_SECRET`).
  - Verify that these environment variables match what you entered; if you need to adjust them, re-run the setup process.
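For reference, one of these files will look roughly like the following; the values are illustrative placeholders drawn from the examples above, not real credentials:
# model_envs/Llama-3.1-8B-Instruct.env (illustrative values only)
HF_TOKEN=hf_********
HF_HOME=/home/youruser/.cache/huggingface
JWT_SECRET=test-secret-456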
This folder and file structure allows TT-Studio to automatically recognize and access models without any additional configuration steps.
- Start TT-Studio: Run TT-Studio using the startup command.
- Access Model Weights: In the TT-Studio interface, navigate to the model weights section.
- Select Weights: Choose the desired model weights.
- Run the Model: Start the model and wait for it to initialize.
To view real-time logs from the container, use the following command:
docker logs -f <container_id>
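If you do not know the container ID, listing the running containers will show it; the ID in the first column is what `docker logs` and `docker exec` expect:
# List running containers to find the inference server's container ID
docker ps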
During container initialization, you may encounter log entries like the following, which indicate that the vLLM server has started successfully:
INFO 12-11 08:10:36 tt_executor.py:67] # TT blocks: 2068, # CPU blocks: 0
INFO 12-11 08:10:36 tt_worker.py:66] Allocating kv caches
INFO 12-11 08:10:36 api_server.py:232] vLLM to use /tmp/tmp3ki28i0p as PROMETHEUS_MULTIPROC_DIR
INFO 12-11 08:10:36 launcher.py:19] Available routes are:
INFO 12-11 08:10:36 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 12-11 08:10:36 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 12-11 08:10:36 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 12-11 08:10:36 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 12-11 08:10:36 launcher.py:27] Route: /health, Methods: GET
INFO 12-11 08:10:36 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-11 08:10:36 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-11 08:10:36 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-11 08:10:36 launcher.py:27] Route: /version, Methods: GET
INFO 12-11 08:10:36 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-11 08:10:36 launcher.py:27] Route: /v1/completions, Methods: POST
INFO: Application startup complete.
To access the container's shell for debugging or manual inspection, use the following command:
docker exec -it <container_id> bash
Use `env` to check environment variables, or run commands directly to inspect the environment. To verify that the server is running properly, you can attempt to start it manually by running:
python ***_vllm_api_server.py
This will allow you to check for any startup errors or issues directly from the container's shell.
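Before exercising the chat endpoint, a quick liveness check against the `/health` route listed in the startup logs can be useful; this sketch assumes the server is exposed on the host at port 7000, as in the chat completion example that follows:
# Returns HTTP 200 when the vLLM server is up (port 7000 assumed from the example below)
curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:7000/health"
If the health check passes, you can then send a chat completion request, with $TOKEN set to a valid authorization token for the server: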
curl -s --no-buffer -X POST "http://localhost:7000/v1/chat/completions" -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" -d '{"model":"meta-llama/Llama-3.1-70B-Instruct","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Hi"}]}'
If successful, you will receive a response from the model.
If you need to adjust permissions for the `tt_studio_persistent_volume` folder (replace paths below as necessary), first determine your user and group IDs by running:
id
You will see an output similar to:
uid=1001(youruser) gid=1001(yourgroup) groups=...
Use these numeric IDs to set the correct ownership. For example:
sudo chown -R 1001:1001 /home/youruser/tt-studio/tt_studio_persistent_volume/
Replace `1001:1001` with your actual UID:GID, and `/home/youruser/tt-studio/tt_studio_persistent_volume/` with the path to your persistent volume folder.
With the setup complete, you’re ready to run inference on the vLLM models (or any other supported model(s)) within TT-Studio. Refer to the documentation and setup instructions in the repositories for further guidance.