Add docker container support #1271

Add cuda bundled README.md
Sing-Li authored Nov 15, 2023

commit a00b4e3617f21f146c19e602382aa8609c6d8c63
docker/cuda/cuda121bundled/README.md (new file, 60 additions)
### Combining a model's weights and library into a mega-sized container image

#### Prerequisites

These containers will ONLY run on:

* systems with NVIDIA GPUs supporting the specified CUDA version(s)
* systems with the [NVIDIA Container Runtime](https://developer.nvidia.com/nvidia-container-runtime) installed and tested

They will not run on systems without a supported NVIDIA GPU and runtime.
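
To sanity-check the second prerequisite, you can run a throwaway CUDA container and confirm the GPU is visible (a minimal check; the exact `nvidia/cuda` image tag below is an assumption, and any CUDA 12.1 base image will do):

```
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```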


#### Quick start

Before building the image, you MUST place the model's weights and library into the `fixedmodel` directory:

1. Place the weights into the `fixedmodel/dist/prebuilt` folder (the default example is Llama-2-7b-chat-hf-q4f16_1; the `mlc-chat-Llama-2-7b-chat-hf-q4f16_1` repo from Hugging Face should be `git clone`d here)
2. Place the library into the `fixedmodel/dist/prebuilt/lib` folder (for CUDA 12.1 and Llama 2 7B q4f16_1 this is `Llama-2-7b-chat-hf-q4f16_1-cuda.so`), as sketched below
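
For the default Llama 2 example, the two steps above look roughly like this (a sketch only; the Hugging Face repo path under the `mlc-ai` organization and the source location of the compiled `.so` are assumptions, so adjust them to your model):

```
cd docker/cuda/cuda121bundled
mkdir -p fixedmodel/dist/prebuilt/lib

# Step 1: clone the prebuilt weights (assumed to live under the mlc-ai org)
git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1 \
    fixedmodel/dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1

# Step 2: copy the compiled model library (source path is hypothetical)
cp /path/to/Llama-2-7b-chat-hf-q4f16_1-cuda.so fixedmodel/dist/prebuilt/lib/
```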

Build the image:

```
cd cuda121bundled
sh ./buildimage.sh
```

This will create the bundled `llama2cuda121` image. You can then push it to your local or public registry for deployment, as shown below.
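
For example, to push the image to a registry (the registry host here is a placeholder; substitute your own):

```
docker tag llama2cuda121:v0.1 registry.example.com/llama2cuda121:v0.1
docker push registry.example.com/llama2cuda121:v0.1
```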

To run it, use:

```
docker run --gpus all --network host --rm llama2cuda121:v0.1
```

By default, the container listens on port `8000` of the host machine for REST API calls.
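
As a quick smoke test, you can hit the server directly with `curl` (a sketch assuming the container exposes an OpenAI-style `/v1/chat/completions` route; check the test script for the exact endpoints your build serves):

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a haiku about the sea."}]}'
```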

You can use the Python script in the `test` subfolder to test your container:

```
$ python sample_client_for-testing.py
Without streaming:
Of course! Here is a haiku for you:

Sun sets slowly down
Golden hues upon the sea
Peaceful evening sky

Reset chat: <Response [200]>

With streaming:
Of course! Here is a haiku for you:

Respectful and true
Honest and helpful assistant
Serving with grace

Runtime stats: prefill: 259.2 tok/s, decode: 145.6 tok/s
```

See the `test` subfolder for the full test scripts.