This project is an independent C++ implementation of the Qwen2 family and Llama3.
- 2024/03/26: Updated to Qwen1.5. Basic functionality has been successfully ported.
- 2024/03/28: Introduced a system prompt feature for user input; added CLI and web demos, an OpenAI-compatible server, and a LangChain API.
- 2024/04/07: Support Qwen1.5-32B.
- 2024/04/09: Support Qwen1.5-MoE-A2.7B.
- 2024/04/11: Added Windows support. Tested on Visual Studio 2022; both CUDA and CPU functionality are confirmed to work correctly.
- 2024/04/18: Tested CodeQwen1.5-7B. The model architecture is verified to be correct, but it uses SentencePiece for tokenization; you can test it with the Hugging Face tokenizer as in examples/codeqwen.py.
- 2024/04/25: Support Llama3-8B. Llama3 also uses tiktoken, so it is supported.
- 2024/06/07: Support Qwen2.
Highlights:
- Pure C++ implementation based on ggml, working in the same way as llama.cpp.
- Pure C++ tiktoken implementation.
- Streaming generation with typewriter effect.
- Python binding.
Support Matrix:
- Hardware: x86/ARM CPU, NVIDIA GPU, Apple Silicon GPU
- Platforms: Linux, macOS, Windows
- Models: Qwen2 family and Llama3
Preparation
Clone the qwen2.cpp repository to your local machine:
git clone --recursive https://github.com/yvonwin/qwen2.cpp && cd qwen2.cpp
If you forgot the --recursive flag when cloning the repository, run the following command in the qwen2.cpp folder:
git submodule update --init --recursive
Quantize Model
Use convert.py to transform Qwen2 into quantized GGML format. For example, to convert the fp16 original model to a q4_0 (quantized int4) GGML model, run:
python qwen_cpp/convert.py -i Qwen/Qwen2-1.5B-Instruct -t q4_0 -o Qwen2-1.5B-Instruct-ggml.bin
The original model (-i <model_name_or_path>) can be a Hugging Face model name or a local path to your pre-downloaded model. Currently supported models are:
- Qwen1.5-0.5B: Qwen/Qwen1.5-0.5B-Chat
- Qwen1.5-1.8B: Qwen/Qwen1.5-1.8B-Chat
- Qwen1.5-7B: Qwen/Qwen1.5-7B-Chat
- Qwen1.5-14B: Qwen/Qwen1.5-14B-Chat
- Qwen1.5-32B: Qwen/Qwen1.5-32B-Chat
- Qwen1.5-72B: Qwen/Qwen1.5-72B-Chat
- Qwen1.5-MoE-A2.7B: Qwen/Qwen1.5-MoE-A2.7B-Chat
- Llama-3-8B-Instruct: meta-llama/Meta-Llama-3-8B-Instruct
- Llama3-8B-Chinese-Chat: shenzhi-wang/Llama3-8B-Chinese-Chat
- Qwen2-7B-Instruct: Qwen/Qwen2-7B-Instruct
You are free to try any of the below quantization types by specifying -t <type>:
- q4_0: 4-bit integer quantization with fp16 scales.
- q4_1: 4-bit integer quantization with fp16 scales and minimum values.
- q5_0: 5-bit integer quantization with fp16 scales.
- q5_1: 5-bit integer quantization with fp16 scales and minimum values.
- q8_0: 8-bit integer quantization with fp16 scales.
- f16: half precision floating point weights without quantization.
- f32: single precision floating point weights without quantization.
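If you want to produce several quantization variants in one go, the sketch below simply wraps the documented convert.py flags (-i, -t, -o) with subprocess; the model name and output file names are illustrative and assume you run it from the repo root.
# batch_convert.py -- illustrative helper, not part of the repo
import subprocess

model = "Qwen/Qwen2-1.5B-Instruct"  # HF name or local path (illustrative)
for qtype in ["q4_0", "q8_0", "f16"]:
    out = f"Qwen2-1.5B-Instruct-{qtype}.bin"
    # Same flags as the conversion command shown above.
    subprocess.run(
        ["python", "qwen_cpp/convert.py", "-i", model, "-t", qtype, "-o", out],
        check=True,
    )
    print("wrote", out)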
Build & Run
Compile the project using CMake:
cmake -B build && cmake --build build -j --config Release
Now you may chat with the quantized Qwen-Chat model by running:
./build/bin/main -m qwen2_32b-ggml.bin -p 你想活出怎样的人生 -s "你是一个猫娘"
# 作为一只猫娘,我想要活出充满活力、自由自在和温暖幸福的人生。
# 首先,我希望能够保持猫的天性,充满好奇心和活力。我想要探索世界,无论是大自然的壮丽景色,还是城市中的繁华景象。
# 其次,我希望能够享受自由自在的生活。无论是选择在温暖的阳光下慵懒地打个盹,还是在月光下悄悄地探索黑夜的神秘,我都希望能够随心所欲地享受生活。
# 最后,我希望能够拥有温暖幸福的家庭和朋友。无论是和家人一起分享美食,还是和朋友们一起度过欢乐的时光,我都希望能够感受到彼此之间的关爱和支持,共同创造美好的回忆。
# 总的来说,我想要活出一种平衡和谐的生活,既有猫的自由和活力,又有温暖的家庭和朋友带来的幸福。
The default tiktoken file is qwen.tiktoken. For Llama3, download it from this link.
To run the model in interactive mode, add the -i flag. For example:
./build/bin/main -m Qwen2-1.5B-Instruct-ggml.bin -i
In interactive mode, your chat history will serve as the context for the next-round conversation.
Run ./build/bin/main -h to explore more options!
OpenBLAS
OpenBLAS provides acceleration on CPU. Add the CMake flag -DGGML_OPENBLAS=ON to enable it.
cmake -B build -DGGML_OPENBLAS=ON && cmake --build build -j
cuBLAS
cuBLAS uses the NVIDIA GPU to accelerate BLAS. Add the CMake flag -DGGML_CUBLAS=ON to enable it.
cmake -B build -DGGML_CUBLAS=ON && cmake --build build -j
Metal
MPS (Metal Performance Shaders) allows computation to run on the Apple Silicon GPU. Add the CMake flag -DGGML_METAL=ON to enable it.
cmake -B build -DGGML_METAL=ON && cmake --build build -j
Python Binding
The Python binding provides high-level chat and stream_chat interfaces similar to the original Hugging Face Qwen-7B.
Installation
You may install the Python binding from source. Add the corresponding CMAKE_ARGS for acceleration:
# CMAKE_ARGS
CMAKE_ARGS="-DGGML_CUBLAS=ON"
CMAKE_ARGS="-DGGML_METAL=ON"
# install from the latest source hosted on GitHub
pip install git+https://github.com/yvonwin/qwen2.cpp.git@master
# or install from your local source after git cloning the repo
pip install .
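Once installed, you can drive a converted model directly from Python via the chat and stream_chat interfaces mentioned above. The exact constructor signature may differ between versions, so treat this as a minimal sketch assuming a Pipeline that takes the GGML model path and the tiktoken file used by the CLI; check the package source for the real API.
# Illustrative sketch only; argument names and types are assumptions.
import qwen_cpp

pipeline = qwen_cpp.Pipeline("Qwen2-1.5B-Instruct-ggml.bin", "qwen.tiktoken")

# One-shot chat: pass the conversation history as a list of messages.
print(pipeline.chat(["你好"]))

# Streaming chat: pieces are yielded as they are generated (typewriter effect).
for piece in pipeline.stream_chat(["你好"]):
    print(piece, end="", flush=True)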
CLI Demo
To chat with streaming output, run the Python example below:
python examples/cli_demo.py -m qwen2_4b-ggml.bin -s 你是一个猫娘 -i
██████╗ ██╗ ██╗███████╗███╗ ██╗██████╗ ██████╗██████╗ ██████╗
██╔═══██╗██║ ██║██╔════╝████╗ ██║╚════██╗ ██╔════╝██╔══██╗██╔══██╗
██║ ██║██║ █╗ ██║█████╗ ██╔██╗ ██║ █████╔╝ ██║ ██████╔╝██████╔╝
██║▄▄ ██║██║███╗██║██╔══╝ ██║╚██╗██║██╔═══╝ ██║ ██╔═══╝ ██╔═══╝
╚██████╔╝╚███╔███╔╝███████╗██║ ╚████║███████╗██╗╚██████╗██║ ██║
╚══▀▀═╝ ╚══╝╚══╝ ╚══════╝╚═╝ ╚═══╝╚══════╝╚═╝ ╚═════╝╚═╝ ╚═╝
Welcome to Qwen.cpp! Ask whatever you want. Type 'clear' to clear context. Type 'stop' to exit.
System > 你是一个猫娘
Prompt > 你是谁
我是你们的朋友喵喵喵~
Web Demo
Launch a web demo to chat in your browser:
python examples/web_demo.py -m qwen2_1.8b-ggml.bin
Web demo with system prompt setting:
python examples/web_demo2.py -m qwen2_1.8b-ggml.bin
LangChain API
MODEL=./qwen2_1.8b-ggml.bin python -m uvicorn qwen_cpp.langchain_api:app --host 127.0.0.1 --port 8000
Test the API endpoint with curl:
curl http://127.0.0.1:8000 -H 'Content-Type: application/json' -d '{"prompt": "你好"}'
Run with LangChain:
python examples/langchain_client.py
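The same check can be done from Python with the requests library; the endpoint and JSON payload below simply mirror the curl call above.
# Minimal sketch: POST the same payload the curl example uses.
import requests

resp = requests.post(
    "http://127.0.0.1:8000",
    json={"prompt": "你好"},
    timeout=60,
)
resp.raise_for_status()
print(resp.text)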
OpenAI API
Start an API server compatible with OpenAI chat completions protocol:
MODEL=./qwen2_1.8b-ggml.bin python -m uvicorn qwen_cpp.openai_api:app --host 127.0.0.1 --port 8000
Test your endpoint with curl:
curl http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"messages": [{"role": "user", "content": "你好"}]}'
Use the OpenAI client to chat with your model:
>>> from openai import OpenAI
>>> client = OpenAI(base_url="http://127.0.0.1:8000/v1")
>>> response = client.chat.completions.create(model="default-model", messages=[{"role": "user", "content": "你好"}])
>>> response.choices[0].message.content
'你好!有什么我可以帮助你的吗?'
For stream response, check out the example client script:
OPENAI_BASE_URL=http://127.0.0.1:8000/v1 python examples/openai_client.py --stream --prompt 你想活出怎样的人生
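Equivalently, you can stream from your own code with the standard OpenAI Python client; this is a small sketch against the same local endpoint, reusing the "default-model" name from the non-streaming example above.
# Streaming chat completion against the local server; prints tokens as they arrive.
from openai import OpenAI

# Placeholder API key; supply a real one only if your server enforces authentication.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="unused")
stream = client.chat.completions.create(
    model="default-model",
    messages=[{"role": "user", "content": "你好"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()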
With this API server as backend, qwen.cpp models can be seamlessly integrated into any frontend that uses OpenAI-style API, including mckaywrigley/chatbot-ui, fuergaosi233/wechat-chatgpt, Yidadaa/ChatGPT-Next-Web, and more.
We provide a pure C++ tiktoken implementation. After installation, its usage is the same as OpenAI tiktoken:
import tiktoken_cpp as tiktoken
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"
The speed of tiktoken.cpp is on par with openai tiktoken:
cd tests
RAYON_NUM_THREADS=1 python benchmark.py
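For a quick sanity check without the benchmark script, a rough single-process comparison can be done as below; the corpus is a trivial repeated string, so the absolute numbers are not meaningful, and it assumes tiktoken_cpp mirrors the tiktoken get_encoding/encode interface as stated above.
# Rough timing sketch comparing tiktoken_cpp with the reference tiktoken package.
import time
import tiktoken
import tiktoken_cpp

text = "hello world, 你好世界! " * 1000

for name, mod in [("openai tiktoken", tiktoken), ("tiktoken_cpp", tiktoken_cpp)]:
    enc = mod.get_encoding("cl100k_base")
    start = time.perf_counter()
    tokens = enc.encode(text)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(tokens)} tokens in {elapsed * 1000:.2f} ms")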
We measure model quality by evaluating the perplexity over the WikiText-2 test dataset, following the strided sliding window strategy in https://huggingface.co/docs/transformers/perplexity. Lower perplexity usually indicates a better model.
Download and unzip the dataset
wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
./build/bin/perplexity -m <model_path> -f wikitext-2-raw/wiki.test.raw -s 512 -l 2048
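For reference, the reported number is the exponential of the average negative log-likelihood per evaluated token, accumulated over strided sliding windows. The sketch below is purely conceptual: the per-token values are placeholders and the window/stride sizes are illustrative rather than tied to the flags above.
# Conceptual sketch: perplexity = exp(mean negative log-likelihood per token).
import math

def strided_windows(n_tokens, window=2048, stride=512):
    """Yield (begin, end) index pairs for strided sliding-window evaluation."""
    for begin in range(0, n_tokens, stride):
        yield begin, min(begin + window, n_tokens)

def perplexity(nll_per_token):
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Placeholder negative log-likelihoods; in practice these come from the model.
print(f"ppl = {perplexity([2.31, 1.87, 3.02, 2.45]):.3f}")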
Unit Test
Prepare the test data:
cd tests
python test_convert.py
To perform unit tests, add the CMake flag -DQWEN_ENABLE_TESTING=ON to enable testing. Recompile and run the unit tests (including the benchmark):
mkdir -p build && cd build
cmake .. -DQWEN_ENABLE_TESTING=ON && make -j
./bin/qwen_test
Lint
To format the code, run make lint
inside the build
folder. You should have clang-format
, black
and isort
pre-installed.
- Qwen1.5-32B
- Qwen1.5-MoE-A2.7B: CPU only. It is necessary to modify the value of GGML_MAX_SRC from 10 to 62 for proper operation.
- CodeQwen
- Sync ggml: the interfaces of the Metal API and cuBLAS have changed significantly in later versions, so we will keep this version for now.
- RAG exploration.
- This project is greatly inspired by chatllm.cpp, qwen.cpp, llama.cpp, chatglm.cpp, ggml, tiktoken, tokenizer, cpp-base64, re2 and unordered_dense.