qwen2.cpp

中文版 (Chinese version)

This project is an independent C++ implementation of the Qwen2 model family and Llama3.

Updates

  • 2024/03/26 Updated to Qwen1.5. Basic functionality has been successfully ported.
  • 2024/03/28 Introduced a system prompt feature for user input; added CLI and web demos, an OpenAI-compatible server, and a LangChain API.
  • 2024/04/07 Added support for Qwen1.5-32B.
  • 2024/04/09 Added support for Qwen1.5-MoE-A2.7B.
  • 2024/04/11 Added Windows support. Tested with Visual Studio 2022; both CUDA and CPU builds are confirmed to work correctly.
  • 2024/04/18 Tested CodeQwen1.5-7B: the model architecture is verified to be correct, but it uses SentencePiece for tokenization. You can test it with a Hugging Face tokenizer as in examples/codeqwen.py.
  • 2024/04/25 Added support for Llama3-8B. Llama3 also uses tiktoken, so it is supported.
  • 2024/06/07 Added support for Qwen2.

Features

Highlights:

  • Pure C++ implementation based on ggml, working in the same way as llama.cpp.
  • Pure C++ tiktoken implementation.
  • Streaming generation with typewriter effect.
  • Python binding.

Support Matrix:

  • Hardware: x86/ARM CPU, NVIDIA GPU, Apple Silicon GPU
  • Platforms: Linux, macOS, Windows
  • Models: Qwen2 family and Llama3

Test in Colab

Open In Colab

Getting Started

Preparation

Clone the qwen2.cpp repository to your local machine:

git clone --recursive https://github.com/yvonwin/qwen2.cpp && cd qwen2.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the qwen2.cpp folder:

git submodule update --init --recursive

Quantize Model

Use convert.py to transform a Qwen2 model into quantized GGML format. For example, to convert the fp16 original model to a q4_0 (quantized int4) GGML model, run:

python qwen_cpp/convert.py -i Qwen/Qwen2-1.5B-Instruct -t q4_0 -o Qwen2-1.5B-Instruct-ggml.bin

The original model (-i <model_name_or_path>) can be a HuggingFace model name or a local path to your pre-downloaded model. Currently supported models are:

  • Qwen1.5-0.5B: Qwen/Qwen1.5-0.5B-Chat
  • Qwen1.5-1.8B: Qwen/Qwen1.5-1.8B-Chat
  • Qwen1.5-7B: Qwen/Qwen1.5-7B-Chat
  • Qwen1.5-14B: Qwen/Qwen1.5-14B-Chat
  • Qwen1.5-32B: Qwen/Qwen1.5-32B-Chat
  • Qwen1.5-72B: Qwen/Qwen1.5-72B-Chat
  • Qwen1.5-MoeA2.7B: Qwen/Qwen1.5-MoE-A2.7B-Chat
  • Llama-3-8B-Instruct: meta-llama/Meta-Llama-3-8B-Instruct
  • Llama3-8B-Chinese-Chat: shenzhi-wang/Llama3-8B-Chinese-Chat
  • Qwen2-7B-Instruct: Qwen/Qwen2-7B-Instruct

You are free to try any of the quantization types below by specifying -t <type>; a toy sketch of what block quantization does follows the list:

  • q4_0: 4-bit integer quantization with fp16 scales.
  • q4_1: 4-bit integer quantization with fp16 scales and minimum values.
  • q5_0: 5-bit integer quantization with fp16 scales.
  • q5_1: 5-bit integer quantization with fp16 scales and minimum values.
  • q8_0: 8-bit integer quantization with fp16 scales.
  • f16: half precision floating point weights without quantization.
  • f32: single precision floating point weights without quantization.
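
For intuition about these formats, here is a toy NumPy sketch of q4_0-style block quantization: 32 weights are stored as one fp16 scale plus 32 small integers. This is only an illustration; the exact ggml rounding and on-disk layout differ.

import numpy as np

def quantize_q4_0_block(x):
    # toy version: 32 float weights -> one fp16 scale + 32 signed 4-bit integers
    assert x.shape == (32,)
    max_abs = float(np.max(np.abs(x)))
    scale = np.float16(max_abs / 7.0) if max_abs > 0 else np.float16(1.0)
    q = np.clip(np.round(x / np.float32(scale)), -8, 7).astype(np.int8)
    return scale, q

def dequantize_q4_0_block(scale, q):
    return np.float32(scale) * q.astype(np.float32)

x = np.random.randn(32).astype(np.float32)
scale, q = quantize_q4_0_block(x)
print(np.max(np.abs(x - dequantize_q4_0_block(scale, q))))  # per-block quantization error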

Build & Run

Compile the project using CMake:

cmake -B build && cmake --build build -j --config Release

Now you may chat with the quantized Qwen-Chat model by running:

./build/bin/main -m qwen2_32b-ggml.bin  -p 你想活出怎样的人生 -s "你是一个猫娘"
# 作为一只猫娘,我想要活出充满活力、自由自在和温暖幸福的人生。
# 首先,我希望能够保持猫的天性,充满好奇心和活力。我想要探索世界,无论是大自然的壮丽景色,还是城市中的繁华景象。
# 其次,我希望能够享受自由自在的生活。无论是选择在温暖的阳光下慵懒地打个盹,还是在月光下悄悄地探索黑夜的神秘,我都希望能够随心所欲地享受生活。
# 最后,我希望能够拥有温暖幸福的家庭和朋友。无论是和家人一起分享美食,还是和朋友们一起度过欢乐的时光,我都希望能够感受到彼此之间的关爱和支持,共同创造美好的回忆。
# 总的来说,我想要活出一种平衡和谐的生活,既有猫的自由和活力,又有温暖的家庭和朋友带来的幸福。

The default tiktoken file is qwen.tiktoken. For Llama3, download it from this link.

To run the model in interactive mode, add the -i flag. For example:

./build/bin/main -m Qwen2-1.5B-Instruct-ggml.bin  -i

In interactive mode, your chat history will serve as the context for the next-round conversation.

Run ./build/bin/main -h to explore more options!

Using BLAS

OpenBLAS

OpenBLAS provides acceleration on CPU. Add the CMake flag -DGGML_OPENBLAS=ON to enable it.

cmake -B build -DGGML_OPENBLAS=ON && cmake --build build -j

cuBLAS

cuBLAS uses NVIDIA GPU to accelerate BLAS. Add the CMake flag -DGGML_CUBLAS=ON to enable it.

cmake -B build -DGGML_CUBLAS=ON && cmake --build build -j

Metal

MPS (Metal Performance Shaders) allows computation to run on powerful Apple Silicon GPU. Add the CMake flag -DGGML_METAL=ON to enable it.

cmake -B build -DGGML_METAL=ON && cmake --build build -j

Python Binding

The Python binding provides high-level chat and stream_chat interfaces similar to the original Hugging Face Qwen-7B; a minimal usage sketch follows the installation steps below.

Installation

Install from source, adding the corresponding CMAKE_ARGS for acceleration:

# optionally set CMAKE_ARGS to enable GPU acceleration, e.g.
export CMAKE_ARGS="-DGGML_CUBLAS=ON"   # cuBLAS (NVIDIA GPU)
# export CMAKE_ARGS="-DGGML_METAL=ON"  # Metal (Apple Silicon)
# install from the latest source hosted on GitHub
pip install git+https://github.com/yvonwin/qwen2.cpp.git@master
# or install from your local source after git cloning the repo
pip install .
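
After installation, a minimal usage sketch of the binding. The Pipeline constructor arguments and the chat/stream_chat signatures below are assumptions based on the upstream qwen.cpp binding; see examples/cli_demo.py for the authoritative usage.

import qwen_cpp

# model and tiktoken paths are illustrative; point them at your converted files
pipeline = qwen_cpp.Pipeline("Qwen2-1.5B-Instruct-ggml.bin", "qwen.tiktoken")
print(pipeline.chat(["你好"]))                 # single-shot chat over a message history
for piece in pipeline.stream_chat(["你好"]):   # streaming generation, typewriter style
    print(piece, end="", flush=True)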

CLI Demo

To chat with streaming output, run the Python example below:

python examples/cli_demo.py -m qwen2_4b-ggml.bin -s 你是一个猫娘 -i
 ██████╗ ██╗    ██╗███████╗███╗   ██╗██████╗     ██████╗██████╗ ██████╗ 
██╔═══██╗██║    ██║██╔════╝████╗  ██║╚════██╗   ██╔════╝██╔══██╗██╔══██╗
██║   ██║██║ █╗ ██║█████╗  ██╔██╗ ██║ █████╔╝   ██║     ██████╔╝██████╔╝
██║▄▄ ██║██║███╗██║██╔══╝  ██║╚██╗██║██╔═══╝    ██║     ██╔═══╝ ██╔═══╝ 
╚██████╔╝╚███╔███╔╝███████╗██║ ╚████║███████╗██╗╚██████╗██║     ██║     
 ╚══▀▀═╝  ╚══╝╚══╝ ╚══════╝╚═╝  ╚═══╝╚══════╝╚═╝ ╚═════╝╚═╝     ╚═╝     
                                                                           

Welcome to Qwen.cpp! Ask whatever you want. Type 'clear' to clear context. Type 'stop' to exit.

System > 你是一个猫娘
Prompt > 你是谁
我是你们的朋友喵喵喵~

Web Demo

Launch a web demo to chat in your browser:

python examples/web_demo.py -m qwen2_1.8b-ggml.bin

web_demo

Web demo with system prompt setting:

python examples/web_demo2.py -m qwen2_1.8b-ggml.bin

web_demo2

API Server

LangChain API

MODEL=./qwen2_1.8b-ggml.bin python -m uvicorn qwen_cpp.langchain_api:app --host 127.0.0.1 --port 8000

Test the API endpoint with curl:

curl http://127.0.0.1:8000 -H 'Content-Type: application/json' -d '{"prompt": "你好"}'
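
The same request can also be issued from Python. A minimal sketch using the requests library, mirroring the curl call above (the response is just printed, since its exact shape depends on the server):

import requests

# endpoint and payload are taken from the curl example above
resp = requests.post("http://127.0.0.1:8000", json={"prompt": "你好"})
print(resp.json())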

Run with LangChain:

python examples/langchain_client.py

OpenAI API

Start an API server compatible with OpenAI chat completions protocol:

MODEL=./qwen2_1.8b-ggml.bin python -m uvicorn qwen_cpp.openai_api:app --host 127.0.0.1 --port 8000

Test your endpoint with curl:

curl http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
    -d '{"messages": [{"role": "user", "content": "你好"}]}'

Use the OpenAI client to chat with your model:

>>> from openai import OpenAI
>>> client = OpenAI(base_url="http://127.0.0.1:8000/v1")
>>> response = client.chat.completions.create(model="default-model", messages=[{"role": "user", "content": "你好"}])
>>> response.choices[0].message.content
'你好!有什么我可以帮助你的吗?'

For streaming responses, check out the example client script:

OPENAI_BASE_URL=http://127.0.0.1:8000/v1 python examples/openai_client.py --stream --prompt 你想活出怎样的人生
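
Alternatively, a minimal streaming sketch using the openai Python client; the model name and api_key value are placeholders that the server may ignore:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")  # api_key is a placeholder
stream = client.chat.completions.create(
    model="default-model",
    messages=[{"role": "user", "content": "你好"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
print()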

With this API server as the backend, qwen2.cpp models can be seamlessly integrated into any frontend that uses an OpenAI-style API, including mckaywrigley/chatbot-ui, fuergaosi233/wechat-chatgpt, Yidadaa/ChatGPT-Next-Web, and more.

tiktoken.cpp

We provide a pure C++ tiktoken implementation. After installation, the usage is the same as OpenAI's tiktoken:

import tiktoken_cpp as tiktoken
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

The speed of tiktoken.cpp is on par with OpenAI's tiktoken; to benchmark it yourself:

cd tests
RAYON_NUM_THREADS=1 python benchmark.py

Model Quality

We measure model quality by evaluating the perplexity over the WikiText-2 test dataset, following the strided sliding window strategy in https://huggingface.co/docs/transformers/perplexity. Lower perplexity usually indicates a better model.
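
A rough Python sketch of that strided sliding-window evaluation is shown below. The nll_fn model call is a stand-in for the actual scoring (the C++ perplexity tool below does the real work); it is assumed to return the summed negative log-likelihood over the last target_len tokens of a window.

import math

def sliding_window_ppl(token_ids, max_length, stride, nll_fn):
    # Score overlapping windows of max_length tokens, but only count tokens
    # not already scored by the previous window, so each token is predicted
    # with as much left context as possible.
    total_nll, total_tokens, prev_end = 0.0, 0, 0
    for begin in range(0, len(token_ids), stride):
        end = min(begin + max_length, len(token_ids))
        target_len = end - prev_end                       # tokens newly scored in this window
        total_nll += nll_fn(token_ids[begin:end], target_len)
        total_tokens += target_len
        prev_end = end
        if end == len(token_ids):
            break
    return math.exp(total_nll / total_tokens)             # perplexity = exp(mean NLL)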

Download and unzip the dataset, then run the perplexity tool:

wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
./build/bin/perplexity -m <model_path> -f wikitext-2-raw/wiki.test.raw -s 512 -l 2048

Development

Unit Test

Prepare the test data:

cd tests 
python test_convert.py

To perform unit tests, add the CMake flag -DQWEN_ENABLE_TESTING=ON, recompile, and run the unit tests (benchmark included):

mkdir -p build && cd build
cmake .. -DQWEN_ENABLE_TESTING=ON && make -j
./bin/qwen_test

Lint

To format the code, run make lint inside the build folder. You should have clang-format, black and isort pre-installed.

TODO

  • Qwen1.5-32B
  • Qwen1.5-MoE-A2.7B: CPU only. It's necessary to change GGML_MAX_SRC from 10 to 62 for proper operation.
  • CodeQwen
  • Sync ggml: the Metal and cuBLAS interfaces have changed significantly in later ggml versions, so we will stay on this version for now.
  • Explore RAG.

Acknowledgements