EETQ

EETQ (Easy & Efficient Quantization for Transformers) is a quantization tool for transformer models.

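Features

New feature 🔥: a gemv operator that improves performance by 10%~30%.
High-performance INT8 weight-only post-training quantization operators
- High-performance GEMM kernels extracted from FasterTransformer, making them easier to integrate into your project
- No quantization-aware training required
Attention inference optimized with Flash-Attention V2
Easy to use: a single line of code adapts your PyTorch model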
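
To make "INT8 weight-only post-training quantization" concrete, here is a minimal sketch in plain Python, assuming symmetric quantization with one scale per output row. It is an illustration only: EETQ's real kernels come from FasterTransformer and run on the GPU, and their layout and rounding may differ. The names `quantize_rows_int8` and `dequantize_rows` are hypothetical and not part of EETQ.

```python
from typing import List, Tuple


def quantize_rows_int8(weight: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-row INT8 quantization of a 2-D weight matrix (illustrative only)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # One scale per row; fall back to 1.0 so an all-zero row does not divide by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate floating-point weights from INT8 values and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weight = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
q, s = quantize_rows_int8(weight)
restored = dequantize_rows(q, s)
print(q)
print(restored)
```

The scales are computed directly from the trained weights, which is why this style of quantization needs no quantization-aware training.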

Getting started

Environment

cuda: >=11.4
python: >=3.8
gcc: >=7.4.0
torch: >=1.14.0
transformers: >=4.27.0
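
One way to check the Python package minimums listed above before installing is sketched below. It uses only the standard library; the helper names (`parse_version`, `meets_minimum`, `check_package`) are hypothetical, and the version parsing is deliberately simple, so treat it as a rough check rather than a full version-specifier implementation. It does not check cuda or gcc.

```python
from importlib import metadata
from typing import Optional, Tuple

# Minimum versions copied from the environment list above.
MINIMUMS = {"torch": "1.14.0", "transformers": "4.27.0"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.1.0+cu118' into (2, 1, 0); parsing stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name: str, minimum: str) -> Optional[bool]:
    """Return None if the package is not installed, else whether it is new enough."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
    return meets_minimum(installed, minimum)


if __name__ == "__main__":
    for name, minimum in MINIMUMS.items():
        print(name, ">=", minimum, "->", check_package(name, minimum))
```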

Installation

Using the provided Dockerfile is recommended.

$ git clone https://github.com/NetEase-FuXi/EETQ.git
$ cd EETQ/
$ git submodule update --init --recursive
$ pip install .

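If your machine has less than 96GB of memory and many CPU cores, Ninja may launch too many parallel compile jobs and exhaust the available memory. To limit the number of parallel compile jobs, set the environment variable MAX_JOBS: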

$ MAX_JOBS=4 pip install .
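
If you would rather derive a MAX_JOBS value than guess one, a rough heuristic is to cap the job count by both the CPU count and a per-job memory budget. The sketch below assumes 8 GB per compile job, which is a placeholder and not a measured figure for EETQ; the function `suggest_max_jobs` is hypothetical.

```python
def suggest_max_jobs(total_mem_gb: float, cpu_count: int, gb_per_job: float = 8.0) -> int:
    """Cap parallel compile jobs by CPU count and by an assumed memory budget per job."""
    if total_mem_gb <= 0 or cpu_count <= 0 or gb_per_job <= 0:
        raise ValueError("total_mem_gb, cpu_count and gb_per_job must all be positive")
    by_memory = int(total_mem_gb // gb_per_job)
    # Always allow at least one job, and never more jobs than CPU cores.
    return max(1, min(cpu_count, by_memory))


# Example: a 32 GB machine with 64 cores, using the assumed 8 GB per job.
jobs = suggest_max_jobs(total_mem_gb=32, cpu_count=64)
print(f"MAX_JOBS={jobs} pip install .")
```

Lower `gb_per_job` if builds finish with memory to spare, or raise it if the build is still killed for running out of memory.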

Usage

Quantize a torch model

import torch.nn as nn
from eetq.utils import eet_quantize

# torch_model: a PyTorch model that has already been loaded
eet_quantize(torch_model, init_only=False, include=[nn.Linear], exclude=["lm_head"], device="cuda:0")
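
The `include` and `exclude` arguments suggest a filter that selects layers by type and skips some by name. The sketch below is not EETQ's implementation; it is a hypothetical illustration of how a filter of that shape could work, using stand-in classes instead of real `torch.nn` modules. It assumes that `exclude` matches the last component of a module name, which may not be how EETQ matches names.

```python
from typing import Dict, Iterable, List, Type


class FakeLinear:
    """Stand-in for a linear layer (not torch.nn.Linear)."""


class FakeLayerNorm:
    """Stand-in for a normalization layer."""


def select_layers(named_modules: Dict[str, object],
                  include: Iterable[Type],
                  exclude: Iterable[str]) -> List[str]:
    """Return names of modules whose type is included and whose name is not excluded."""
    include_types = tuple(include)
    excluded_names = set(exclude)
    selected = []
    for name, module in named_modules.items():
        if not isinstance(module, include_types):
            continue  # wrong layer type, leave it untouched
        if name.split(".")[-1] in excluded_names:
            continue  # assumed matching rule: compare the last name component
        selected.append(name)
    return selected


modules = {
    "layers.0.q_proj": FakeLinear(),
    "layers.0.norm": FakeLayerNorm(),
    "lm_head": FakeLinear(),
}
print(select_layers(modules, include=[FakeLinear], exclude=["lm_head"]))
```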

Quantize a torch model and optimize it with flash attention

...
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16)
from eetq.utils import eet_accelerator
eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
model.to("cuda:0")

# Inference
res = model.generate(...)

Use EETQ for quantized inference acceleration in TGI (see the PR link).

text-generation-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...

Use EETQ in LoRAX. See the reference documentation.

lorax-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...

Examples

examples/models/llama_transformers_example.py

Performance

llama-13b (tested on a 3090), prompt=1024, max_new_tokens=50
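
To run a comparable measurement on your own hardware, a small timing harness is enough. The sketch below is a generic helper, not the script used for the numbers in this README: `benchmark_generate` and `fake_generate` are hypothetical names, and the stand-in callable only sleeps. With a real model, wrap `model.generate(...)` in the callable and synchronize the GPU (for example with `torch.cuda.synchronize()`) before returning, so that the timer sees the finished work.

```python
import time
from typing import Callable, Dict


def benchmark_generate(generate: Callable[[], int], warmup: int = 1, runs: int = 3) -> Dict[str, float]:
    """Time a callable that performs one generation and returns the number of new tokens."""
    for _ in range(warmup):
        generate()  # warm-up runs are not timed
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    return {
        "seconds_per_run": elapsed / runs,
        "tokens_per_second": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


def fake_generate() -> int:
    """Stand-in for model.generate(...) with max_new_tokens=50."""
    time.sleep(0.01)
    return 50


stats = benchmark_generate(fake_generate)
print(stats)
```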
