Efficient Transformers Library

Latest news 🔥

[11/2024] Finite adapters support allows mixed adapter usage for PEFT models.
[11/2024] Speculative decoding: a target language model (TLM) loaded through QEFFAutoModelForCausalLM can be compiled to return more than one logit during decode.
[11/2024] Added support for Meta-Llama-3.3-70B-Instruct, Meta-Llama-3.2-1B and Meta-Llama-3.2-3B
[09/2024] AWQ/GPTQ 4-bit quantized models are supported
[09/2024] Added support for PEFT models

More

[09/2024] Added support for Gemma-2-Family
[09/2024] Added support for CodeGemma-Family
[09/2024] Added support for Gemma-Family
[09/2024] Added support for Meta-Llama-3.1-8B
[09/2024] Added support for Meta-Llama-3.1-8B-Instruct
[09/2024] Added support for Meta-Llama-3.1-70B-Instruct
[09/2024] Added support for granite-20b-code-base
[09/2024] Added support for granite-20b-code-instruct-8k
[09/2024] Added support for Starcoder1-15B
[08/2024] Added support for inference optimization technique continuous batching
[08/2024] Added support for Jais-adapted-70b
[08/2024] Added support for Jais-adapted-13b-chat
[08/2024] Added support for Jais-adapted-7b
[06/2024] Added support for GPT-J-6B
[06/2024] Added support for Qwen2-1.5B-Instruct
[06/2024] Added support for StarCoder2-15B
[06/2024] Added support for Phi3-Mini-4K-Instruct
[06/2024] Added support for Codestral-22B-v0.1
[06/2024] Added support for Vicuna-v1.5
[05/2024] Added support for Mixtral-8x7B & Mistral-7B-Instruct-v0.1.
[04/2024] Initial release of efficient transformers for seamless inference on pre-trained LLMs.

Overview

Train anywhere, Infer on Qualcomm Cloud AI with a Developer-centric Toolchain

This library provides reimplemented LLM blocks that make models functional and highly performant on Qualcomm Cloud AI 100. Several models can be transformed directly from their pre-trained original form to a deployment-ready optimized form. For other models, comprehensive documentation and how-tos describe the changes needed.

Typically for LLMs, the library provides:

Reimplemented blocks from Transformers that enable efficient on-device retention of intermediate states.
Graph transformations to enable execution of key operations in lower precision.
Graph transformations to replace some operations with mathematically equivalent operations.
Handling for underflows and overflows in lower precision.
Patcher modules to map the weights of the original model's operations to the updated model's operations.
Exporter module to export the model source into an ONNX graph.
Sample applications and demo notebooks.
Unit test templates.
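
As a rough illustration of the overflow and underflow item above, the sketch below clamps values to the finite half-precision (fp16) range before casting. It is not the library's implementation; the helper names `to_fp16` and `clamp_to_fp16` are hypothetical, and the standard-library `struct` module stands in for a real tensor cast.

```python
import struct

# Largest finite IEEE 754 half-precision (fp16) value.
FP16_MAX = 65504.0


def to_fp16(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision.

    struct raises OverflowError when the value is too large for fp16,
    and silently rounds very small magnitudes to zero (underflow).
    """
    return struct.unpack("<e", struct.pack("<e", value))[0]


def clamp_to_fp16(value: float) -> float:
    """Hypothetical helper: clamp to the finite fp16 range, then cast."""
    clamped = max(-FP16_MAX, min(FP16_MAX, value))
    return to_fp16(clamped)


if __name__ == "__main__":
    # A value far outside the fp16 range is saturated instead of overflowing.
    print(clamp_to_fp16(1.0e6))
    # A very small magnitude underflows to zero after the cast.
    print(to_fp16(1.0e-10))
```

Saturating at the largest finite value keeps later operations finite, at the cost of losing the true magnitude. Whether that trade-off is acceptable depends on the operation being lowered.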

It is mandatory for each Pull Request to include tests such as:

If the PR adds support for a model, the tests should include successful execution of the model, with the PR's changes applied, on PyTorch and ONNX Runtime. The exit criterion is the MSE between the outputs of the original model and the updated model.
If the PR modifies any common utilities, it must run the tests for all models included in the library.
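
The MSE criterion above can be sketched in plain Python as follows. This is only an illustration: the function names and the tolerance `MSE_TOLERANCE` are assumptions, since this README does not state a threshold, and a real test would compare tensors produced by PyTorch and ONNX Runtime, not Python lists.

```python
from typing import Sequence

# Hypothetical tolerance; this README does not state a threshold.
MSE_TOLERANCE = 1e-5


def mean_squared_error(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Return the mean squared error between two equal-length output vectors."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    if not reference:
        raise ValueError("outputs must not be empty")
    return sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)


def outputs_match(
    reference: Sequence[float],
    candidate: Sequence[float],
    tolerance: float = MSE_TOLERANCE,
) -> bool:
    """Return True when the MSE between the two outputs is within tolerance."""
    return mean_squared_error(reference, candidate) <= tolerance


if __name__ == "__main__":
    # Stand-ins for the outputs of the original and the updated model.
    original_logits = [0.10, -1.25, 3.40, 0.00]
    updated_logits = [0.10, -1.25, 3.40, 0.001]
    print(outputs_match(original_logits, updated_logits))
```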

Quick Installation

# Create Python virtual env and activate it. (Recommended Python 3.10)
sudo apt install python3.10-venv
python3.10 -m venv qeff_env
source qeff_env/bin/activate
pip install -U pip

# Clone and Install the QEfficient Repo.
pip install git+https://github.com/quic/efficient-transformers

# Or build wheel package using the below command.
pip install build wheel
python -m build --wheel --outdir dist
pip install dist/QEfficient-0.0.1.dev0-py3-none-any.whl

For more details about using QEfficient via the Cloud AI 100 Apps SDK, see the Linux Installation Guide.
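
To confirm that the installation succeeded, a small check like the one below may help. It assumes the distribution is registered under the name `QEfficient`, as the wheel filename above suggests; `installed_version` is a hypothetical helper, not part of the library.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "QEfficient") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("QEfficient is not installed in this environment")
    else:
        print(f"QEfficient {version} is installed")
```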

Documentation

Quick Start Guide
Python API
Validated Models
Models coming soon

Note: More details are here: https://quic.github.io/cloud-ai-sdk-pages/latest/Getting-Started/Model-Architecture-Support/Large-Language-Models/llm/

Acknowledgements

Thanks to:

Hugging Face Transformers for their work on LLM and GenAI modeling implementations.
The ONNX, PyTorch, and ONNX Runtime communities.

Support

If you run into any problems with the code, please file GitHub issues directly in this repo.

Contributing

This project welcomes contributions and suggestions. Please check the License. Integration with a CLA Bot is underway.
