Update LLM with GenAI guide
helena-intel committed Jan 31, 2025
1 parent 9930aea commit 917f253
Showing 1 changed file with 11 additions and 9 deletions.
@@ -32,15 +32,15 @@ Make sure your model works with NPU. Some models may not be supported, for example
Export an LLM model via Hugging Face Optimum-Intel
##################################################

-Since **symmetrically-quantized 4-bit (INT4) models are preffered for inference on NPU**, make
+Since **symmetrically-quantized 4-bit (INT4) models are supported for inference on NPU**, make
sure to export the model with the proper conversion and optimization settings.

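Exporting with ``optimum-cli`` assumes Optimum-Intel is installed with OpenVINO support. A minimal install sketch (the extras name follows Optimum packaging and is an assumption, not part of this file):

.. code-block:: console

   pip install optimum[openvino]
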
| You may export LLMs via Optimum-Intel, using one of two compression methods:
-| **group quantization** - for both smaller and larger models,
-| **channel-wise quantization** - remarkably effective but for models exceeding 1 billion parameters.
+| **group quantization** - recommended for smaller models (<4B parameters),
+| **channel-wise quantization** - recommended for larger models (>4B parameters).
-You select one of the methods by setting the ``--group-size`` parameter to either ``128`` or
-``-1``, respectively. See the following examples:
+You select one of the methods by setting the ``--group-size`` parameter. For group quantization, use
+``--group-size 128``; for channel-wise quantization, use ``--group-size -1``. See the following examples:

.. tab-set::

@@ -63,12 +63,13 @@ You select one of the methods by setting the ``--group-size`` parameter to either
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --ratio 1.0 --group-size -1 Llama-2-7b-chat-hf
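The group quantization tab is collapsed in this view; for comparison, a hedged sketch of the equivalent export with ``--group-size 128``, modeled on the channel-wise command above rather than copied from the file:

.. code-block:: console

   optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --ratio 1.0 --group-size 128 Llama-2-7b-chat-hf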
.. tab-item:: Data-aware quantization

If you want to improve accuracy, make sure you:

1. Update NNCF: ``pip install nncf==2.13``
-2. Use ``--scale_estimation --dataset <dataset_name>`` and accuracy aware quantization ``--awq``:
+2. Use ``--scale-estimation --dataset <dataset_name>`` and accuracy-aware quantization ``--awq``:

.. code-block:: console
:name: channel-wise-data-aware-quant
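   # The command body is collapsed in this diff view. A hedged sketch, combining
   # the channel-wise export above with the step 2 options (<dataset_name> stays
   # a placeholder; this is not the exact line from the file):
   optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --ratio 1.0 --group-size -1 --scale-estimation --awq --dataset <dataset_name> Llama-2-7b-chat-hf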
@@ -88,11 +89,12 @@ which do not require specifying quantization parameters:

.. code-block:: console
-optimum-cli export openvino -m TheBloke/Llama-2-7B-Chat-GPTQ
+pip install auto-gptq
+optimum-cli export openvino -m TheBloke/Llama-2-7B-Chat-GPTQ Llama-2-7B-Chat-GPTQ
-| Remember, NPU supports GenAI models quantized symmetrically to INT4.
-| Below is a list of such models:
+| Remember, NPU supports LLMs quantized symmetrically to INT4.
+| Below is a list of supported models:
* meta-llama/Meta-Llama-3-8B-Instruct
* meta-llama/Llama-3.1-8B
