
Commit

Merge branch 'master' into ov_file_path_util_file_size
mlukasze authored Feb 3, 2025
2 parents afacbb0 + 4c01a98 commit 99a18fb
Showing 42 changed files with 882 additions and 891 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/windows_vs2019_release.yml
@@ -45,7 +45,7 @@ jobs:
repo_token: ${{ secrets.GITHUB_TOKEN }}
skip_when_only_listed_labels_set: 'docs'
skip_when_only_listed_files_changed: '*.md,*.rst,*.png,*.jpg,*.svg,*/layer_tests_summary/*,*/conformance/*'

- name: Get target branch
id: set_target_branch
run: |
@@ -192,7 +192,7 @@ jobs:
sparse-checkout: |
src/bindings/js
path: 'openvino'

- name: Download OpenVINO artifacts (JS)
uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8
with:
@@ -223,7 +223,7 @@ jobs:
run: call npm test

- name: Add msbuild to PATH
uses: microsoft/setup-msbuild@v2
uses: microsoft/setup-msbuild@6fb02220983dee41ce7ae257b6f4d8f9bf5ed4ce # v2

- name: E2E of openvino-node package
working-directory: ${{ env.OPENVINO_JS_DIR }}/node
31 changes: 31 additions & 0 deletions .github/workflows/workflows_scans.yml
@@ -18,6 +18,37 @@ concurrency:
permissions: read-all

jobs:
codeql:
name: github_actions_workflows_scan/codeql
# Runner size impacts CodeQL analysis time. To learn more, please see:
# - https://gh.io/recommended-hardware-resources-for-running-codeql
# - https://gh.io/supported-runners-and-hardware-resources
# - https://gh.io/using-larger-runners
# Consider using larger runners for possible analysis time improvements.
runs-on: ubuntu-22.04
timeout-minutes: 60
permissions:
security-events: write
steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
timeout-minutes: 15
with:
submodules: 'false'
sparse-checkout: .github/workflows

# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@df409f7d9260372bd5f19e5b04e83cb3c43714ae # v3.27.9
with:
languages: "actions"
build-mode: "none"

- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@df409f7d9260372bd5f19e5b04e83cb3c43714ae # v3.27.9
with:
category: "/language:actions"

semgrep:
name: github_actions_workflows_scan/semgrep
runs-on: ubuntu-latest
2 changes: 1 addition & 1 deletion docs/articles_en/about-openvino/release-notes-openvino.rst
@@ -105,7 +105,7 @@ Deprecation And Support
Using deprecated features and components is not advised. They are available to enable a smooth
transition to new solutions and will be discontinued in the future. To keep using discontinued
features, you will have to revert to the last LTS OpenVINO version supporting them.
For more details, refer to the `OpenVINO Legacy Features and Components <https://docs.openvino.ai/2024/documentation/legacy-features.html>__`
For more details, refer to the `OpenVINO Legacy Features and Components <https://docs.openvino.ai/2025/documentation/legacy-features.html>__`
page.


@@ -262,7 +262,7 @@ You need a model that is specific for your inference task. You can get it from o
Convert the Model
--------------------

If Your model requires conversion, check the `article <https://docs.openvino.ai/2025/get-started/learn-openvino/openvino-samples/get-started-demos.html>`__ for information how to do it.
If your model requires conversion, check the :doc:`article <../../../openvino-workflow/model-preparation>` for information on how to do it.
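For orientation only, a minimal conversion sketch might look as follows, assuming a hypothetical ``model.onnx`` source file (``ov.convert_model`` can also take in-memory PyTorch or TensorFlow objects; check the linked article for the exact call for your framework):

.. code-block:: python

   import openvino as ov

   # Convert the source model to OpenVINO's in-memory representation.
   # "model.onnx" is a placeholder path used for illustration only.
   ov_model = ov.convert_model("model.onnx")

   # Serialize to OpenVINO IR (model.xml + model.bin) for later inference.
   ov.save_model(ov_model, "model.xml")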

.. _download-media:

@@ -5,8 +5,9 @@ LLM Weight Compression
:maxdepth: 1
:hidden:

weight-compression/microscaling-quantization
weight-compression/4-bit-weight-quantization
weight-compression/microscaling-quantization



Weight compression enhances the efficiency of models by reducing their memory footprint,
@@ -16,14 +17,13 @@ Unlike full model quantization, where both weights and activations are quantized
only targets weights, keeping activations as floating-point numbers. This means preserving most
of the model's accuracy while improving its
speed and reducing its size. The reduction in size is especially noticeable with larger models.
For instance the 7 billion parameter Llama 2 model can be reduced
from about 25GB to 4GB using 4-bit weight compression.
For instance the 8 billion parameter Llama 3 model can be reduced
from about 16.1 GB to 4.8 GB using 4-bit weight quantization on top of bfloat16 model.

.. note::

With smaller language models (i.e. less than 1B parameters), weight
With smaller language models (i.e. less than 1B parameters), low-bit weight
compression may result in more accuracy reduction than with larger models.
Therefore, weight compression is recommended for use with LLMs only.

LLMs and other GenAI models that require
extensive memory to store the weights during inference can benefit
@@ -36,7 +36,7 @@ from weight compression as it:
* improves inference speed by reducing the latency of memory access when computing the
operations with weights, for example, Linear layers. The weights are smaller and thus
faster to load from memory;
* unlike quantization, does not require sample data to calibrate the range of
* unlike full static quantization, does not require sample data to calibrate the range of
activation values.

Currently, `NNCF <https://github.com/openvinotoolkit/nncf>`__
@@ -64,7 +64,7 @@ by running the following command:
pip install optimum[openvino]
**8-bit weight quantization** offers a good balance between reducing the size and lowering the
accuracy of a model. It usually results in significant improvements for transformer-based models
accuracy of a model. It usually results in significant improvements for Transformer-based models
and guarantees good model performance for a vast majority of supported CPU and GPU platforms.
By default, weights are compressed asymmetrically to "INT8_ASYM" mode.
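A minimal sketch of this default 8-bit path, assuming ``nncf.compress_weights`` applies the INT8_ASYM mode described above when called with no extra arguments and that ``model.xml`` is a placeholder IR path:

.. code-block:: python

   import nncf
   import openvino as ov

   # Read an existing OpenVINO IR model ("model.xml" is a placeholder path).
   model = ov.Core().read_model("model.xml")

   # With no extra arguments, weights are compressed to 8 bit (INT8_ASYM by default).
   compressed_model = nncf.compress_weights(model)

   # Save the compressed model as a new IR.
   ov.save_model(compressed_model, "model_int8_asym.xml")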

@@ -223,17 +223,6 @@ depending on the model.
For more details, refer to the article on how to
:doc:`infer LLMs using Optimum Intel <../../../openvino-workflow-generative/inference-with-optimum-intel>`.

The code snippet below shows how to do 4-bit quantization of the model weights represented
in OpenVINO IR using NNCF:

.. tab-set::

.. tab-item:: OpenVINO
:sync: openvino

.. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py
:language: python
:fragment: [compression_4bit]

Refer to the article about
:doc:`4-bit weight quantization <./weight-compression/4-bit-weight-quantization>`
@@ -133,7 +133,12 @@ trade-offs after optimization:
There are three modes: INT8_ASYM, INT8_SYM, and NONE, which retains
the original floating-point precision of the model weights (``INT8_ASYM`` is default value).

|


.. tip::

NNCF allows stacking the supported optimization methods. For example, AWQ, Scale Estimation
and GPTQ methods can be enabled all together to achieve better accuracy.
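A sketch of such a stacked 4-bit configuration, assuming a recent NNCF release where ``nncf.compress_weights`` exposes ``awq``, ``scale_estimation`` and ``gptq`` flags and requires a calibration dataset when they are enabled (``model``, ``samples`` and ``transform_fn`` are placeholders you would supply):

.. code-block:: python

   import nncf
   from nncf import CompressWeightsMode

   # A small calibration set is needed for data-aware methods such as AWQ,
   # Scale Estimation and GPTQ; `samples` and `transform_fn` are placeholders.
   calibration_dataset = nncf.Dataset(samples, transform_fn)

   compressed_model = nncf.compress_weights(
       model,                              # an openvino.Model loaded elsewhere
       mode=CompressWeightsMode.INT4_SYM,  # 4-bit symmetric weight quantization
       group_size=128,                     # per-group quantization granularity
       ratio=1.0,                          # share of weights compressed to 4 bit
       dataset=calibration_dataset,
       awq=True,
       scale_estimation=True,
       gptq=True,
   )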

4-bit Weight Quantization with GPTQ
###################################
16 changes: 5 additions & 11 deletions docs/articles_en/openvino-workflow/model-optimization.rst
@@ -21,24 +21,24 @@ In OpenVINO, the default optimization tool is NNCF (Neural Network Compression F
It is a `set of compression algorithms <https://github.com/openvinotoolkit/nncf/blob/develop/README.md>`__,
organized as a Python package, that make your models smaller and faster. Note that NNCF
is **not part of the OpenVINO package**, so it needs to be installed separately. It supports
models in **PyTorch**, **TensorFlow** , **ONNX**, and **OpenVINO IR** formats, offering
models in **OpenVINO IR**, **PyTorch**, **ONNX**, and **TensorFlow** formats, offering
the following main optimizations:

.. image:: ../assets/images/WHAT_TO_USE.svg


| :doc:`Weight Compression <model-optimization-guide/weight-compression>`:
| an easy-to-use method for Large Language Model footprint reduction and inference
| An easy-to-use method for Large Language Model footprint reduction and inference
acceleration.
| :doc:`Post-training Quantization <model-optimization-guide/quantizing-models-post-training>`:
| designed to optimize deep learning models by applying 8-bit integer quantization. Being
| Designed to optimize deep learning models by applying 8-bit integer quantization. Being
the easiest way to optimize a model it does not require its retraining or fine-tuning
but may result in a drop in accuracy. If the accuracy-performance tradeoff is not
acceptable, Training-time Optimization may be a better option.
| :doc:`Training-time Optimization <model-optimization-guide/compressing-models-during-training>`:
| involves a suite of advanced methods such as Structured or Unstructured Pruning, as well
| Involves a suite of advanced methods such as Structured or Unstructured Pruning, as well
as Quantization-aware Training. This kind of optimization requires the use of the model's
original framework, for NNCF, it is either PyTorch or TensorFlow.
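For the post-training quantization path listed above, a minimal sketch, assuming the ``nncf.quantize`` / ``nncf.Dataset`` API and placeholder ``data_source`` / ``transform_fn`` objects you would provide:

.. code-block:: python

   import nncf
   import openvino as ov

   # Load the model to optimize ("model.xml" is a placeholder IR path).
   model = ov.Core().read_model("model.xml")

   # Wrap a representative data source; transform_fn maps one item to model inputs.
   calibration_dataset = nncf.Dataset(data_source, transform_fn)

   # Apply default 8-bit post-training quantization.
   quantized_model = nncf.quantize(model, calibration_dataset)

   ov.save_model(quantized_model, "model_int8.xml")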
@@ -54,13 +54,7 @@ Recommended workflows
3. If the accuracy drop is unacceptable, use quantization-aware training instead. It will give
you the same level of performance boost, with a smaller impact on accuracy.

* **Weight compression** works **only with LLMs**. Do not try to use it with other models.
* For **visual-multimodal** use cases, the encoder / decoder split approach may be recommended.





* **Weight compression** works with **LLMs**, **VLMs** and other Transformer-based models.



2 changes: 1 addition & 1 deletion docs/notebooks/convert-to-openvino-with-output.rst
@@ -54,7 +54,7 @@ OpenVINO IR format


OpenVINO `Intermediate Representation
(IR) <https://docs.openvino.ai/2024/documentation/openvino-ir-format.html>`__
(IR) <https://docs.openvino.ai/2025/documentation/openvino-ir-format.html>`__
is the proprietary model format of OpenVINO. It is produced after
converting a model with model conversion API. Model conversion API
translates the frequently used deep learning operations to their
@@ -941,7 +941,7 @@ advance and fill it in as the inference requests are executed.
Let’s compare the models and plot the results.
**Note**: To get a more accurate benchmark, use the `Benchmark Python
Tool <https://docs.openvino.ai/2024/get-started/learn-openvino/openvino-samples/benchmark-tool.html>`__
Tool <https://docs.openvino.ai/2025/get-started/learn-openvino/openvino-samples/benchmark-tool.html>`__
.. code:: ipython3
@@ -623,7 +623,7 @@ Compare Performance of the FP32 IR Model and Quantized Models

To measure the inference performance of the ``FP32`` and ``INT8``
models, we use `Benchmark
Tool <https://docs.openvino.ai/2024/get-started/learn-openvino/openvino-samples/benchmark-tool.html>`__
Tool <https://docs.openvino.ai/2025/get-started/learn-openvino/openvino-samples/benchmark-tool.html>`__
- OpenVINO’s inference performance measurement tool. Benchmark tool is a
command line application, part of OpenVINO development tools, that can
be run in the notebook with ``! benchmark_app`` or
2 changes: 1 addition & 1 deletion docs/notebooks/ddcolor-image-colorization-with-output.rst
@@ -499,7 +499,7 @@ Compare inference time of the FP16 and INT8 models

To measure the inference performance of OpenVINO FP16 and INT8 models,
use `Benchmark
Tool <https://docs.openvino.ai/2024/get-started/learn-openvino/openvino-samples/benchmark-tool.html>`__.
Tool <https://docs.openvino.ai/2025/get-started/learn-openvino/openvino-samples/benchmark-tool.html>`__.

**NOTE**: For the most accurate performance estimation, it is
recommended to run ``benchmark_app`` in a terminal/command prompt
2 changes: 1 addition & 1 deletion docs/notebooks/depth-anything-v2-with-output.rst
@@ -977,7 +977,7 @@ Compare inference time of the FP16 and INT8 models

To measure the inference performance of OpenVINO FP16 and INT8 models,
use `Benchmark
Tool <https://docs.openvino.ai/2024/get-started/learn-openvino/openvino-samples/benchmark-tool.html>`__.
Tool <https://docs.openvino.ai/2025/get-started/learn-openvino/openvino-samples/benchmark-tool.html>`__.

**NOTE**: For the most accurate performance estimation, it is
recommended to run ``benchmark_app`` in a terminal/command prompt
2 changes: 1 addition & 1 deletion docs/notebooks/depth-anything-with-output.rst
@@ -940,7 +940,7 @@ Compare inference time of the FP16 and INT8 models

To measure the inference performance of OpenVINO FP16 and INT8 models,
use `Benchmark
Tool <https://docs.openvino.ai/2024/get-started/learn-openvino/openvino-samples/benchmark-tool.html>`__.
Tool <https://docs.openvino.ai/2025/get-started/learn-openvino/openvino-samples/benchmark-tool.html>`__.

**NOTE**: For the most accurate performance estimation, it is
recommended to run ``benchmark_app`` in a terminal/command prompt
