Merge branch 'main' into pull-2.6-binaries
svekars authored Jan 24, 2025
2 parents 88cae7b + 2a30921 commit e263354
Showing 10 changed files with 207 additions and 12 deletions.
6 changes: 6 additions & 0 deletions .lycheeignore
@@ -12,3 +12,9 @@ https://pytorch.org/tutorials/beginner/colab/n

# Ignore local host link from intermediate_source/tensorboard_tutorial.rst
http://localhost:6006

# Ignore local host link from recipes_source/deployment_with_flask.rst
http://localhost:5000/predict

# Ignore local host link from advanced_source/cpp_frontend.rst
https://www.uber.com/blog/deep-neuroevolution/
4 changes: 2 additions & 2 deletions advanced_source/cpp_frontend.rst
@@ -57,7 +57,7 @@ the right tool for the job. Examples for such environments include:
Multiprocessing is an alternative, but not as scalable and has significant
shortcomings. C++ has no such constraints and threads are easy to use and
create. Models requiring heavy parallelization, like those used in `Deep
Neuroevolution <https://eng.uber.com/deep-neuroevolution/>`_, can benefit from
Neuroevolution <https://www.uber.com/blog/deep-neuroevolution/>`_, can benefit from
this.
- **Existing C++ Codebases**: You may be the owner of an existing C++
application doing anything from serving web pages in a backend server to
@@ -662,7 +662,7 @@ Defining the DCGAN Modules
We now have the necessary background and introduction to define the modules for
the machine learning task we want to solve in this post. To recap: our task is
to generate images of digits from the `MNIST dataset
<http://yann.lecun.com/exdb/mnist/>`_. We want to use a `generative adversarial
<https://huggingface.co/datasets/ylecun/mnist>`_. We want to use a `generative adversarial
network (GAN)
<https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf>`_ to solve
this task. In particular, we'll use a `DCGAN architecture
6 changes: 5 additions & 1 deletion en-wordlist.txt
@@ -81,6 +81,8 @@ FX
FX's
FairSeq
Fastpath
FakeTensor
FakeTensors
FFN
FloydHub
FloydHub's
@@ -368,6 +370,8 @@ downsample
downsamples
dropdown
dtensor
dtype
dtypes
duration
elementwise
embeddings
@@ -615,6 +619,7 @@ triton
uint
UX
umap
unbacked
uncomment
uncommented
underflowing
@@ -651,7 +656,6 @@ RecSys
TorchRec
sharding
TBE
dtype
EBC
sharder
hyperoptimized
2 changes: 1 addition & 1 deletion intermediate_source/FSDP_tutorial.rst
@@ -11,7 +11,7 @@ It also comes with considerable engineering complexity to handle the training of
`PyTorch FSDP <https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/>`__, released in PyTorch 1.11, makes this easier.

In this tutorial, we show how to use `FSDP APIs <https://pytorch.org/docs/stable/fsdp.html>`__ for simple MNIST models that can be extended to other larger models such as `HuggingFace BERT models <https://huggingface.co/blog/zero-deepspeed-fairscale>`__,
`GPT 3 models up to 1T parameters <https://pytorch.medium.com/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ . The sample DDP MNIST code has been borrowed from `here <https://github.com/yqhu/mnist_examples>`__.
`GPT 3 models up to 1T parameters <https://pytorch.medium.com/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__. The sample DDP MNIST code is courtesy of `Patrick Hu <https://github.com/yqhu/>`_.


How FSDP works
2 changes: 1 addition & 1 deletion intermediate_source/ddp_series_minGPT.rst
@@ -6,7 +6,7 @@ training <ddp_series_multinode.html>`__ \|\| **minGPT Training**
Training “real-world” models with DDP
=====================================

Authors: `Suraj Subramanian <https://github.com/suraj813>`__
Authors: `Suraj Subramanian <https://github.com/subramen>`__

.. grid:: 2

2 changes: 1 addition & 1 deletion intermediate_source/ddp_series_multinode.rst
@@ -6,7 +6,7 @@ training** \|\| `minGPT Training <ddp_series_minGPT.html>`__
Multinode Training
==================

Authors: `Suraj Subramanian <https://github.com/suraj813>`__
Authors: `Suraj Subramanian <https://github.com/subramen>`__

.. grid:: 2

6 changes: 3 additions & 3 deletions intermediate_source/dynamic_quantization_bert_tutorial.rst
@@ -138,7 +138,7 @@ the following helper functions: one for converting the text examples
into the feature vectors; the other one for measuring the F1 score of
the predicted result.

The `glue_convert_examples_to_features <https://github.com/huggingface/transformers/blob/master/transformers/data/processors/glue.py>`_ function converts the texts into input features:
The `glue_convert_examples_to_features <https://github.com/huggingface/transformers/blob/main/src/transformers/data/datasets/glue.py>`_ function converts the texts into input features:

- Tokenize the input sequences;
- Insert [CLS] in the beginning;
@@ -147,7 +147,7 @@ The `glue_convert_examples_to_features <https://github.com/huggingface/transform
- Generate token type ids to indicate whether a token belongs to the
first sequence or the second sequence.

The `glue_compute_metrics <https://github.com/huggingface/transformers/blob/master/transformers/data/processors/glue.py>`_ function has the compute metrics with
The `glue_compute_metrics <https://github.com/huggingface/transformers/blob/main/src/transformers/data/metrics/__init__.py#L60>`_ function computes metrics with
the `F1 score <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html>`_, which
can be interpreted as a weighted average of the precision and recall,
where an F1 score reaches its best value at 1 and worst score at 0. The
@@ -273,7 +273,7 @@ We load the tokenizer and fine-tuned BERT sequence classifier model
2.3 Define the tokenize and evaluation function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We reuse the tokenize and evaluation function from `HuggingFace <https://github.com/huggingface/transformers/blob/master/examples/run_glue.py>`_.
We reuse the tokenize and evaluation function from `HuggingFace <https://github.com/huggingface/transformers/blob/main/examples/legacy/pytorch-lightning/run_glue.py>`_.

.. code:: python
185 changes: 185 additions & 0 deletions intermediate_source/torch_export_tutorial.py
@@ -629,6 +629,191 @@ def forward(self, x, y):
"bool_val": None,
}

######################################################################
# Data-dependent errors
# ---------------------
#
# While trying to export models, you may have encountered errors like "Could not guard on data-dependent expression", or "Could not extract specialized integer from data-dependent expression".
# These errors exist because ``torch.export()`` compiles programs using FakeTensors, which symbolically represent their real tensor counterparts. While these have equivalent symbolic properties
# (e.g. sizes, strides, dtypes), they diverge in that FakeTensors do not contain any data values. While this avoids unnecessary memory usage and expensive computation, it does mean that export may be
# unable to compile parts of user code out of the box when compilation relies on data values. In short, if the compiler requires a concrete, data-dependent value in order to proceed, it will error out,
# complaining that the value is not available.
#
# Data-dependent values appear in many places, and common sources are calls like ``item()``, ``tolist()``, or ``torch.unbind()`` that extract scalar values from tensors.
# How are these values represented in the exported program? In the `Constraints/Dynamic Shapes <https://pytorch.org/tutorials/intermediate/torch_export_tutorial.html#constraints-dynamic-shapes>`_
# section, we talked about allocating symbols to represent dynamic input dimensions.
# The same happens here: we allocate symbols for every data-dependent value that appears in the program. The important distinction is that these are "unbacked" symbols,
# in contrast to the "backed" symbols allocated for input dimensions. The `"backed/unbacked" <https://pytorch.org/docs/main/export.programming_model.html#basics-of-symbolic-shapes>`_
# nomenclature refers to the presence or absence of a "hint" for the symbol: a concrete value backing the symbol that can inform the compiler how to proceed.
#
# In the input shape symbol case (backed symbols), these hints are simply the sample input shapes provided, which explains why control-flow branching is determined by the sample input properties.
# For data-dependent values, the symbols are taken from FakeTensor "data" during tracing, and so the compiler doesn't know the actual values (hints) that these symbols would take on.
#
# Let's see how these show up in exported programs:

class Foo(torch.nn.Module):
    def forward(self, x, y):
        a = x.item()
        b = y.tolist()
        return b + [a]

inps = (
    torch.tensor(1),
    torch.tensor([2, 3]),
)
ep = export(Foo(), inps)
print(ep)

######################################################################
# The result is that 3 unbacked symbols (notice they're prefixed with "u", instead of the usual "s" for input shape/backed symbols) are allocated and returned:
# 1 for the ``item()`` call, and 1 for each of the elements of ``y`` from the ``tolist()`` call.
# Note from the range constraints field that these take on ranges of ``[-int_oo, int_oo]``, not the default ``[0, int_oo]`` range allocated to input shape symbols,
# since we have no information on what these values are - they don't represent sizes, so don't necessarily have positive values.

######################################################################
# Guards, torch._check()
# ^^^^^^^^^^^^^^^^^^^^^^
#
# But the case above is easy to export, because the concrete values of these symbols aren't used in any compiler decision-making; all that's relevant is that the return values are unbacked symbols.
# The data-dependent errors highlighted in this section are cases like the following, where `data-dependent guards <https://pytorch.org/docs/main/export.programming_model.html#control-flow-static-vs-dynamic>`_ are encountered:

class Foo(torch.nn.Module):
    def forward(self, x, y):
        a = x.item()
        if a // 2 >= 5:
            return y + 2
        else:
            return y * 5

######################################################################
# Here we actually need the "hint", i.e. the concrete value of ``a``, for the compiler to decide whether to trace ``return y + 2`` or ``return y * 5`` as the output.
# Because we trace with FakeTensors, we don't know what ``a // 2 >= 5`` actually evaluates to, and export errors out with "Could not guard on data-dependent expression ``u0 // 2 >= 5 (unhinted)``".
#
# So how do we export this toy model? Unlike ``torch.compile()``, export requires full graph compilation, and we can't just graph break on this. Here are some basic options:
#
# 1. Manual specialization: we could intervene by selecting the branch to trace, either by rewriting the code to contain only the specialized branch, or by using ``torch.compiler.is_compiling()`` to guard what's traced at compile-time.
# 2. ``torch.cond()``: we could rewrite the control-flow code to use ``torch.cond()`` so we don't specialize on a branch (see the sketch below).
#
# While these options are valid, they have their pitfalls. Option 1 sometimes requires drastic, invasive rewrites of the model code to specialize, and ``torch.cond()`` is not a comprehensive system for handling data-dependent errors.
# As we will see, there are data-dependent errors that do not involve control-flow.
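#
# As an illustration of option 2, here is a minimal sketch (an added example, assuming the documented
# ``torch.cond(pred, true_fn, false_fn, operands)`` calling convention) of how the model above could be
# rewritten so that both branches stay in the graph and neither is specialized away:

class FooCond(torch.nn.Module):
    def forward(self, x, y):
        a = x.item()

        def add_two(y):
            return y + 2

        def times_five(y):
            return y * 5

        # The data-dependent predicate is carried into the exported graph
        # instead of being resolved at trace time.
        return torch.cond(a // 2 >= 5, add_two, times_five, (y,))

######################################################################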
#
# The generally recommended approach is to start with ``torch._check()`` calls. While these look like plain assert statements, they are in fact a mechanism for informing the compiler of properties of symbols.
# While a ``torch._check()`` call does act as an assertion at runtime, when traced at compile-time the checked expression is sent to the symbolic shapes subsystem for reasoning, and any properties that follow
# from the expression being true are stored on the relevant symbols (provided the subsystem is smart enough to infer them). So even though unbacked symbols don't have hints, if we can communicate properties
# that are generally true for these symbols via ``torch._check()`` calls, we can potentially bypass data-dependent guards without rewriting the offending model code.
#
# For example, in the model above, inserting ``torch._check(a >= 10)`` would tell the compiler that ``y + 2`` can always be returned, and ``torch._check(a == 4)`` tells it to return ``y * 5``.
# See what happens when we re-export this model.

class Foo(torch.nn.Module):
    def forward(self, x, y):
        a = x.item()
        torch._check(a >= 10)
        torch._check(a <= 60)
        if a // 2 >= 5:
            return y + 2
        else:
            return y * 5

inps = (
    torch.tensor(32),
    torch.randn(4),
)
ep = export(Foo(), inps)
print(ep)

######################################################################
# Export succeeds, and note from the range constraints field that ``u0`` takes on a range of ``[10, 60]``.
#
# So what information do ``torch._check()`` calls actually communicate? This varies as the symbolic shapes subsystem gets smarter, but at a fundamental level, the following are generally understood:
#
# 1. Equality with non-data-dependent expressions: ``torch._check()`` calls that communicate equalities like ``u0 == s0 + 4`` or ``u0 == 5``.
# 2. Range refinement: calls that provide lower or upper bounds for symbols, like the above.
# 3. Some basic reasoning around more complicated expressions: inserting ``torch._check(a < 4)`` will typically tell the compiler that ``a >= 4`` is false. Checks on complex expressions like ``torch._check(a ** 2 - 3 * a <= 10)`` will typically get you past identical guards, as sketched below.
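#
# For instance, here is a minimal sketch of point 3 (an added illustration; it assumes the subsystem
# can match the recorded expression against the identical guard that follows):

class Foo(torch.nn.Module):
    def forward(self, x, y):
        a = x.item()
        # Recording this compound expression as true lets the identical
        # data-dependent guard below evaluate statically at trace time.
        torch._check(a ** 2 - 3 * a <= 10)
        if a ** 2 - 3 * a <= 10:
            return y + 2
        else:
            return y * 5

inps = (
    torch.tensor(5),
    torch.randn(4),
)
ep = export(Foo(), inps)
print(ep)

######################################################################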
#
# As mentioned previously, ``torch._check()`` calls have applicability outside of data-dependent control flow. For example, here's a model where ``torch._check()`` insertion
# succeeds while manual specialization and ``torch.cond()`` do not:

class Foo(torch.nn.Module):
    def forward(self, x, y):
        a = x.item()
        return y[a]

inps = (
    torch.tensor(32),
    torch.randn(60),
)
export(Foo(), inps)

######################################################################
# Here is a scenario where ``torch._check()`` insertion is required simply to prevent an operation from failing. The export call will fail with
# "Could not guard on data-dependent expression ``-u0 > 60``", implying that the compiler doesn't know if this is a valid indexing operation -
# if the value of ``x`` is out-of-bounds for ``y`` or not. Here, manual specialization is too restrictive, and ``torch.cond()`` has no place.
# Instead, informing the compiler of ``u0``'s range is sufficient:

class Foo(torch.nn.Module):
    def forward(self, x, y):
        a = x.item()
        torch._check(a >= 0)
        torch._check(a <= y.shape[0])
        return y[a]

inps = (
    torch.tensor(32),
    torch.randn(60),
)
ep = export(Foo(), inps)
print(ep)

######################################################################
# Specialized values
# ^^^^^^^^^^^^^^^^^^
#
# Another category of data-dependent error happens when the program attempts to extract a concrete data-dependent integer/float value
# while tracing. This looks something like "Could not extract specialized integer from data-dependent expression", and is analogous to
# the previous class of errors: where data-dependent guard errors arise from evaluating concrete boolean values, these errors arise
# when attempting to evaluate concrete integer/float values.
#
# This error typically occurs when there is an explicit or implicit ``int()`` cast on a data-dependent expression. For example, this list comprehension
# has a ``range()`` call that implicitly performs an ``int()`` cast on the data-dependent value ``a``:

class Foo(torch.nn.Module):
    def forward(self, x, y):
        a = x.item()
        # ``_`` as the loop variable avoids shadowing the input tensor ``y``.
        b = torch.cat([y for _ in range(a)], dim=0)
        return b + int(a)

inps = (
    torch.tensor(32),
    torch.randn(60),
)
export(Foo(), inps, strict=False)

######################################################################
# For these errors, some basic options you have are:
#
# 1. Avoid unnecessary ``int()`` casts; in this case, the ``int(a)`` in the return statement.
# 2. Use ``torch._check()`` calls; unfortunately, all you may be able to do in this case is specialize (with ``torch._check(a == 60)``), as sketched after the rewrite below.
# 3. Rewrite the offending code at a higher level. For example, the list comprehension is semantically a ``repeat()`` op, which doesn't involve an ``int()`` cast. The following rewrite avoids data-dependent errors:

class Foo(torch.nn.Module):
    def forward(self, x, y):
        a = x.item()
        b = y.unsqueeze(0).repeat(a, 1)
        return b + a

inps = (
    torch.tensor(32),
    torch.randn(60),
)
ep = export(Foo(), inps, strict=False)
print(ep)
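
######################################################################
# And here is a minimal sketch of option 2 (an added illustration): specializing with
# ``torch._check(a == 60)`` may let the implicit and explicit ``int()`` casts extract a
# concrete value, at the cost of generality - the exported program is then only valid
# for inputs where ``x`` equals 60.

class Foo(torch.nn.Module):
    def forward(self, x, y):
        a = x.item()
        # Specialize: the compiler now treats ``a`` as equal to 60.
        torch._check(a == 60)
        b = torch.cat([y for _ in range(a)], dim=0)
        return b + int(a)

inps = (
    torch.tensor(60),
    torch.randn(60),
)
ep = export(Foo(), inps, strict=False)
print(ep)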

######################################################################
# Data-dependent errors can be much more involved, and there are many more options in your toolkit to deal with them: ``torch._check_is_size()``, ``guard_size_oblivious()``, or real-tensor tracing, as starters.
# For more in-depth guides, please refer to the `Export Programming Model <https://pytorch.org/docs/main/export.programming_model.html>`_,
# or `Dealing with GuardOnDataDependentSymNode errors <https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs>`_.
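#
# As a small taste, here is a minimal ``torch._check_is_size()`` sketch (an added illustration,
# not from the guides above): marking an unbacked value as size-like tells the compiler it is
# non-negative, so it can safely be used as a tensor dimension.

class Foo(torch.nn.Module):
    def forward(self, x):
        a = x.item()
        # Without this, constructing a tensor of size ``a`` may fail with a
        # data-dependent guard error, since the compiler can't prove ``a >= 0``.
        torch._check_is_size(a)
        return torch.zeros(a)

ep = export(Foo(), (torch.tensor(32),), strict=False)
print(ep)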

######################################################################
# Custom Ops
# ----------
4 changes: 2 additions & 2 deletions intermediate_source/torchserve_with_ipex.rst
@@ -379,8 +379,8 @@ For interested readers, please check out the following documents:

- `CPU specific optimizations <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#cpu-specific-optimizations>`_
- `Maximize Performance of Intel® Software Optimization for PyTorch* on CPU <https://www.intel.com/content/www/us/en/developer/articles/technical/how-to-get-better-performance-on-pytorchcaffe2-with-intel-acceleration.html>`_
- `Performance Tuning Guide <https://intel.github.io/intel-extension-for-pytorch/tutorials/performance_tuning/tuning_guide.html>`_
- `Launch Script Usage Guide <https://intel.github.io/intel-extension-for-pytorch/tutorials/performance_tuning/launch_script.html>`_
- `Performance Tuning Guide <https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html>`_
- `Launch Script Usage Guide <https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/launch_script.html>`_
- `Top-down Microarchitecture Analysis Method <https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/methodologies/top-down-microarchitecture-analysis-method.html>`_
- `Configuring oneDNN for Benchmarking <https://oneapi-src.github.io/oneDNN/dev_guide_performance_settings.html#benchmarking-settings>`_
- `Intel® VTune™ Profiler <https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html#gs.tcbgpa>`_
2 changes: 1 addition & 1 deletion prototype_source/fx_graph_mode_ptq_static.rst
@@ -253,7 +253,7 @@ of the observers for activation and weight. ``QConfigMapping`` contains mapping
Utility functions related to ``qconfig`` can be found in the `qconfig <https://github.com/pytorch/pytorch/blob/master/torch/ao/quantization/qconfig.py>`_ file
while those for ``QConfigMapping`` can be found in the `qconfig_mapping <https://github.com/pytorch/pytorch/blob/master/torch/ao/quantization/fx/qconfig_mapping.py>`
while those for ``QConfigMapping`` can be found in the `qconfig_mapping <https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/fx/qconfig_mapping_utils.py>`_

.. code:: python
