[Doc] Add Imbalance label guide and reorg (#1176)

*Issue #, if available:*

*Description of changes:* This PR adds one new document on handling imbalanced labels in classification and regression. It also reorganizes `Advanced Topics` slightly by removing the advanced-usage doc and splitting its contents into two separate docs. The PR renames `Advanced Topics` to `Practical & Advanced Guides` to reflect the complexity under this category, and adds the new docs to the index page's `Practical and Advanced Guides` section.

Preview link: http://james4graphstorm.readthedocs.io/en/james_adv_imbalance/#

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-55-95.us-west-2.compute.internal>
Co-authored-by: Theodore Vasiloudis <theodoros.vasiloudis@gmail.com>
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-0-244.us-west-2.compute.internal>

1 parent 03af1e4, commit 6758c8d
Showing 6 changed files with 293 additions and 66 deletions.

.. _imbalanced_labels:

Deal with Imbalanced Labels in Classification/Regression
=========================================================

In some cases, the number of labels of different classes can be imbalanced, i.e., some classes have
either too many or too few data points. For example, most fraud detection tasks have only a small
number of fraudulent activities (positive labels) compared to a huge number of legitimate activities
(negative labels). Even regression tasks can encounter label imbalance when a few dominant values
occur far more often than others. If not handled properly, imbalanced labels can significantly
degrade classification/regression model performance. For example, when models are trained on
overwhelmingly negative labels, they may learn to classify all unseen samples as negative. GraphStorm
provides several ways to tackle the class imbalance problem.

For classification tasks, users can configure two arguments in the command line interfaces (CLIs):
``imbalance_class_weights`` and ``class_loss_func``.

The ``imbalance_class_weights`` argument allows users to assign a scale weight to each class, forcing
models to focus more on the classes with higher weights. For example, if there are 10 positive labels
versus 90 negative labels, you can set ``imbalance_class_weights`` to ``0.1,0.9``, meaning class 0
(usually negative labels) has weight ``0.1``, and class 1 (usually positive labels) has weight ``0.9``.
This places more importance on correctly classifying positive samples and less on negative ones. Below
is an example of how to set ``imbalance_class_weights`` in a YAML configuration file.

.. code-block:: yaml

    imbalance_class_weights: 0.1,0.9
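
Conceptually, these class weights play the same role as the ``weight`` argument of a standard
weighted cross-entropy loss. The PyTorch sketch below, with made-up logits, illustrates that effect;
it is not GraphStorm's internal code:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    # Hypothetical logits for four samples over two classes
    # (class 0: negative, class 1: positive).
    logits = torch.tensor([[2.0, 0.5], [1.5, 0.2], [0.3, 1.2], [2.2, 0.1]])
    labels = torch.tensor([0, 0, 1, 0])

    # Unweighted cross-entropy treats all classes equally.
    plain = F.cross_entropy(logits, labels)

    # Weights mirroring `imbalance_class_weights: 0.1,0.9`: errors on the
    # minority (positive) class contribute far more to the loss.
    weighted = F.cross_entropy(logits, labels, weight=torch.tensor([0.1, 0.9]))

    print(plain.item(), weighted.item())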

You can also set ``focal`` as the value of the ``class_loss_func`` configuration, which will use the
`focal loss function <https://arxiv.org/abs/1708.02002>`_ in binary classification tasks. The focal loss
function is designed for imbalanced classes. Its formula is :math:`loss(p_t) = -\alpha_t(1-p_t)^{\gamma}\log(p_t)`,
where :math:`p_t=p` if :math:`y=1`, and :math:`p_t = 1-p` otherwise. Here :math:`p` is the predicted
probability in a binary classification. This function has two hyperparameters, :math:`\alpha` and
:math:`\gamma`, corresponding to the ``alpha`` and ``gamma`` configurations in GraphStorm. Larger values
of ``gamma`` put more weight on updating the model for hard cases, which helps detect more positive
samples when the positive-to-negative ratio is small. There is no clear guideline for values of
``alpha``. You can use its default value (``0.25``) first, and then search for optimal values. Below is
an example of how to set the focal loss function in a YAML configuration file.

.. code-block:: yaml

    class_loss_func: focal
    gamma: 10.0
    alpha: 0.5
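
To make the formula concrete, below is a minimal NumPy sketch of the binary focal loss, an
illustration of the math above rather than GraphStorm's implementation:

.. code-block:: python

    import numpy as np

    def focal_loss(p, y, alpha=0.25, gamma=2.0):
        """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
        # p_t is the predicted probability of the true class.
        p_t = np.where(y == 1, p, 1.0 - p)
        # alpha weights the positive class; (1 - alpha) the negative class.
        alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
        return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

    # An easy positive (p = 0.9) is down-weighted far more than a hard one (p = 0.6).
    print(focal_loss(np.array([0.9, 0.6]), np.array([1, 1])))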

Apart from focal loss and class weights, you can also output the classification results as probabilities
of the positive and negative classes by setting the ``return_proba`` configuration to ``true``. By
default, GraphStorm outputs classification results using the argmax values, e.g., either 0s or 1s in
binary tasks, which is equivalent to using ``0.5`` as the threshold to separate positive from negative
samples. With probabilities as outputs, you can apply different thresholds to achieve the desired
outcomes. For example, if you need higher recall to catch more suspicious positive samples, a smaller
threshold, e.g., ``0.25``, will classify more samples as positive. You may also use methods like the
`ROC curve` or `Precision-Recall curve` to determine the optimal threshold. Below is an example of how
to set ``return_proba`` in a YAML configuration file.

.. code-block:: yaml

    return_proba: true
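
As a sketch of how custom thresholding works on the saved probabilities (the array values here are
hypothetical):

.. code-block:: python

    import numpy as np

    # Hypothetical positive-class probabilities for five samples.
    probs = np.array([0.15, 0.30, 0.55, 0.70, 0.95])

    # The default argmax output is equivalent to a 0.5 threshold.
    default_preds = (probs >= 0.5).astype(int)   # [0, 0, 1, 1, 1]

    # A lower threshold favors recall: more samples are flagged as positive.
    recall_preds = (probs >= 0.25).astype(int)   # [0, 1, 1, 1, 1]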

For regression tasks where some dominant values, e.g., 0s, appear in the labels, GraphStorm provides the
`shrinkage loss function <https://openaccess.thecvf.com/content_ECCV_2018/html/Xiankai_Lu_Deep_Regression_Tracking_ECCV_2018_paper.html>`_,
which can be enabled by setting ``shrinkage`` as the value of the ``regression_loss_func`` configuration.
Its formula is :math:`loss = l^2/(1 + \exp \left( \alpha \cdot (\gamma - l)\right))`, where :math:`l` is
the absolute difference between predictions and labels. The shrinkage loss function also has the
:math:`\alpha` and :math:`\gamma` hyperparameters. You can use the same ``alpha`` and ``gamma``
configurations as for the focal loss function to modify their values. The shrinkage loss penalizes the
importance of easy samples (when :math:`l < 0.5`) and keeps the loss of hard samples unchanged. Below is
an example of how to set the shrinkage loss function in a YAML configuration file.

.. code-block:: yaml

    regression_loss_func: shrinkage
    gamma: 0.2
    alpha: 5
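
Likewise, below is a minimal NumPy sketch of the shrinkage loss formula, for illustration only:

.. code-block:: python

    import numpy as np

    def shrinkage_loss(pred, label, alpha=5.0, gamma=0.2):
        """Shrinkage loss: l**2 / (1 + exp(alpha * (gamma - l))), l = |pred - label|."""
        l = np.abs(pred - label)
        # The sigmoid-like factor shrinks the squared loss for easy samples
        # (small l) while leaving hard samples (large l) almost unchanged.
        return l ** 2 / (1.0 + np.exp(alpha * (gamma - l)))

    # Easy sample (l = 0.1) is shrunk heavily; hard sample (l = 1.0) stays near l**2.
    print(shrinkage_loss(np.array([0.1, 1.0]), np.array([0.0, 0.0])))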

.. _multi_target_ntypes:

Multiple Target Node Types Training
===================================

When training on a heterogeneous graph, we often need to train a model by minimizing the objective
function on more than one node type. GraphStorm provides support for achieving this goal. The recommended
method is to leverage GraphStorm's multi-task learning capability, i.e., using multiple node tasks, each
trained on one target node type.

A more detailed guide to using multi-task learning can be found in
:ref:`Multi-task Learning in GraphStorm<multi_task_learning>`. This guide provides two examples of how
to conduct classification training on two target node types with the `MovieLens 100k <https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset>`_
data, where both the **movie** ("item" in the original data) and **user** node types have classification
labels associated with them.

Using multi-task learning for multiple target node types training (Recommended)
--------------------------------------------------------------------------------

Preparing the training data
............................

During the graph construction step, you can define two classification tasks on the two node types, as
shown in the JSON example below.

.. code-block:: json

    {
        "version": "gconstruct-v0.1",
        "nodes": [
            {
                "node_type": "movie",
                ......
                ],
                "labels": [
                    {
                        "label_col": "label_movie",
                        "task_type": "classification",
                        "split_pct": [0.8, 0.1, 0.1],
                        "mask_field_names": ["train_mask_movie",
                                             "val_mask_movie",
                                             "test_mask_movie"]
                    },
                ]
            },
            {
                "node_type": "user",
                ......
                ],
                "labels": [
                    {
                        "label_col": "label_user",
                        "task_type": "classification",
                        "split_pct": [0.2, 0.2, 0.6],
                        "mask_field_names": ["train_mask_user",
                                             "val_mask_user",
                                             "test_mask_user"]
                    },
                ]
            },
        ],
        ......
    }

The above configuration defines two classification tasks, one for the **movie** nodes and one for the
**user** nodes. Each node type has its own ``label_col`` and train/validation/test mask fields. You can
then follow the instructions in :ref:`Run graph construction<run-graph-construction>` to use the
GraphStorm construction tool to create the partitioned graph data.
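
For reference, the construction tool is typically invoked as shown below; the config path, output
directory, graph name, and partition count are illustrative placeholders, and the guide linked above
is the authoritative reference for the available options.

.. code-block:: bash

    python -m graphstorm.gconstruct.construct_graph \
              --conf-file <PATH_TO_JSON_CONFIG> \
              --output-dir <PATH_TO_GRAPH_DATA> \
              --num-parts 1 \
              --graph-name <GRAPH_NAME>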

Define multiple tasks for model training
.........................................

Now, you can specify two training tasks by providing the ``multi_task_learning`` configurations in
the training configuration YAML file, as in the example below.

.. code-block:: yaml

    ---
    version: 1.0
    gsf:
      basic:
        ...
      multi_task_learning:
        - node_classification:
            target_ntype: "movie"
            label_field: "label_movie"
            mask_fields:
              - "train_mask_movie"
              - "val_mask_movie"
              - "test_mask_movie"
            num_classes: 10
            task_weight: 0.5
        - node_classification:
            target_ntype: "user"
            label_field: "label_user"
            mask_fields:
              - "train_mask_user"
              - "val_mask_user"
              - "test_mask_user"
            task_weight: 1.0
      ...

The above configuration defines one classification task for the **movie** node type and another one
for the **user** node type. The two node classification tasks use their own label fields, i.e.,
``label_movie`` and ``label_user``, and their own train/validation/test mask fields. The configuration
also prioritizes classification on **user** nodes (``task_weight = 1.0``) over classification on
**movie** nodes (``task_weight = 0.5``).

Run multi-task model training
..............................

You can use the ``graphstorm.run.gs_multi_task_learning`` command to run multi-task learning,
as in the following example.

.. code-block:: bash

    python -m graphstorm.run.gs_multi_task_learning \
              --workspace <PATH_TO_WORKSPACE> \
              --num-trainers 1 \
              --num-servers 1 \
              --part-config <PATH_TO_GRAPH_DATA> \
              --cf <PATH_TO_CONFIG>

Run multi-task model inference
...............................

For inference, you can use the same command ``graphstorm.run.gs_multi_task_learning`` with an
additional argument ``--inference``, as follows:

.. code-block:: bash

    python -m graphstorm.run.gs_multi_task_learning \
              --inference \
              --workspace <PATH_TO_WORKSPACE> \
              --num-trainers 1 \
              --num-servers 1 \
              --part-config <PATH_TO_GRAPH_DATA> \
              --cf <PATH_TO_CONFIG> \
              --save-prediction-path <PATH_TO_OUTPUT>

The prediction results of each prediction task will be saved into different sub-directories under
``<PATH_TO_OUTPUT>``. The sub-directories are named with the prefix ``<task_type>_<node/edge_type>_<label_name>``.

Using multi-target node type training (Not Recommended)
--------------------------------------------------------

You can also use GraphStorm's multi-target node types configuration, but this method is less
flexible than the multi-task learning method.

- Train on multiple node types: You only need to set ``target_ntype`` in the model config
  YAML file to minimize the objective function defined on multiple target node types. For example,
  by setting ``target_ntype`` as follows, we can jointly optimize the objective function defined
  on the "movie" and "user" node types.

  .. code-block:: yaml

      target_ntype:
        - movie
        - user

- During evaluation, you need to choose a single node type. For example, by setting
  ``eval_target_ntype: movie``, GraphStorm will only perform evaluation on the "movie" node type, as
  shown in the snippet after this item. GraphStorm only supports evaluating on a single node type.
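
  In the model config YAML file, this looks like:

  .. code-block:: yaml

      eval_target_ntype: movie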

- Per target node type decoder: You may also want to use a different decoder for each node type,
  where the output dimension of each decoder may be different. You can achieve this by setting
  ``num_classes`` in the model config YAML file. For example, by setting ``num_classes`` as follows,
  GraphStorm will create a decoder with an output dimension of 3 for the movie node type, and a
  decoder with an output dimension of 7 for the user node type.

  .. code-block:: yaml

      num_classes:
        movie: 3
        user: 7

- Reweighting the loss function: You may also want to use a customized loss function reweighting
  for each node type, which can be achieved by setting ``multilabel``, ``multilabel_weights``, and
  ``imbalance_class_weights``. Examples are illustrated below. The current implementation does
  not support different ``multilabel`` settings across node types.

  .. code-block:: yaml

      multilabel:
        movie: true
        user: true
      multilabel_weights:
        movie: 0.1,0.2,0.3
        user: 0.1,0.2,0.3,0.4,0.5,0.0

      multilabel:
        movie: false
        user: false
      imbalance_class_weights:
        movie: 0.1,0.2,0.3
        user: 0.1,0.2,0.3,0.4,0.5,0.0