[Doc] Add Imbalance label guide and reorg (#1176)
*Issue #, if available:*

*Description of changes:*
This PR adds a new document on handling imbalanced labels in
classification and regression. It also slightly reorganizes the `Advanced Topics`
section by removing the advanced-usages doc and splitting its contents into
two separate docs. The PR renames `Advanced Topics` to
`Practical & Advanced Guides` to reflect the complexity under this
category, and adds the new docs to the index page's `Practical and Advanced
Guides` section.

Preview link: http://james4graphstorm.readthedocs.io/en/james_adv_imbalance/#

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-55-95.us-west-2.compute.internal>
Co-authored-by: Theodore Vasiloudis <theodoros.vasiloudis@gmail.com>
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-0-244.us-west-2.compute.internal>
5 people authored Feb 20, 2025
1 parent 03af1e4 commit 6758c8d
Showing 6 changed files with 293 additions and 66 deletions.
47 changes: 0 additions & 47 deletions docs/source/advanced/advanced-usages.rst

This file was deleted.

73 changes: 73 additions & 0 deletions docs/source/advanced/imbalanced-labels.rst
@@ -0,0 +1,73 @@
.. _imbalanced_labels:

Deal with Imbalanced Labels in Classification/Regression
=========================================================

In some cases, the labels of different classes are imbalanced, i.e., some classes
have far more or far fewer data points than others. For example, most fraud detection tasks have only a
small number of fraudulent activities (positive labels) versus a huge number of legitimate activities
(negative labels). Even in regression tasks, a few dominant values can make the label distribution
imbalanced. If not handled properly, imbalanced labels can substantially degrade classification/regression
performance. For example, when too many negative labels are fed into a model, the model
may learn to classify all unseen samples as negative. GraphStorm
provides several ways to tackle the class imbalance problem.

For classification tasks, users can configure two arguments in the command line interfaces (CLIs):
``imbalance_class_weights`` and ``class_loss_func``.

``imbalance_class_weights`` lets users assign a scale weight to each class, forcing models
to pay more attention to classes with higher weights. For example, if there are 10 positive labels versus
90 negative labels, you can set ``imbalance_class_weights`` to ``0.1, 0.9``, meaning class 0 (usually
the negative class) has weight ``0.1``, and class 1 (usually the positive class) has weight ``0.9``.
This places more importance on correctly classifying positive samples and less on negative ones. Below
is an example of how to set ``imbalance_class_weights`` in a YAML configuration file.

.. code-block:: yaml

    imbalance_class_weights: 0.1,0.9
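
To see what such weights do mechanically, below is a minimal PyTorch sketch of a weighted
cross-entropy loss. It illustrates the general mechanism under that assumption; it is not
GraphStorm's internal code.

.. code-block:: python

    import torch
    import torch.nn.functional as F

    # Class weights mirroring imbalance_class_weights: 0.1,0.9
    # (class 0 = negative, class 1 = positive).
    class_weights = torch.tensor([0.1, 0.9])

    logits = torch.randn(8, 2)          # model outputs for 8 samples, 2 classes
    labels = torch.randint(0, 2, (8,))  # ground-truth class ids

    # Each sample's loss is scaled by the weight of its true class, so errors
    # on the rare positive class cost nine times more than on negatives.
    loss = F.cross_entropy(logits, labels, weight=class_weights)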

You can also set ``focal`` as the value of the ``class_loss_func`` configuration, which will use the
`focal loss function <https://arxiv.org/abs/1708.02002>`_ in binary classification tasks. The focal loss
function is designed for imbalanced classes. Its formula is :math:`loss(p_t) = -\alpha_t(1-p_t)^{\gamma}\log(p_t)`,
where :math:`p_t=p` if :math:`y=1`, and :math:`p_t = 1-p` otherwise. Here :math:`p` is the predicted
probability of the positive class in a binary classification. This function has two hyperparameters, :math:`\alpha` and :math:`\gamma`,
corresponding to the ``alpha`` and ``gamma`` configurations in GraphStorm. Larger values of ``gamma`` put more
weight on hard, misclassified cases, which helps detect more positive samples when the positive-to-negative ratio is small.
There is no clear guideline for values of ``alpha``. You can start with its default value (``0.25``), and then
search for optimal values. Below is an example of how to set the focal loss function in a YAML configuration file.

.. code-block:: yaml

    class_loss_func: focal
    gamma: 10.0
    alpha: 0.5
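
For reference, below is a minimal PyTorch sketch of the focal loss formula above, assuming the
standard convention that :math:`\alpha_t = \alpha` for positive samples and :math:`1-\alpha` for
negative ones; it illustrates the math rather than reproducing GraphStorm's implementation.

.. code-block:: python

    import torch

    def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
        """loss(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
        # p: predicted probability of the positive class; y: labels in {0, 1}.
        p_t = torch.where(y == 1, p, 1 - p)
        alpha_t = torch.where(y == 1,
                              torch.full_like(p, alpha),
                              torch.full_like(p, 1 - alpha))
        # (1 - p_t)^gamma down-weights easy, well-classified samples.
        return (-alpha_t * (1 - p_t).pow(gamma)
                * torch.log(p_t.clamp(min=1e-8))).mean()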

Apart from focal loss and class weights, you can also output the classification results as probabilities of the positive
and negative classes by setting the ``return_proba`` configuration to ``true``. By default, GraphStorm outputs
classification results using the argmax values, e.g., either 0s or 1s in binary tasks, which is equivalent to using
``0.5`` as the threshold to separate negative from positive samples. With probabilities as outputs, you can apply
different thresholds to achieve the desired outcome. For example, if you need higher recall to catch more suspicious
positive samples, a smaller threshold, e.g., ``0.25``, will classify more samples as positive. You can also use methods
like the ROC curve or the Precision-Recall curve to determine the optimal threshold. Below is an example of how
to set ``return_proba`` in a YAML configuration file.

.. code-block:: yaml

    return_proba: true
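
As an illustration of threshold tuning, below is a small scikit-learn sketch that picks a decision
threshold from the Precision-Recall curve; the arrays and the target recall are hypothetical, and how
you load the saved probabilities depends on your output format.

.. code-block:: python

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    # Ground-truth binary labels and positive-class probabilities
    # (e.g., exported with return_proba: true).
    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
    y_prob = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.45, 0.7, 0.9])

    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

    # Choose the highest threshold that still reaches the recall you need;
    # recall[:-1] aligns with thresholds, which has one fewer entry.
    target_recall = 0.9
    ok = recall[:-1] >= target_recall
    threshold = thresholds[ok].max() if ok.any() else 0.5
    y_pred = (y_prob >= threshold).astype(int)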

For regression tasks where some values, e.g., 0s, dominate the labels, GraphStorm provides the
`shrinkage loss function <https://openaccess.thecvf.com/content_ECCV_2018/html/Xiankai_Lu_Deep_Regression_Tracking_ECCV_2018_paper.html>`_,
which can be enabled by setting ``shrinkage`` as the value of the ``regression_loss_func`` configuration. Its formula is
:math:`loss = l^2/(1 + \exp \left( \alpha \cdot (\gamma - l)\right))`, where :math:`l` is the absolute difference
between predictions and labels. The shrinkage loss function also has the :math:`\alpha` and :math:`\gamma` hyperparameters,
which you can modify through the same ``alpha`` and ``gamma`` configurations used by the focal loss function. The shrinkage
loss down-weights easy samples (when :math:`l < 0.5`) and keeps the loss of hard samples unchanged. Below is
an example of how to set the shrinkage loss function in a YAML configuration file.

.. code-block:: yaml

    regression_loss_func: shrinkage
    gamma: 0.2
    alpha: 5
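
To make the formula concrete, below is a minimal PyTorch sketch of the shrinkage loss as written
above; it is an illustration of the equation, not GraphStorm's code.

.. code-block:: python

    import torch

    def shrinkage_loss(pred, target, alpha=10.0, gamma=0.2):
        """loss = l^2 / (1 + exp(alpha * (gamma - l))), l = |pred - target|."""
        l = (pred - target).abs()
        # The sigmoid-like denominator shrinks the loss of easy samples
        # (small l) while leaving hard samples (large l) almost unchanged.
        return (l.pow(2) / (1 + torch.exp(alpha * (gamma - l)))).mean()
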
196 changes: 196 additions & 0 deletions docs/source/advanced/multi-target-ntypes.rst
@@ -0,0 +1,196 @@
.. _multi_target_ntypes:

Multiple Target Node Types Training
===================================

When training on a heterogeneous graph, we often need to train a model by minimizing the objective
function on more than one node type. GraphStorm provides support for this goal. The recommended
approach is to leverage GraphStorm's multi-task learning method, i.e., defining multiple node tasks,
each trained on one target node type.

A more detailed guide to multi-task learning can be found in
:ref:`Multi-task Learning in GraphStorm<multi_task_learning>`. This guide provides two examples of how
to conduct classification training on two target node types using the `MovieLens 100k <https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset>`_
data, where both the **movie** ("item" in the original data) and **user** node types have classification
labels associated with them.

Using multi-task learning for multiple target node types training (Recommended)
--------------------------------------------------------------------------------

Preparing the training data
............................

During the graph construction step, you can define two classification tasks on the two node types as
shown in the JSON example below.

.. code-block:: json

    {
        "version": "gconstruct-v0.1",
        "nodes": [
            {
                "node_type": "movie",
                ......
                ],
                "labels": [
                    {
                        "label_col": "label_movie",
                        "task_type": "classification",
                        "split_pct": [0.8, 0.1, 0.1],
                        "mask_field_names": ["train_mask_movie",
                                             "val_mask_movie",
                                             "test_mask_movie"]
                    },
                ]
            },
            {
                "node_type": "user",
                ......
                ],
                "labels": [
                    {
                        "label_col": "label_user",
                        "task_type": "classification",
                        "split_pct": [0.2, 0.2, 0.6],
                        "mask_field_names": ["train_mask_user",
                                             "val_mask_user",
                                             "test_mask_user"]
                    },
                ]
            },
        ],
        ......
    }

The above configuration defines two classification tasks, for the **movie** nodes and the **user** nodes, respectively.
Each node type has its own ``label_col`` and train/validation/test mask fields. You can then follow
the instructions in :ref:`Run graph construction<run-graph-construction>` to use the GraphStorm
construction tool to create partitioned graph data.

Define multi-task for model training
....................................

Now, you can specify two training tasks by providing the ``multi_task_learning`` configurations in
the training configuration YAML file, as in the example below.

.. code-block:: yaml

    ---
    version: 1.0
    gsf:
      basic:
        ...
      multi_task_learning:
        - node_classification:
            target_ntype: "movie"
            label_field: "label_movie"
            mask_fields:
              - "train_mask_movie"
              - "val_mask_movie"
              - "test_mask_movie"
            num_classes: 10
            task_weight: 0.5
        - node_classification:
            target_ntype: "user"
            label_field: "label_user"
            mask_fields:
              - "train_mask_user"
              - "val_mask_user"
              - "test_mask_user"
            task_weight: 1.0
      ...

The above configuration defines one classification task for the **movie** node type and another one
for the **user** node type. The two node classification tasks use their own label names, i.e.,
``label_movie`` and ``label_user``, and their own train/validation/test mask fields. The configuration
also assigns a higher weight to classification on **user** nodes (``task_weight = 1.0``) than to
classification on **movie** nodes (``task_weight = 0.5``).

Run multi-task model training
..............................

You can use the `graphstorm.run.gs_multi_task_learning` command to run multi-task learning tasks,
as in the following example.

.. code-block:: bash

    python -m graphstorm.run.gs_multi_task_learning \
           --workspace <PATH_TO_WORKSPACE> \
           --num-trainers 1 \
           --num-servers 1 \
           --part-config <PATH_TO_GRAPH_DATA> \
           --cf <PATH_TO_CONFIG>

Run multi-task model inference
..............................

For inference, you can use the same command line `graphstorm.run.gs_multi_task_learning` with an
additional argument `--inference`, as follows:

.. code-block:: bash

    python -m graphstorm.run.gs_multi_task_learning \
           --inference \
           --workspace <PATH_TO_WORKSPACE> \
           --num-trainers 1 \
           --num-servers 1 \
           --part-config <PATH_TO_GRAPH_DATA> \
           --cf <PATH_TO_CONFIG> \
           --save-prediction-path <PATH_TO_OUTPUT>

The prediction results of each prediction task will be saved into different sub-directories under
``<PATH_TO_OUTPUT>``. The sub-directories are prefixed with ``<task_type>_<node/edge_type>_<label_name>``.
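
As a hypothetical illustration of that layout, a small Python snippet like the following could list
each task's result directory; the exact directory names depend on your tasks.

.. code-block:: python

    from pathlib import Path

    output_dir = Path("<PATH_TO_OUTPUT>")  # same path as --save-prediction-path

    # Each task writes to a sub-directory prefixed with
    # <task_type>_<node/edge_type>_<label_name>.
    for task_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
        print(task_dir.name)  # e.g., node_classification_movie_label_movie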

Using multi-target node type training (Not Recommended)
-------------------------------------------------------

You can also use GraphStorm's multi-target node types configuration, but this method is less
flexible than the multi-task learning method.

- Train on multiple node types: The users only need to edit the ``target_ntype`` in the model config
  YAML file to minimize the objective function defined on multiple target node types. For example,
  by setting ``target_ntype`` as follows, we can jointly optimize the objective function defined
  on the "movie" and "user" node types.

  .. code-block:: yaml

      target_ntype:
        - movie
        - user

- During evaluation, the users need to choose a single node type. For example, by setting
  ``eval_target_ntype: movie``, we will only perform evaluation on the "movie" node type. GraphStorm
  only supports evaluation on a single node type.

- Per target node type decoder: The users may also want to use a different decoder on each node type,
  where the output dimension for each decoder may be different. We can achieve this by setting ``num_classes``
  in the model config YAML file. For example, by setting ``num_classes`` as follows, GraphStorm will
  create a decoder with an output dimension of 3 for the movie node type, and a decoder with an output
  dimension of 7 for the user node type.

  .. code-block:: yaml

      num_classes:
        movie: 3
        user: 7

- Reweighting on loss function: The users may also want to use a customized loss function with reweighting
  for each node type, which can be achieved by setting ``multilabel``, ``multilabel_weights``, and
  ``imbalance_class_weights``. Examples are illustrated below. The current implementation does
  not support different ``multilabel`` settings for different node types.

  .. code-block:: yaml

      multilabel:
        movie: true
        user: true
      multilabel_weights:
        movie: 0.1,0.2,0.3
        user: 0.1,0.2,0.3,0.4,0.5,0.0

      multilabel:
        movie: false
        user: false
      imbalance_class_weights:
        movie: 0.1,0.2,0.3
        user: 0.1,0.2,0.3,0.4,0.5,0.0

14 changes: 8 additions & 6 deletions docs/source/advanced/multi-task-learning.rst
@@ -277,15 +277,15 @@ You can define an edge feature reconstruction task as the following example:
eval_metric:
- "mse"
In the configuration, ``target_etype`` defines the target edge type to which the reconstruct edge
feature learning will be applied. ``reconstruct_efeat_name`` defines the name of the feature to be
reconstructed. The other configs are the same as for edge regression tasks.


Run Model Training
~~~~~~~~~~~~~~~~~~~
GraphStorm introduces a new command line `graphstorm.run.gs_multi_task_learning` to run multi-task
learning tasks. You can use the following command to start a multi-task training job:

.. code-block:: bash
@@ -298,7 +298,8 @@
Run Model Inference
~~~~~~~~~~~~~~~~~~~~
You can use the same command line `graphstorm.run.gs_multi_task_learning` with an additional
argument `--inference` to run inference, as follows:

.. code-block:: bash
@@ -312,7 +313,8 @@
--save-prediction-path <PATH_TO_OUTPUT>
The prediction results of each prediction task (node classification, node regression,
edge classification and edge regression) will be saved into different sub-directories under PATH_TO_OUTPUT.
The sub-directories are prefixed with the `<task_type>_<node/edge_type>_<label_name>`.

Run Model Training on SageMaker
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12 changes: 6 additions & 6 deletions docs/source/cli/model-training-inference/configuration-run.rst
@@ -397,14 +397,14 @@ General Configurations
- For link prediction tasks, the default value is ``mrr``.
- **gamma**: Set the value of the hyperparameter denoted by the symbol gamma. Gamma is used in the following cases: i/ focal loss for binary classification ii/ DistMult score function for link prediction, iii/ TransE score function for link prediction, iv/ RotatE score function for link prediction, v/ shrinkage loss for regression.

- Yaml: ``gamma: 2.0``
- Argument: ``--gamma 2.0``
- Default value: ``2.0`` in focal loss function; ``0.2`` in shrinkage loss function; ``12.0`` in ``distmult``, ``RotatE``, and ``TransE`` link prediction decoders.
- **alpha**: Set the value of the hyperparameter denoted by the symbol alpha. Alpha is used in the following cases: i/ focal loss for binary classification and ii/ shrinkage loss for regression.

- Yaml: ``alpha: 0.25``
- Argument: ``--alpha 0.25``
- Default value: ``0.25`` in focal loss function; ``10.0`` in shrinkage loss function.

Classification and Regression Task
```````````````````````````````````
