[Doc] Add Imbalance label guide and reorg #1176

Merged
merged 33 commits on Feb 20, 2025

Commits
7ea7a41 init adv imbalance doc (Feb 14, 2025)
6b4e513 1st version (Feb 15, 2025)
7c0c549 1st version (Feb 15, 2025)
a99abde reorg advanced topic (Feb 17, 2025)
6bac3c6 enhance index page (Feb 17, 2025)
f7decc2 add examples in imbalance (Feb 17, 2025)
f13130f refine (Feb 18, 2025)
84bbd86 break lines (Feb 18, 2025)
14916ab Update docs/source/advanced/imbalanced-labels.rst (zhjwy9343, Feb 18, 2025)
ba3f2af Update docs/source/advanced/imbalanced-labels.rst (zhjwy9343, Feb 18, 2025)
bc095eb Update docs/source/advanced/imbalanced-labels.rst (zhjwy9343, Feb 18, 2025)
cf1a394 Update docs/source/advanced/multi-target-ntypes.rst (zhjwy9343, Feb 18, 2025)
bc383f5 Update docs/source/advanced/imbalanced-labels.rst (zhjwy9343, Feb 18, 2025)
c1728eb Update docs/source/advanced/imbalanced-labels.rst (zhjwy9343, Feb 18, 2025)
04bb099 Update docs/source/advanced/imbalanced-labels.rst (zhjwy9343, Feb 18, 2025)
42ff53a Update docs/source/advanced/imbalanced-labels.rst (zhjwy9343, Feb 18, 2025)
bfacd07 Update docs/source/index.rst (zhjwy9343, Feb 18, 2025)
f324aaf Update docs/source/index.rst (zhjwy9343, Feb 18, 2025)
a155edb Update docs/source/advanced/imbalanced-labels.rst (zhjwy9343, Feb 18, 2025)
1324d6a change contents (Feb 18, 2025)
ae449eb rewrite multiple target node (Feb 19, 2025)
354b282 rewrite multiple target node (Feb 19, 2025)
e566761 Update docs/source/advanced/multi-target-ntypes.rst (zhjwy9343, Feb 19, 2025)
9687713 Update docs/source/advanced/multi-target-ntypes.rst (zhjwy9343, Feb 19, 2025)
c7c2de8 Update docs/source/advanced/multi-target-ntypes.rst (zhjwy9343, Feb 19, 2025)
f90e323 Update docs/source/advanced/multi-target-ntypes.rst (zhjwy9343, Feb 19, 2025)
0c0ef4b Update docs/source/advanced/multi-target-ntypes.rst (zhjwy9343, Feb 19, 2025)
3e185a4 Update docs/source/advanced/multi-target-ntypes.rst (zhjwy9343, Feb 19, 2025)
7254206 Update docs/source/advanced/multi-target-ntypes.rst (zhjwy9343, Feb 19, 2025)
e0408d5 Update docs/source/advanced/multi-target-ntypes.rst (zhjwy9343, Feb 19, 2025)
91ef484 Update docs/source/advanced/multi-target-ntypes.rst (zhjwy9343, Feb 19, 2025)
49c5e68 update multi-target (Feb 19, 2025)
b30450b add default values to alpha and gamma (Feb 19, 2025)
47 changes: 0 additions & 47 deletions docs/source/advanced/advanced-usages.rst

This file was deleted.

73 changes: 73 additions & 0 deletions docs/source/advanced/imbalanced-labels.rst
@@ -0,0 +1,73 @@
.. _imbalanced_labels:

Deal with Imbalanced Labels in Classification/Regression
=========================================================

In some cases, the labels of different classes can be imbalanced, i.e., some classes have far
more or far fewer data points than others. For example, most fraud detection tasks have only a
small number of fraudulent activities (positive labels) versus a huge number of legitimate activities
(negative labels). Even in regression tasks, a few dominant values in the labels can cause a similar
imbalance. If not handled properly, imbalanced labels can severely degrade the performance of
classification and regression models. For example, a model trained on overwhelmingly negative labels
may learn to classify all unseen samples as negative. GraphStorm
provides several ways to tackle the class imbalance problem.

For classification tasks, users can configure two arguments in command line interfaces (CLIs):
``imbalance_class_weights`` and ``class_loss_func``.

The ``imbalance_class_weights`` argument assigns a weight to each class, forcing models to pay more
attention to the classes with higher weights. For example, if there are 10 positive labels versus
90 negative labels, you can set ``imbalance_class_weights`` to ``0.1, 0.9``, meaning class 0 (usually
negative labels) has weight ``0.1``, and class 1 (usually positive labels) has weight ``0.9``.
This places more importance on correctly classifying positive samples and less on negative ones. Below
is an example of how to set ``imbalance_class_weights`` in a YAML configuration file.

.. code-block:: yaml

imbalance_class_weights: 0.1,0.9

You can also set ``focal`` as the value of the ``class_loss_func`` configuration, which applies the
`focal loss function <https://arxiv.org/abs/1708.02002>`_ to binary classification tasks. The focal loss
function is designed for imbalanced classes. Its formula is :math:`loss(p_t) = -\alpha_t(1-p_t)^{\gamma}\log(p_t)`,
where :math:`p_t=p` if :math:`y=1`, and :math:`p_t = 1-p` otherwise. Here :math:`p` is the predicted
probability of the positive class in a binary classification. This function has two hyperparameters,
:math:`\alpha` and :math:`\gamma`, corresponding to the ``alpha`` and ``gamma`` configurations in
GraphStorm. Larger values of ``gamma`` focus model updates on harder cases, helping to detect more
positive samples when the positive to negative ratio is small. There is no clear guideline for values
of ``alpha``. You can use its default value (``0.25``) first, and then search for optimal values.
Below is an example of how to set the focal loss function in a YAML configuration file.

.. code-block:: yaml

class_loss_func: focal

gamma: 10.0
alpha: 0.5
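
To build intuition for how ``alpha`` and ``gamma`` shape the loss, below is a minimal PyTorch sketch
of the binary focal loss formula above. This is an illustration only, not GraphStorm's internal
implementation.

.. code-block:: python

    import torch

    def binary_focal_loss(logits, labels, alpha=0.25, gamma=2.0):
        # labels: float tensor of 0.0/1.0; logits: raw model outputs.
        p = torch.sigmoid(logits)                   # predicted probability of the positive class
        p_t = p * labels + (1 - p) * (1 - labels)   # p_t = p if y == 1, else 1 - p
        alpha_t = alpha * labels + (1 - alpha) * (1 - labels)
        # loss(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
        return (-alpha_t * (1 - p_t).pow(gamma) * torch.log(p_t.clamp_min(1e-8))).mean()

A larger ``gamma`` shrinks the loss contribution of well-classified samples (large :math:`p_t`)
faster, so gradients concentrate on the hard, often minority-class, samples.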

Apart from focal loss and class weights, you can also output classification results as probabilities
of the positive and negative classes by setting the ``return_proba`` configuration to ``true``. By
default, GraphStorm outputs classification results as argmax values, e.g., either 0s or 1s in binary
tasks, which is equivalent to using ``0.5`` as the threshold for separating positive from negative
samples. With probabilities as outputs, you can apply different thresholds to achieve the desired
outcome. For example, if you need higher recall to catch more suspicious positive samples, a smaller
threshold, e.g., ``0.25``, will classify more samples as positive. You may also use methods like the
`ROC curve` or `Precision-Recall curve` to determine the optimal threshold. Below is an example of
how to set ``return_proba`` in a YAML configuration file.

.. code-block:: yaml

return_proba: true
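
With probabilities saved, threshold selection can be done offline on a validation set. Below is a
sketch using scikit-learn's ``precision_recall_curve``; it assumes you have loaded the predicted
positive-class probabilities and ground-truth labels as NumPy arrays, and the file names are
hypothetical placeholders.

.. code-block:: python

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    # Hypothetical inputs: positive-class probabilities and ground-truth labels.
    probs = np.load("val_probabilities.npy")
    labels = np.load("val_labels.npy")

    precision, recall, thresholds = precision_recall_curve(labels, probs)
    # Choose the threshold maximizing F1; any task-specific criterion works too.
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-8, None)
    best_threshold = thresholds[np.argmax(f1[:-1])]  # thresholds has one fewer entry
    predictions = (probs >= best_threshold).astype(int)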

For regression tasks whose labels contain dominant values, e.g., many 0s, GraphStorm provides the
`shrinkage loss function <https://openaccess.thecvf.com/content_ECCV_2018/html/Xiankai_Lu_Deep_Regression_Tracking_ECCV_2018_paper.html>`_,
which can be enabled by setting ``shrinkage`` as the value of the ``regression_loss_func`` configuration.
Its formula is :math:`loss = \frac{l^2}{1 + \exp\left(\alpha \cdot (\gamma - l)\right)}`, where :math:`l`
is the absolute difference between predictions and labels. The shrinkage loss function also has the
:math:`\alpha` and :math:`\gamma` hyperparameters, and you can use the same ``alpha`` and ``gamma``
configurations as the focal loss function to modify their values. The shrinkage loss down-weights
easy samples (where :math:`l < 0.5`) and keeps the loss of hard samples unchanged. Below is an example
of how to set the shrinkage loss function in a YAML configuration file.

.. code-block:: yaml

regression_loss_func: shrinkage

gamma: 0.2
alpha: 5
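
As with focal loss, a short PyTorch sketch of the shrinkage loss formula may help clarify the roles
of ``alpha`` (how fast easy samples are down-weighted) and ``gamma`` (where the down-weighting kicks
in). Again, this is an illustration rather than GraphStorm's internal code.

.. code-block:: python

    import torch

    def shrinkage_loss(preds, targets, alpha=10.0, gamma=0.2):
        l = (preds - targets).abs()  # absolute difference between predictions and labels
        # Easy samples (small l) are shrunk by the sigmoid-like factor;
        # hard samples (large l) keep a nearly unmodified squared loss.
        return (l.pow(2) / (1 + torch.exp(alpha * (gamma - l)))).mean()
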
196 changes: 196 additions & 0 deletions docs/source/advanced/multi-target-ntypes.rst
@@ -0,0 +1,196 @@
.. _multi_target_ntypes:

Multiple Target Node Types Training
===================================

When training on a heterogeneous graph, we often need to train a model by minimizing the objective
function on more than one node type. GraphStorm supports this goal. The recommended method is to
leverage GraphStorm's multi-task learning, i.e., using multiple node tasks, each trained on one
target node type.

A more detailed guide to multi-task learning can be found in
:ref:`Multi-task Learning in GraphStorm<multi_task_learning>`. This guide provides two examples of how
to conduct classification training on two target node types with the `MovieLens 100k <https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset>`_
data, where both the **movie** ("item" in the original data) and **user** node types have
classification labels associated with them.

Using multi-task learning for multiple target node types training (Recommended)
--------------------------------------------------------------------------------

Preparing the training data
............................

During the graph construction step, you can define two classification tasks on the two node types,
as shown in the JSON example below.

.. code-block:: json

{
    "version": "gconstruct-v0.1",
    "nodes": [
        {
            "node_type": "movie",
            ......
            "labels": [
                {
                    "label_col": "label_movie",
                    "task_type": "classification",
                    "split_pct": [0.8, 0.1, 0.1],
                    "mask_field_names": ["train_mask_movie",
                                         "val_mask_movie",
                                         "test_mask_movie"]
                }
            ]
        },
        {
            "node_type": "user",
            ......
            "labels": [
                {
                    "label_col": "label_user",
                    "task_type": "classification",
                    "split_pct": [0.2, 0.2, 0.6],
                    "mask_field_names": ["train_mask_user",
                                         "val_mask_user",
                                         "test_mask_user"]
                }
            ]
        }
    ],
    ......
}

The above configuration defines two classification tasks, one for the **movie** nodes and one for the
**user** nodes. Each node type has its own ``label_col`` and train/validation/test mask fields. You can
then follow the instructions in :ref:`Run graph construction<run-graph-construction>` to use the
GraphStorm construction tool to create the partitioned graph data.

Define multiple tasks for model training
.........................................

Now, you can specify the two training tasks by providing the `multi_task_learning` configurations in
the training configuration YAML file, as in the example below.

.. code-block:: yaml

---
version: 1.0
gsf:
  basic:
    ...
  multi_task_learning:
    - node_classification:
        target_ntype: "movie"
        label_field: "label_movie"
        mask_fields:
          - "train_mask_movie"
          - "val_mask_movie"
          - "test_mask_movie"
        num_classes: 10
        task_weight: 0.5
    - node_classification:
        target_ntype: "user"
        label_field: "label_user"
        mask_fields:
          - "train_mask_user"
          - "val_mask_user"
          - "test_mask_user"
        task_weight: 1.0
  ...

The above configuration defines one classification task for the **movie** node type and another one
for the **user** node type. The two node classification tasks use their own label names, i.e.,
`label_movie` and `label_user`, and their own train/validation/test mask fields. It also assigns a
higher weight to classification on **user** nodes (`task_weight: 1.0`) than to classification on
**movie** nodes (`task_weight: 0.5`), prioritizing the former during training.

Run multi-task model training
..............................

You can use the `graphstorm.run.gs_multi_task_learning` command to run multi-task learning tasks,
like the following example.

.. code-block:: bash

python -m graphstorm.run.gs_multi_task_learning \
    --workspace <PATH_TO_WORKSPACE> \
    --num-trainers 1 \
    --num-servers 1 \
    --part-config <PATH_TO_GRAPH_DATA> \
    --cf <PATH_TO_CONFIG>

Run multi-task model inference
...............................

For inference, you can use the same command line `graphstorm.run.gs_multi_task_learning` with an
additional argument `--inference`, as follows:

.. code-block:: bash

python -m graphstorm.run.gs_multi_task_learning \
    --inference \
    --workspace <PATH_TO_WORKSPACE> \
    --num-trainers 1 \
    --num-servers 1 \
    --part-config <PATH_TO_GRAPH_DATA> \
    --cf <PATH_TO_CONFIG> \
    --save-prediction-path <PATH_TO_OUTPUT>

The prediction results of each prediction task will be saved into different sub-directories under
``<PATH_TO_OUTPUT>``. The sub-directories are prefixed with `<task_type>_<node/edge_type>_<label_name>`.
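
For the two node classification tasks configured above, the output layout would look like the
following. The directory names here are illustrative, derived from the prefix scheme just described.

.. code-block:: text

    <PATH_TO_OUTPUT>/
        node_classification_movie_label_movie/
        node_classification_user_label_user/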

Using multi-target node type training (Not Recommended)
-------------------------------------------------------

You can also use GraphStorm's multi-target node types configuration, but this method is less
flexible than the multi-task learning method.

- Train on multiple node types: Users only need to edit ``target_ntype`` in the model config
  YAML file to minimize the objective function defined on multiple target node types. For example,
  by setting ``target_ntype`` as follows, we can jointly optimize the objective function defined
  on the "movie" and "user" node types.

.. code-block:: yaml

target_ntype:
- movie
- user

- During evaluation, users need to choose a single node type. For example, by setting
  ``eval_target_ntype: movie``, evaluation will only be performed on the "movie" node type. GraphStorm
  only supports evaluating on a single node type.

- Per target node type decoder: Users may also want to use a different decoder on each node type,
  where the output dimension of each decoder may be different. This can be achieved by setting ``num_classes``
  in the model config YAML file. For example, by setting ``num_classes`` as follows, GraphStorm will
  create a decoder with an output dimension of 3 for the movie node type, and a decoder with an output
  dimension of 7 for the user node type.

.. code-block:: yaml

num_classes:
movie: 3
user: 7

- Reweighting the loss function: Users may also want to reweight the loss function per node type,
  which can be achieved by setting ``multilabel``, ``multilabel_weights``, and
  ``imbalance_class_weights``. Examples are shown in the two snippets below. The current
  implementation does not support different ``multilabel`` settings across node types.

For multi-label classification, set ``multilabel`` to ``true`` and provide per-class weights via
``multilabel_weights``:

.. code-block:: yaml

    multilabel:
      movie: true
      user: true
    multilabel_weights:
      movie: 0.1,0.2,0.3
      user: 0.1,0.2,0.3,0.4,0.5,0.0

For single-label classification, set ``multilabel`` to ``false`` and provide per-class weights via
``imbalance_class_weights``:

.. code-block:: yaml

    multilabel:
      movie: false
      user: false
    imbalance_class_weights:
      movie: 0.1,0.2,0.3
      user: 0.1,0.2,0.3,0.4,0.5,0.0
14 changes: 8 additions & 6 deletions docs/source/advanced/multi-task-learning.rst
@@ -277,15 +277,15 @@ You can define an edge feature reconstruction task as the following example:
eval_metric:
- "mse"

In the configuration, `target_etype` defines the target edge type to which the edge feature
reconstruction learning will be applied. `reconstruct_efeat_name` defines the name of the feature to be
reconstructed. The other configs are the same as those of edge regression tasks.


Run Model Training
~~~~~~~~~~~~~~~~~~~
GraphStorm introduces a new command line `graphstorm.run.gs_multi_task_learning` to run multi-task
learning tasks. You can use the following command to start a multi-task training job:

.. code-block:: bash

@@ -298,7 +298,8 @@

Run Model Inference
~~~~~~~~~~~~~~~~~~~~
You can use the same command line `graphstorm.run.gs_multi_task_learning` with an additional
argument `--inference` to run inference, as follows:

.. code-block:: bash

@@ -312,7 +313,8 @@
--save-prediction-path <PATH_TO_OUTPUT>

The prediction results of each prediction task (node classification, node regression,
edge classification, and edge regression) will be saved into different sub-directories under <PATH_TO_OUTPUT>.
The sub-directories are prefixed with `<task_type>_<node/edge_type>_<label_name>`.

Run Model Training on SageMaker
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12 changes: 6 additions & 6 deletions docs/source/cli/model-training-inference/configuration-run.rst
@@ -397,14 +397,14 @@ General Configurations
- For link prediction tasks, the default value is ``mrr``.
- **gamma**: Set the value of the hyperparameter denoted by the symbol gamma. Gamma is used in the following cases: i/ focal loss for binary classification, ii/ DistMult score function for link prediction, iii/ TransE score function for link prediction, iv/ RotatE score function for link prediction, and v/ shrinkage loss for regression.

- Yaml: ``gamma: 2.0``
- Argument: ``--gamma 2.0``
- Default value: ``2.0`` in focal loss function; ``0.2`` in shrinkage loss function; ``12.0`` in ``distmult``, ``RotatE``, and ``TransE`` link prediction decoders.
- **alpha**: Set the value of the hyperparameter denoted by the symbol alpha. Alpha is used in the following cases: i/ focal loss for binary classification and ii/ shrinkage loss for regression.

- Yaml: ``alpha: 0.25``
- Argument: ``--alpha 0.25``
- Default value: ``0.25`` in focal loss function; ``10.0`` in shrinkage loss function.

Classification and Regression Task
```````````````````````````````````