[Doc] Add Imbalance label guide and reorg #1176

Merged: 33 commits, Feb 20, 2025

Changes from 20 commits

Commits
7ea7a41
init adv imbalance doc
Feb 14, 2025
6b4e513
1st version
Feb 15, 2025
7c0c549
1st version
Feb 15, 2025
a99abde
reorg advanced topic
Feb 17, 2025
6bac3c6
enhance index page
Feb 17, 2025
f7decc2
add examples in imbalance
Feb 17, 2025
f13130f
refine
Feb 18, 2025
84bbd86
break lines
Feb 18, 2025
14916ab
Update docs/source/advanced/imbalanced-labels.rst
zhjwy9343 Feb 18, 2025
ba3f2af
Update docs/source/advanced/imbalanced-labels.rst
zhjwy9343 Feb 18, 2025
bc095eb
Update docs/source/advanced/imbalanced-labels.rst
zhjwy9343 Feb 18, 2025
cf1a394
Update docs/source/advanced/multi-target-ntypes.rst
zhjwy9343 Feb 18, 2025
bc383f5
Update docs/source/advanced/imbalanced-labels.rst
zhjwy9343 Feb 18, 2025
c1728eb
Update docs/source/advanced/imbalanced-labels.rst
zhjwy9343 Feb 18, 2025
04bb099
Update docs/source/advanced/imbalanced-labels.rst
zhjwy9343 Feb 18, 2025
42ff53a
Update docs/source/advanced/imbalanced-labels.rst
zhjwy9343 Feb 18, 2025
bfacd07
Update docs/source/index.rst
zhjwy9343 Feb 18, 2025
f324aaf
Update docs/source/index.rst
zhjwy9343 Feb 18, 2025
a155edb
Update docs/source/advanced/imbalanced-labels.rst
zhjwy9343 Feb 18, 2025
1324d6a
change contents
Feb 18, 2025
ae449eb
rewrite multiple target node
Feb 19, 2025
354b282
rewrite multiple target node
Feb 19, 2025
e566761
Update docs/source/advanced/multi-target-ntypes.rst
zhjwy9343 Feb 19, 2025
9687713
Update docs/source/advanced/multi-target-ntypes.rst
zhjwy9343 Feb 19, 2025
c7c2de8
Update docs/source/advanced/multi-target-ntypes.rst
zhjwy9343 Feb 19, 2025
f90e323
Update docs/source/advanced/multi-target-ntypes.rst
zhjwy9343 Feb 19, 2025
0c0ef4b
Update docs/source/advanced/multi-target-ntypes.rst
zhjwy9343 Feb 19, 2025
3e185a4
Update docs/source/advanced/multi-target-ntypes.rst
zhjwy9343 Feb 19, 2025
7254206
Update docs/source/advanced/multi-target-ntypes.rst
zhjwy9343 Feb 19, 2025
e0408d5
Update docs/source/advanced/multi-target-ntypes.rst
zhjwy9343 Feb 19, 2025
91ef484
Update docs/source/advanced/multi-target-ntypes.rst
zhjwy9343 Feb 19, 2025
49c5e68
update multi-target
Feb 19, 2025
b30450b
add default values to alpha and gamma
Feb 19, 2025
73 changes: 73 additions & 0 deletions docs/source/advanced/imbalanced-labels.rst
@@ -0,0 +1,73 @@
.. _imbalanced_labels:

Deal with Imbalanced Labels in Classification/Regression
=========================================================

In some cases, class labels are imbalanced, i.e., some classes have far more data points than others.
For example, most fraud detection tasks have only a small number of fraudulent activities (positive
labels) versus a huge number of legitimate activities (negative labels). Even regression tasks can
suffer from label imbalance when a few dominant values make up most of the labels. If not handled
properly, imbalanced labels can severely degrade classification/regression model performance. For
example, a model fit on overwhelmingly negative labels may learn to classify all unseen samples as
negative. GraphStorm provides several ways to tackle the class imbalance problem.

For classification tasks, users can configure two arguments in the command line interfaces (CLIs):
``imbalance_class_weights`` and ``class_loss_func``.

The ``imbalance_class_weights`` argument lets users assign a scale weight to each class, forcing
models to focus more on the classes with higher weights. For example, if there are 10 positive labels
versus 90 negative labels, you can set ``imbalance_class_weights`` to ``0.1, 0.9``, meaning class 0
(usually negative labels) has weight ``0.1`` and class 1 (usually positive labels) has weight ``0.9``.
This places more importance on correctly classifying positive samples and less on negative ones.
Below is an example of how to set ``imbalance_class_weights`` in a YAML configuration file.

.. code-block:: yaml

    imbalance_class_weights: 0.1,0.9
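
To see what these class weights do during training, below is a minimal PyTorch sketch of a weighted
cross-entropy loss. The tensor names and shapes are illustrative assumptions, not GraphStorm
internals.

.. code-block:: python

    import torch
    import torch.nn.functional as F

    # Weights mirroring ``imbalance_class_weights: 0.1,0.9`` (illustrative).
    class_weights = torch.tensor([0.1, 0.9])

    logits = torch.randn(8, 2)          # model outputs: 8 samples, 2 classes
    labels = torch.randint(0, 2, (8,))  # ground-truth class indices

    # Errors on class 1 (positive) now contribute 9x more to the loss
    # than errors on class 0 (negative).
    loss = F.cross_entropy(logits, labels, weight=class_weights)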

You can also set ``focal`` as the value of the ``class_loss_func`` configuration, which applies the
`focal loss function <https://arxiv.org/abs/1708.02002>`_ in binary classification tasks. The focal
loss function is designed for imbalanced classes. Its formula is
:math:`loss(p_t) = -\alpha_t(1-p_t)^{\gamma}\log(p_t)`, where :math:`p_t=p` if :math:`y=1`,
otherwise :math:`p_t = 1-p`. Here :math:`p` is the predicted probability of the positive class in
binary classification. This function has two hyperparameters, :math:`\alpha` and :math:`\gamma`,
corresponding to the ``alpha`` and ``gamma`` configurations in GraphStorm. Larger values of ``gamma``
put more of the training focus on hard cases, which helps detect more positive samples when the
positive-to-negative ratio is small. There is no clear guideline for values of ``alpha``. You can use
its default value (``0.25``) first, and then search for optimal values. Below is an example of how to
set the `focal loss function` in a YAML configuration file.

.. code-block:: yaml

    class_loss_func: focal

    gamma: 10.0
    alpha: 0.5
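
For intuition, here is a minimal PyTorch sketch of the focal loss formula above; it illustrates the
math only and is not GraphStorm's internal implementation.

.. code-block:: python

    import torch

    def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
        # p: predicted positive-class probabilities; y: binary (0/1) labels.
        # p_t = p if y == 1, otherwise 1 - p, as in the formula above.
        p_t = torch.where(y == 1, p, 1 - p).clamp(min=1e-7)
        # Common convention: weight positives by alpha, negatives by 1 - alpha.
        alpha_t = torch.where(y == 1, torch.full_like(p, alpha),
                              torch.full_like(p, 1 - alpha))
        return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t)).mean()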

Apart from focal loss and class weights, you can also output classification results as probabilities
of the positive and negative classes by setting the ``return_proba`` configuration to ``true``. By
default, GraphStorm outputs classification results as argmax values, e.g., either 0s or 1s in binary
tasks, which is equivalent to using ``0.5`` as the threshold for separating positive from negative
samples. With probabilities as outputs, you can apply different thresholds to achieve the desired
trade-off. For example, if you need higher recall to catch more suspicious positive samples, a smaller
threshold, e.g., ``0.25``, will classify more samples as positive. You can also use methods like the
`ROC curve` or `Precision-Recall curve` to determine the optimal threshold. Below is an example of how
to set ``return_proba`` in a YAML configuration file.

.. code-block:: yaml

    return_proba: true
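
As a sketch of how a threshold could then be chosen from the saved probabilities (assuming
scikit-learn is available; the arrays below are illustrative, not GraphStorm outputs):

.. code-block:: python

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    proba = np.array([0.9, 0.6, 0.3, 0.2, 0.05])  # positive-class probabilities
    labels = np.array([1, 1, 1, 0, 0])            # ground-truth labels

    # A threshold lower than the default 0.5 recalls more positives.
    preds = (proba >= 0.25).astype(int)

    # Or pick the threshold that maximizes F1 on a validation set.
    precision, recall, thresholds = precision_recall_curve(labels, proba)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    best_threshold = thresholds[np.argmax(f1[:-1])]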

For regression tasks whose labels contain dominant values, e.g., 0s, GraphStorm provides the
`shrinkage loss function <https://openaccess.thecvf.com/content_ECCV_2018/html/Xiankai_Lu_Deep_Regression_Tracking_ECCV_2018_paper.html>`_,
which can be enabled by setting ``shrinkage`` as the value of the ``regression_loss_func``
configuration. Its formula is :math:`loss = l^2/(1 + \exp \left( \alpha \cdot (\gamma - l)\right))`,
where :math:`l` is the absolute difference between predictions and labels. The shrinkage loss function
also has the :math:`\alpha` and :math:`\gamma` hyperparameters, and you can use the same ``alpha`` and
``gamma`` configurations as for the focal loss function to modify their values. The shrinkage loss
penalizes the importance of easy samples (where :math:`l < 0.5`) and keeps the loss of hard samples
unchanged. Below is an example of how to set the `shrinkage loss function` in a YAML configuration
file.

.. code-block:: yaml

    regression_loss_func: shrinkage

    gamma: 0.2
    alpha: 5
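
A minimal PyTorch sketch of the shrinkage loss formula above (for illustration only, not GraphStorm's
internal code):

.. code-block:: python

    import torch

    def shrinkage_loss(pred, target, alpha=5.0, gamma=0.2):
        # l is the absolute difference between predictions and labels.
        l = (pred - target).abs()
        # loss = l^2 / (1 + exp(alpha * (gamma - l)))
        return (l ** 2 / (1 + torch.exp(alpha * (gamma - l)))).mean()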
@@ -1,12 +1,9 @@
-.. _advanced_usages:
+.. _multi_target_ntypes:

-GraphStorm Advanced Usages
-===========================

Multiple Target Node Types Training
--------------------------------------
+===================================

-When training on a hetergenious graph, we often need to train a model by minimizing the objective function on more than one node type. GraphStorm provides supports to achieve this goal.
+When training on a heterogeneous graph, we often need to train a model by minimizing the objective function on more than one node type. GraphStorm provides support to achieve this goal.

- Train on multiple node types: Users only need to edit ``target_ntype`` in the model config YAML file to minimize the objective function defined on multiple target node types. For example, by setting ``target_ntype`` as follows, we can jointly optimize the objective function defined on "movie" and "user" node types.

12 changes: 6 additions & 6 deletions docs/source/cli/model-training-inference/configuration-run.rst
@@ -397,14 +397,14 @@ General Configurations
- For link prediction tasks, the default value is ``mrr``.
- **gamma**: Set the value of the hyperparameter denoted by the symbol gamma. Gamma is used in the following cases: i/ focal loss for binary classification, ii/ DistMult score function for link prediction, iii/ TransE score function for link prediction, iv/ RotatE score function for link prediction, v/ shrinkage loss for regression.

-  - Yaml: ``gamma: 10.0``
-  - Argument: ``--gamma 10.0``
-  - Default value: None
+  - Yaml: ``gamma: 2.0``
+  - Argument: ``--gamma 2.0``
+  - Default value: ``2``
- **alpha**: Set the value of the hyperparameter denoted by the symbol alpha. Alpha is used in the following cases: i/ focal loss for binary classification and ii/ shrinkage loss for regression.

-  - Yaml: ``alpha: 10.0``
-  - Argument: ``--alpha 10.0``
-  - Default value: None
+  - Yaml: ``alpha: 0.25``
+  - Argument: ``--alpha 0.25``
+  - Default value: ``0.25``

Classification and Regression Task
```````````````````````````````````
17 changes: 10 additions & 7 deletions docs/source/index.rst
@@ -35,7 +35,7 @@ Welcome to the GraphStorm Documentation and Tutorials

.. toctree::
   :maxdepth: 2
-   :caption: Advanced Topics
+   :caption: Practical & Advanced Guides
   :hidden:
   :glob:

@@ -44,11 +44,12 @@ Welcome to the GraphStorm Documentation and Tutorials
   advanced/link-prediction
   advanced/advanced-wholegraph
   advanced/multi-task-learning
-   advanced/advanced-usages
   advanced/using-graphbolt
+   advanced/multi-target-ntypes
+   advanced/imbalanced-labels
   advanced/gsprocessing-emr-ec2

-GraphStorm is a graph machine learning (GML) framework designed for enterprise use cases. It simplifies the development, training and deployment of GML models on industry-scale graphs (measured in billons of nodes and edges) by providing scalable training and inference pipelines of GML models. GraphStorm comes with a collection of built-in GML models, allowing users to train a GML model with a single command, eliminating the need to write any code. Moreover, GraphStorm provides a wide range of configurations to customiz model implementations and training pipelines, enhancing model performance. In addition, GraphStorm offers a programming interface that enables users to train custom GML models in a distributed manner. Users can bring their own model implementations and leverage the GraphStorm training pipeline for scalability.
+GraphStorm is a graph machine learning (GML) framework designed for enterprise use cases. It simplifies the development, training and deployment of GML models on industry-scale graphs (measured in billions of nodes and edges) by providing scalable training and inference pipelines of GML models. GraphStorm comes with a collection of built-in GML models, allowing users to train a GML model with a single command, eliminating the need to write any code. Moreover, GraphStorm provides a wide range of configurations to customize model implementations and training pipelines, enhancing model performance. In addition, GraphStorm offers a programming interface that enables users to train custom GML models in a distributed manner. Users can bring their own model implementations and leverage the GraphStorm training pipeline for scalability.

Getting Started
----------------
@@ -83,16 +84,18 @@ The released GraphStorm APIs list the major components that can help users to de

To help users use these APIs, GraphStorm also released a set of Jupyter notebooks at :ref:`GraphStorm API Programming Example Notebooks<programming-examples>`. By running these notebooks, users can explore some APIs, learn how to use APIs to reproduce CLIs pipelines, and then customize GraphStorm components for specific requirements.

-Users can find the comprehensive descriptions of these GraphStorm APIs in the :ref:`API Reference<api-reference>` documentations. For unrelease APIs, we encourage users to read their source code. If users want to have more APIs formally released, please raise issues at the `GraphStorm GitHub Repository <https://github.com/awslabs/graphstorm/issues>`_.
+Users can find the comprehensive descriptions of these GraphStorm APIs in the :ref:`API Reference<api-reference>` documentation. For unreleased APIs, we encourage users to read their source code. If users want to have more APIs formally released, please raise issues at the `GraphStorm GitHub Repository <https://github.com/awslabs/graphstorm/issues>`_.

-Advanced Topics
-----------------
+Practical and Advanced Guides
+------------------------------

- For users who want to use their own GML models in GraphStorm, follow the :ref:`Use Your Own GNN Models<use-own-models>` tutorial to learn the programming interfaces and the steps of how to modify users' own models.
- For users who want to leverage language models on nodes with text features, follow the :ref:`Use Language Model in GraphStorm<language_models>` tutorial to learn how to leverage BERT models to use text as node features in GraphStorm.
- There are various ways to use GraphStorm to both speed up the training process and boost model performance for link prediction tasks. Users can find these usages in the :ref:`Link Prediction Learning in GraphStorm<link_prediction_usage>` page.
- The GraphStorm team has been working with NVIDIA to integrate NVIDIA's WholeGraph library into GraphStorm to speed up feature copy. Users can follow the :ref:`Use WholeGraph in GraphStorm<advanced_wholegraph>` tutorial for more details.
-- In v0.3, GraphStorm releases an experimental feature to support multi-task learning on the same graph, allowing users to define multiple training targets on different nodes and edges within a single training loop. Users can check the :ref:`Multi-task Learning in GraphStorm<multi_task_learning>` tutorial to know more details.
+- Since v0.3, GraphStorm supports multi-task learning on the same graph, allowing users to define multiple training targets on different nodes and edges within a single training loop. Users can check the :ref:`Multi-task Learning in GraphStorm<multi_task_learning>` tutorial for more details.
+- Since v0.4, GraphStorm supports GraphBolt stochastic training. GraphBolt is a new data loading module for DGL that enables faster and more efficient graph sampling, potentially leading to significant efficiency benefits. For details on using GraphBolt in GraphStorm, follow the :ref:`Using GraphBolt to speed up training and inference<using-graphbolt-ref>` guide.
+- For frequently asked questions, there are several guides. The :ref:`Multiple Target Node Types Training<multi_target_ntypes>` document explains how to train with multiple target node types. The :ref:`Deal with Imbalanced Labels in Classification/Regression<imbalanced_labels>` guide lists several built-in features that can help tackle the challenge of imbalanced labels. If users want to use their own AWS EMR clusters for graph processing, the :ref:`Running distributed graph processing on customized EMR-on-EC2 clusters<gsprocessing_emr_ec2_customized_clusters>` guide provides more details.

Contribution
-------------