[BUG] Backward of SDDMM (u_mul_v) produces nan values #40

Liu-rj · 2023-02-07T03:51:32Z

Backward of SpMM and SDDMM is supported in branch dev_spmm.
However, in the pass sampler, the backward of gs.ops.u_mul_v(subA, u_feats @ W_2, v_feats @ W_2), i.e. dX = gspmm(_gidx, "mul", "sum", Y, dZ, rev_format), produces nan values even though its inputs contain no nan values.
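For reference, here is a minimal pure-Python sketch of what I expect that backward to compute, assuming scalar features per node and a plain edge list. The helper names (`u_mul_v`, `u_mul_v_backward_x`, `first_nan`) and the toy graph are illustrative only and are not the gs implementation. It may help when comparing the kernel output against a known-good result on a small graph.

```python
import math

def u_mul_v(src, dst, x, y):
    """Forward: one value per edge, z[e] = x[src[e]] * y[dst[e]]."""
    return [x[u] * y[v] for u, v in zip(src, dst)]

def u_mul_v_backward_x(src, dst, y, dz, num_src):
    """Gradient w.r.t. x ("mul" then "sum" over the reversed graph):
    dx[u] = sum over edges e leaving u of y[dst[e]] * dz[e]."""
    dx = [0.0] * num_src  # nodes without edges must end up with a zero gradient
    for u, v, g in zip(src, dst, dz):
        dx[u] += y[v] * g
    return dx

def first_nan(values):
    """Return the index of the first nan in values, or None if there is none."""
    for i, val in enumerate(values):
        if math.isnan(val):
            return i
    return None

# Hypothetical toy graph: 3 source nodes, 2 destination nodes, 4 edges.
# Source node 2 has no outgoing edge.
src = [0, 0, 1, 1]
dst = [0, 1, 0, 1]
x = [1.0, 2.0, 3.0]
y = [0.5, -1.0]
dz = [1.0, 1.0, 2.0, 0.0]  # upstream gradient, one value per edge

z = u_mul_v(src, dst, x, y)
dx = u_mul_v_backward_x(src, dst, y, dz, num_src=len(x))
print(z, dx, first_nan(dx))
```

With finite Y and dZ, every entry of dX in this reference is finite, including the entry for a node that has no edges, so a nan in the real output should come from the kernel rather than from the math.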

To reproduce:

$ git checkout origin/dev_spmm
$ # build and install the project
$ cd examples/pass
$ python train_minibatch.py
Namespace(device='cuda', use_uva=None, dataset='reddit', batchsize=512, samples='10,10', num_workers=0)
Graph(num_nodes=232965, num_edges=114848857,
      ndata_schemes={}
      edata_schemes={})
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20230207 03:35:00.528640 18177 graph.cc:19] Loaded CSC with 232965 nodes and 114848857 edges
Check load successfully: [None, None, tensor([1., 1., 1.,  ..., 1., 1., 1.], device='cuda:0'), tensor([        0,      2205,      2360,  ..., 114848225, 114848365,
        114848857], device='cuda:0'), tensor([225202, 177307, 107546,  ..., 232594, 232634, 232964], device='cuda:0')] 

memory allocated before training: 2.2396583557128906 GB
  0%|                                                                                             | 0/300 [00:00<?, ?it/s]
/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py:148: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():
/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py:156: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():
/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in GSDDMMBackward. Traceback of forward call that caused the error:
  File "/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py", line 247, in <module>
    train(dataset, args)
  File "/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py", line 125, in train
    input_nodes, output_nodes, blocks, loss_tuple = compiled_func(
  File "/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py", line 33, in matrix_sampler
    att2 = torch.sum(gs.ops.u_mul_v(subA, u_feats @ W_2,
  File "/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/gs-0.1-py3.9.egg/gs/ops/sddmm.py", line 115, in func
    return gsddmm(g, binary_op, x, y,
  File "/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/gs-0.1-py3.9.egg/gs/ops/sddmm.py", line 72, in gsddmm
    return gsddmm_internal(
  File "/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/gs-0.1-py3.9.egg/gs/ops/sparse.py", line 286, in gsddmm
    return GSDDMM.apply(gidx, op, lhs_data, rhs_data, lhs_target, rhs_target, on_format)
 (Triggered internally at  /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/autograd/python_anomaly_mode.cpp:102.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  0%|                                                                                             | 0/300 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py", line 247, in <module>
    train(dataset, args)
  File "/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py", line 157, in train
    sample_loss.backward()
  File "/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'GSDDMMBackward' returned nan values in its 0th output.
