[BUG] Backward of SDDMM (u_mul_v) produces nan values #40

Open
Liu-rj opened this issue Feb 7, 2023 · 0 comments

The backward passes of SpMM and SDDMM are supported in branch dev_spmm.
However, in the pass sampler (examples/pass), the backward of gs.ops.u_mul_v(subA, u_feats @ W_2, v_feats @ W_2), i.e. dX = gspmm(_gidx, "mul", "sum", Y, dZ, rev_format), produces nan values even though its inputs contain no nan values.
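
For comparison while debugging, here is a minimal dense reference of what that backward should compute, written in NumPy instead of the gs kernels. It assumes the graph is given as COO edge arrays `src` and `dst` and that "u" refers to the source end of each edge. The function names and the toy data are hypothetical and are not part of the gs API.

```python
import numpy as np

def u_mul_v_forward(src, dst, x, y):
    """Per-edge product: z[e] = x[src[e]] * y[dst[e]]."""
    return x[src] * y[dst]

def u_mul_v_backward_x(src, dst, y, dz, num_src):
    """Reference dX: for each source node, sum y[dst[e]] * dz[e] over its edges.

    This mirrors a "mul" message followed by a "sum" reduction on the
    reversed graph. The accumulator is zero-initialized, so a source node
    with no edges keeps a gradient of 0.
    """
    dx = np.zeros((num_src,) + dz.shape[1:], dtype=dz.dtype)
    np.add.at(dx, src, y[dst] * dz)  # unbuffered scatter-add, handles repeated src
    return dx

# Toy graph (assumed data): 3 source nodes, 2 destination nodes, 4 edges.
src = np.array([0, 0, 1, 2])
dst = np.array([0, 1, 1, 0])
x = np.arange(6, dtype=np.float64).reshape(3, 2)
y = np.array([[1.0, 2.0], [3.0, 4.0]])

z = u_mul_v_forward(src, dst, x, y)
dz = np.ones_like(z)  # upstream gradient of sum(z)
dx = u_mul_v_backward_x(src, dst, y, dz, num_src=3)

# With finite y and dz, every entry of the reference dX is finite.
print(np.isfinite(dx).all())
```

Feeding the same Y and dZ that reach the failing gspmm call into a reference like this would show whether the nan values come from the inputs or from the kernel itself.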

To reproduce:

$ git checkout origin/dev_spmm
$ # build and install the project
$ cd examples/pass
$ python train_minibatch.py
Namespace(device='cuda', use_uva=None, dataset='reddit', batchsize=512, samples='10,10', num_workers=0)
Graph(num_nodes=232965, num_edges=114848857,
      ndata_schemes={}
      edata_schemes={})
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20230207 03:35:00.528640 18177 graph.cc:19] Loaded CSC with 232965 nodes and 114848857 edges
Check load successfully: [None, None, tensor([1., 1., 1.,  ..., 1., 1., 1.], device='cuda:0'), tensor([        0,      2205,      2360,  ..., 114848225, 114848365,
        114848857], device='cuda:0'), tensor([225202, 177307, 107546,  ..., 232594, 232634, 232964], device='cuda:0')] 

memory allocated before training: 2.2396583557128906 GB
  0%|                                                                                             | 0/300 [00:00<?, ?it/s]
/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py:148: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():
/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py:156: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():
/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in GSDDMMBackward. Traceback of forward call that caused the error:
  File "/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py", line 247, in <module>
    train(dataset, args)
  File "/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py", line 125, in train
    input_nodes, output_nodes, blocks, loss_tuple = compiled_func(
  File "/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py", line 33, in matrix_sampler
    att2 = torch.sum(gs.ops.u_mul_v(subA, u_feats @ W_2,
  File "/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/gs-0.1-py3.9.egg/gs/ops/sddmm.py", line 115, in func
    return gsddmm(g, binary_op, x, y,
  File "/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/gs-0.1-py3.9.egg/gs/ops/sddmm.py", line 72, in gsddmm
    return gsddmm_internal(
  File "/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/gs-0.1-py3.9.egg/gs/ops/sparse.py", line 286, in gsddmm
    return GSDDMM.apply(gidx, op, lhs_data, rhs_data, lhs_target, rhs_target, on_format)
 (Triggered internally at  /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/autograd/python_anomaly_mode.cpp:102.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  0%|                                                                                             | 0/300 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py", line 247, in <module>
    train(dataset, args)
  File "/home/ubuntu/aws_projects/graph_sampling/examples/pass/train_minibatch.py", line 157, in train
    sample_loss.backward()
  File "/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/anaconda3/envs/dgl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'GSDDMMBackward' returned nan values in its 0th output.