Trouble with running on GPU #15

Open
pflashgary opened this issue Apr 9, 2021 · 5 comments
Comments

@pflashgary

pflashgary commented Apr 9, 2021

Hi there,
Thanks for making this work publicly available.
I managed to run your code on my own dataset on the CPU, but running it on the GPU fails with a simple_bind error. For what it's worth, I'm running this on an EC2 GPU instance with the Deep Learning AMI (the mxnet_p36 environment, i.e. Python 3.6, and CUDA 10.0). Have you seen this issue before, and do you have any clues?

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1832, in simple_bind
    ctypes.byref(exe_handle)))
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: _Map_base::at

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "sst_main.py", line 432, in <module>
    train(args)
  File "sst_main.py", line 330, in train
    factory=sst_nowcasting, context=args.ctx
  File "/home/ubuntu/STS-ConvLSTM/nowcasting/encoder_forecaster.py", line 575, in encoder_forecaster_build_networks
    shared_module=shared_encoder_net,
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/module.py", line 429, in bind
    state_names=self._state_names)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/executor_group.py", line 280, in __init__
    self.bind_exec(data_shapes, label_shapes, shared_group)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/executor_group.py", line 384, in bind_exec
    shared_group))
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/executor_group.py", line 678, in _bind_ith_exec
    shared_buffer=shared_data_arrays, **input_shapes)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1838, in simple_bind
    raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (5, 4, 1, 480, 480)
ebrnn1_begin_state_h: (4, 64, 96, 96)
ebrnn2_begin_state_h: (4, 192, 32, 32)
ebrnn3_begin_state_h: (4, 192, 16, 16)
_Map_base::at
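
One way to narrow this down is to check whether `simple_bind` works at all on the GPU for a trivial network, independent of the encoder-forecaster model. The sketch below is only an illustration under stated assumptions: the function name `probe_simple_bind` and the tiny fully connected network are hypothetical and do not come from this repository, and the `simple_bind` call follows the MXNet 1.x symbolic API. If the probe succeeds on the GPU, the problem is more likely specific to the model's symbols; if it fails, the MXNet/CUDA installation is the more likely suspect. Neither outcome is conclusive on its own.

```python
import importlib.util


def probe_simple_bind(use_gpu=True):
    """Try to bind and run a tiny symbol; return a short status string."""
    # Skip cleanly when MXNet is not installed in this environment.
    if importlib.util.find_spec("mxnet") is None:
        return "mxnet is not installed"

    try:
        import mxnet as mx

        ctx = mx.gpu(0) if use_gpu else mx.cpu()
        # Hypothetical one-layer network, used only as a smoke test.
        data = mx.sym.Variable("data")
        net = mx.sym.FullyConnected(data=data, num_hidden=4, name="fc")
        # The same call that fails in the traceback, on a trivial graph.
        exe = net.simple_bind(ctx=ctx, data=(2, 8))
        exe.forward(is_train=False, data=mx.nd.ones((2, 8), ctx=ctx))
        # Force the asynchronous computation to finish so errors surface here.
        exe.outputs[0].wait_to_read()
    except Exception as err:  # broad on purpose: this is only a diagnostic
        return "failed: {}".format(err)
    return "ok on {}".format(ctx)


if __name__ == "__main__":
    print("CPU:", probe_simple_bind(use_gpu=False))
    print("GPU:", probe_simple_bind(use_gpu=True))
```

Running the probe on both contexts gives a quick side-by-side comparison that can be pasted into the issue.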
@sxjscience
Owner

This seems to be related to the latest MXNet.

@sxjscience
Owner

@pflashgary Thanks for the question. I haven't run the source code for a while, and the bug seems to be related to MXNet. Which version of MXNet are you currently using?
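
A minimal sketch for collecting the version details asked about here. The helper name `collect_env_info` is hypothetical, and the script assumes that `nvcc` is on the `PATH` when a CUDA toolkit is installed; it reports `None` for anything it cannot determine rather than failing.

```python
import importlib.util
import platform
import shutil
import subprocess


def collect_env_info():
    """Gather version details that are useful when reporting a GPU issue."""
    info = {
        "python": platform.python_version(),
        "mxnet": None,
        "num_gpus": None,
        "nvcc": None,
    }

    # Only touch MXNet if it is installed; tolerate import or driver errors.
    if importlib.util.find_spec("mxnet") is not None:
        try:
            import mxnet as mx

            info["mxnet"] = mx.__version__
            # num_gpus() may be missing on old releases, hence the guards.
            ctx_mod = getattr(mx, "context", None)
            num_gpus = getattr(ctx_mod, "num_gpus", None)
            if num_gpus is not None:
                info["num_gpus"] = num_gpus()
        except Exception:
            pass  # leave the remaining fields as None

    # `nvcc --version` prints the CUDA toolkit release, if nvcc is available.
    if shutil.which("nvcc") is not None:
        result = subprocess.run(
            ["nvcc", "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        for line in result.stdout.splitlines():
            if "release" in line:
                info["nvcc"] = line.strip()
                break

    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print("{}: {}".format(key, value))
```

The printed MXNet version string is the piece of information requested in this comment; the environment name on its own does not identify the installed release.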

@pflashgary
Author

Hi Xingjian, thanks for getting back to me. I'm using the mxnet_p36 environment from the Deep Learning AMI with CUDA 10.0. Could you tell me which versions of MXNet and CUDA you used, so that I can compare?

@sulisetyowidodo

Hi
Does this problem have a solution?

Thanks

@sxjscience
Owner

@pflashgary and @sulisetyowidodo I currently do not have the bandwidth to check which version of MXNet works with the latest CUDA.

We have switched development to PyTorch, and you may want to check our latest work, Earthformer: https://github.com/amazon-science/earth-forecasting-transformer


3 participants