[BUG] SyncDataCollector Crashes with Resources Leak During Data Collection #2644
Comments
Maybe there is a common/root cause between #2614 and this issue |
Maybe it's related to my disk storage being too small? I'm storing stacked frames (4, 3, 84, 84) into my replay buffer, which uses |
I have run into a similar problem before. When I was using TorchRL with IsaacLab, I would have training runs die midway through when using |
Is this what you were running into @AlexandreBrown?

rl/torchrl/collectors/collectors.py Lines 1083 to 1090 in f5a187d |
I think either would work! There shouldn't be a case where one would use multiprocessed collectors with IsaacSim. |
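For context, here is a minimal sketch of the single-process setup implied above, assuming a user-provided `make_batched_gpu_env` factory for a GPU-vectorized (IsaacSim/ManiSkill-style) env and a `policy` module on the same device; since the batching lives inside the simulator, a plain SyncDataCollector is used rather than a multiprocessed collector:

```python
# Minimal sketch, assuming `make_batched_gpu_env` returns a TorchRL env that is
# already vectorized on the GPU and `policy` is a module living on cuda:0.
from torchrl.collectors import SyncDataCollector

collector = SyncDataCollector(
    make_batched_gpu_env,      # env factory (or an env instance)
    policy,
    frames_per_batch=4096,
    total_frames=1_000_000,
    device="cuda:0",
)
for batch in collector:
    ...  # feed the training loop / replay buffer
collector.shutdown()
```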
I haven't tried the fix but I'm not opposed to having the option. |
This option would be great. We are using NVIDIA Warp in our custom environment and had synchronisation problems because |
@vmoens Any updates on this? |
The crash occurred again today when training a policy on Maniskill3 (which uses PhysX just like Isaac, which might be why I hit the same issue as @yu-fz). |
I can confirm that the bug can still occur even with @yu-fz's SyncDataCollectorWrapper.

/home/mila/b/xxx/.conda/envs/maniskill3_env/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Killed

Maybe the issue is |
Update: It does not crash with LazyTensorStorage + SyncDataCollector Wrapper. I will try with LazyTensorStorage + the official SyncDataCollector. I suspect the issue happens when I train on a VM with too little free disk space (required by MemmapTensorStorage). |
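For reference, a minimal sketch of the two storage back-ends being compared, with illustrative capacity and scratch_dir values (the disk-backed memmap storage is called LazyMemmapStorage in TorchRL); only the memmap variant depends on free disk space:

```python
# Sketch with illustrative sizes: RAM-backed vs disk-backed replay buffer storage.
import torch
from torchrl.data import LazyMemmapStorage, LazyTensorStorage, ReplayBuffer

capacity = 100_000  # illustrative

# Keeps samples in RAM: no disk space required.
rb_ram = ReplayBuffer(storage=LazyTensorStorage(capacity))

# Memory-mapped storage: samples live in files on disk, so free disk space
# (and a writable scratch_dir) becomes a hard requirement.
rb_disk = ReplayBuffer(storage=LazyMemmapStorage(capacity, scratch_dir="/tmp/rb"))

# Both are filled the same way, e.g. with stacked frames of shape (4, 3, 84, 84):
frames = torch.zeros(32, 4, 3, 84, 84)
rb_ram.extend(frames)
rb_disk.extend(frames)
```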
Thanks so much for looking into it @AlexandreBrown! I guess you're only seeing that when using a replay buffer within the collector then? I will add an arg in SyncDataCollector to bypass the CUDA syncs; that should be sufficient to avoid subclassing, right @yu-fz and @fyu-bdai? See #2727 for a solution. |
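To illustrate the kind of API being proposed (the argument name below is an assumption, and `make_batched_gpu_env`/`policy` are placeholders; #2727 has the actual implementation), the option would presumably be passed at construction time:

```python
# Hypothetical sketch: the exact flag name is assumed; see #2727 for the real API.
from torchrl.collectors import SyncDataCollector

collector = SyncDataCollector(
    make_batched_gpu_env,    # assumed env factory
    policy,                  # assumed policy module
    frames_per_batch=4096,
    total_frames=1_000_000,
    device="cuda:0",
    no_cuda_sync=True,       # assumed name: skip the internal torch.cuda.synchronize() calls
)
```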
Update: When using the official SyncDataCollector I get:

File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 3289, in cpu
return self.to("cpu", **kwargs)
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10642, in to
self._sync_all()
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10739, in _sync_all
torch.cuda.synchronize()
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 954, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

And:

[2025-01-28 07:53:15.039] [SAPIEN] [critical] Mem free failed with error code 700!
[2025-01-28 07:53:15.039] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-28 07:53:15.039] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-28 07:53:15.039] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-28 07:53:15.039] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-28 07:53:15.040] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
CUDA error at /__w/SAPIEN/SAPIEN/3rd_party/sapien-vulkan-2/src/core/buffer.cpp 103: an illegal memory access was encountered

This suggests that the SyncDataCollector's CUDA memory management might cause issues in environments like Isaac Lab and/or Maniskill3. |
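As the error message itself suggests, one way to get a trustworthy stack trace for this kind of asynchronous CUDA failure is to force synchronous kernel launches. A minimal sketch (the variable must be set before CUDA is initialized, in practice before importing torch):

```python
# Debugging sketch: make CUDA kernel launches synchronous so the reported
# stack trace points at the call that actually faulted.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # noqa: E402  (imported after setting the variable on purpose)
```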
I updated the docstrings of

So here's the gist:
So I'm going to fall back on asking you guys: If |
Yup! That will be sufficient. |
This is pretty major, I encountered this today during evaluation data collection.

for _ in tqdm(range(nb_iters), "Evaluation"):
    rollouts = self.eval_env.rollout(
        max_steps=self.env_max_frames_per_traj,
        policy=policy,
        auto_reset=False,
        auto_cast_to_device=True,
        tensordict=tensordict,
    ).cpu(non_blocking=False)

It seems like the sync is unavoidable if skipping it leads to corrupted data that crashes training. This makes CUDA support for Maniskill impossible for me right now... |
Sorry @AlexandreBrown I don't really get it: does the new feature introduce more bugs? I don't understand what the problem is precisely |
Hi @vmoens, no I don't think the new feature introduces more bugs, or at least that's not what I encountered.

for _ in tqdm(range(nb_iters), "Evaluation"):
    rollouts = self.eval_env.rollout(
        max_steps=self.env_max_frames_per_traj,
        policy=policy,
        auto_reset=False,
        auto_cast_to_device=False,
        tensordict=tensordict,
    ).to(device="cpu", non_blocking=False)

Stacktrace:

Traceback (most recent call last):
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
cli.main()
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
run()
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
runpy.run_path(target, run_name="__main__")
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
_run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
exec(code, run_globals)
File "scripts/train_rl.py", line 118, in <module>
main()
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "scripts/train_rl.py", line 107, in main
trainer.train()
File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/trainers/rl_trainer.py", line 90, in train
eval_metrics = self.evaluator.evaluate(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 147, in evaluate
eval_metrics = self.log_eval_metrics(agent, env_step)
File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 158, in log_eval_metrics
eval_metrics = self.gather_eval_rollouts_metrics(policy)
File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 171, in gather_eval_rollouts_metrics
rollouts = self.eval_env.rollout(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10623, in to
tensors = [to(t) for t in tensors]
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10623, in <listcomp>
tensors = [to(t) for t in tensors]
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10595, in to
return tensor.to(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2025-01-30 19:23:26.032] [SAPIEN] [critical] Mem free failed with error code 700!
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.033] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
CUDA error at /__w/SAPIEN/SAPIEN/3rd_party/sapien-vulkan-2/src/core/buffer.cpp 103: an illegal memory access was encountered

This occurs when the Maniskill env is using CUDA. |
Ah ok, I thought it was the same issue. Is your policy on CUDA? |
@vmoens You are right, I should open another issue, thank you |
Describe the bug
I've observed that my latest trainings crash after 180k steps with the following message:
To Reproduce
Expected behavior
No crash
System info
output:
Checklist