CUDA error: an illegal memory access was encountered #225

Open
Iron486 opened this issue Nov 3, 2024 · 1 comment

Iron486 commented Nov 3, 2024

Hi!

I am running the following command to train from scratch:
python train_full_pipeline.py -s /home/farchid/research/project_1/south-building -r "dn_consistency" --high_poly True --export_obj True

where south-building is one of the COLMAP datasets downloaded from here. The run fails with two errors, shown in the output below:

Using high poly config.
Will export a UV-textured mesh as an .obj file.
Will export a ply file with the refined 3D Gaussians at the end of the training.
Optimizing output/vanilla_gs/south-building/
Output folder: output/vanilla_gs/south-building/ [03/11 18:34:56]
Tensorboard not available: not logging progress [03/11 18:34:56]
Reading camera 128/128 [03/11 18:34:57]
Loading Training Cameras [03/11 18:34:57]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [03/11 18:34:57]
Loading Test Cameras [03/11 18:35:15]
Number of points at initialisation :  61342 [03/11 18:35:15]
Training progress:  16%|███████████████████████████▋                    | 1100/7000 [00:20<01:43, 57.04it/s, Loss=0.5989216]
Traceback (most recent call last):
  File "/home/farchid/research/project_1/SuGaR/./gaussian_splatting/train.py", line 220, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "/home/farchid/research/project_1/SuGaR/./gaussian_splatting/train.py", line 87, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, bg)
  File "/home/farchid/research/project_1/SuGaR/gaussian_splatting/gaussian_renderer/__init__.py", line 99, in render
    "visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Training progress:  16%|███████████████████████████▋                                                                                                                                                    | 1100/7000 [00:20<01:50, 53.19it/s, Loss=0.5989216]
Using original 3DGS rasterizer from Inria.
Using high poly config.
Will export a UV-textured mesh as an .obj file.
Will export a ply file with the refined 3D Gaussians at the end of the training.
Changing sh_levels to match the loaded model: 4
-----Parsed parameters-----
Source path: /home/farchid/research/project_1/south-building
   > Content: 6
Gaussian Splatting checkpoint path: output/vanilla_gs/south-building/
   > Content: 3
SUGAR checkpoint path: ./output/coarse/south-building/sugarcoarse_3Dgs7000_densityestim02_sdfnorm02/
Iteration to load: 7000
Output directory: ./output/coarse/south-building
Depth-Normal consistency factor: 0.05
SDF estimation factor: 0.2
SDF better normal factor: 0.2
Eval split: True
White background: False
---------------------------
Using device: 0
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Requested memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| GPU reserved memory   |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Non-releasable memory |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|


Loading config output/vanilla_gs/south-building/...
Performing train/eval split...
Found image extension .JPG
Traceback (most recent call last):
  File "/home/farchid/research/project_1/SuGaR/train.py", line 133, in <module>
    coarse_sugar_path = coarse_training_with_density_regularization_and_dn_consistency(coarse_args)
  File "/home/farchid/research/project_1/SuGaR/sugar_trainers/coarse_density_and_dn_consistency.py", line 377, in coarse_training_with_density_regularization_and_dn_consistency
    nerfmodel = GaussianSplattingWrapper(
  File "/home/farchid/research/project_1/SuGaR/sugar_scene/gs_model.py", line 162, in __init__
    self.gaussians.load_ply(
  File "/home/farchid/research/project_1/SuGaR/gaussian_splatting/scene/gaussian_model.py", line 216, in load_ply
    plydata = PlyData.read(path)
  File "/home/farchid/anaconda3/envs/sugar/lib/python3.9/site-packages/plyfile.py", line 401, in read
    (must_close, stream) = _open_stream(stream, 'read')
  File "/home/farchid/anaconda3/envs/sugar/lib/python3.9/site-packages/plyfile.py", line 481, in _open_stream
    return (True, open(stream, read_or_write[0] + 'b'))
FileNotFoundError: [Errno 2] No such file or directory: 'output/vanilla_gs/south-building/point_cloud/iteration_7000/point_cloud.ply'
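
The FileNotFoundError at the end looks like a knock-on effect of the first crash rather than a separate bug: the vanilla Gaussian Splatting stage stopped at iteration 1100, so the iteration_7000 point cloud that the coarse SuGaR stage tries to load was probably never written. Below is a minimal sketch of a pre-flight check, assuming the output layout shown in the error message. The helper name `expected_ply` is hypothetical and not part of SuGaR.

```python
from pathlib import Path


def expected_ply(gs_output_dir: str, iteration: int) -> Path:
    """Build the point-cloud path using the layout seen in the error message."""
    return (
        Path(gs_output_dir)
        / "point_cloud"
        / f"iteration_{iteration}"
        / "point_cloud.ply"
    )


if __name__ == "__main__":
    # Paths taken from the log above; adjust them for another scene.
    ply = expected_ply("output/vanilla_gs/south-building/", 7000)
    if ply.is_file():
        print(f"Found {ply}; the coarse stage should be able to load it.")
    else:
        print(f"Missing {ply}; the first training stage likely did not finish.")
```

If the file is missing, fixing the CUDA error in the first stage should make this second error go away as well.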

How can I solve this issue? I have CUDA 11.8 installed, and I am using an RTX 4090 with 24 GB VRAM. The error appears when the training progress bar is at 16% (iteration 1100 of 7000).
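
Because CUDA kernels are launched asynchronously, the Python line in the traceback (`radii > 0`) may only be where the error surfaced, not where the bad access happened. One way to narrow it down is to rerun with `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so the error is reported closer to the offending call. Here is a minimal sketch that reuses the exact command from this report. The helpers `blocking_env` and `rerun` are hypothetical names, and the sketch assumes the pipeline's child processes inherit the environment.

```python
import os
import shlex
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Same command as in the report above, unchanged.
COMMAND = (
    "python train_full_pipeline.py "
    "-s /home/farchid/research/project_1/south-building "
    '-r "dn_consistency" --high_poly True --export_obj True'
)


def rerun():
    """Launch the pipeline from the current directory with blocking launches."""
    return subprocess.run(shlex.split(COMMAND), env=blocking_env(), check=False)


if __name__ == "__main__":
    if "--run" in sys.argv:
        rerun()
    else:
        # Dry run: only show what would be executed.
        print("CUDA_LAUNCH_BLOCKING=1", COMMAND)
```

Training will be slower in this mode, so it is only meant for locating the failing kernel, not for regular runs.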

Thank you!

@xjxinchina

I ran into exactly the same issue with the same dataset. When I switched to data captured by a drone, I got a different error instead:

RuntimeError: CUDA error: misaligned address
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

After running for a while longer, it ran out of GPU memory:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 684.00 MiB (GPU 0; 23.64 GiB total capacity; 23.02 GiB already allocated; 19.31 MiB free; 23.11 GiB reserved in total by PyTorch). If reserved memory is >> allocated memory, try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
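
The numbers in that message suggest a genuine memory shortage rather than fragmentation: the `max_split_size_mb` hint applies when reserved memory is much larger than allocated memory, and here the two are nearly equal. The following small sketch pulls the figures out of the message and computes the gap. The function name `parse_oom` is hypothetical, and the regular expression assumes the message format quoted above.

```python
import re

OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 684.00 MiB (GPU 0; 23.64 GiB total "
    "capacity; 23.02 GiB already allocated; 19.31 MiB free; 23.11 GiB reserved "
    "in total by PyTorch)"
)

_UNITS_IN_MIB = {"MiB": 1.0, "GiB": 1024.0}

_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) (?P<request_unit>[MG]iB).*?"
    r"(?P<total>[\d.]+) (?P<total_unit>[MG]iB) total capacity; "
    r"(?P<allocated>[\d.]+) (?P<allocated_unit>[MG]iB) already allocated; "
    r"(?P<free>[\d.]+) (?P<free_unit>[MG]iB) free; "
    r"(?P<reserved>[\d.]+) (?P<reserved_unit>[MG]iB) reserved"
)


def parse_oom(message):
    """Extract the sizes from the OOM message, all converted to MiB."""
    match = _PATTERN.search(message)
    if match is None:
        raise ValueError("message does not match the assumed OOM format")
    return {
        key: float(match[key]) * _UNITS_IN_MIB[match[key + "_unit"]]
        for key in ("request", "total", "allocated", "free", "reserved")
    }


if __name__ == "__main__":
    sizes = parse_oom(OOM_MESSAGE)
    # Memory PyTorch holds in its cache but has not handed out to tensors.
    cached_unused = sizes["reserved"] - sizes["allocated"]
    print(f"reserved - allocated: {cached_unused:.2f} MiB")
    print(f"free on device:       {sizes['free']:.2f} MiB")
    print(f"requested:            {sizes['request']:.2f} MiB")
```

The gap between reserved and allocated memory is only about 0.09 GiB, far less than the 684 MiB request, so tuning `max_split_size_mb` alone seems unlikely to help. Reducing memory use (for example fewer or lower-resolution images) looks like the more promising direction.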
