Hi, we want to migrate to the CUDA backend from the GPU/OpenCL backend but have found that the CUDA backend uses significantly more GPU memory than the OpenCL backend, leading to errors running out of GPU memory in some cases.
E.g., this test code trains 100 trees and monitors the maximum GPU memory used:
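The original script isn't reproduced here, so below is only a minimal sketch of that kind of test. The dataset shape (50,000,000 rows × 100 features, chosen to match the ~5,000,000,000-value total mentioned below), the synthetic data, and the `nvidia-smi` polling thread are all assumptions, not the actual code from this report.

```python
import subprocess
import threading
import time

import lightgbm as lgb
import numpy as np

# Assumed shape: 50M rows x 100 float32 features = 5e9 values total.
# Note this needs ~20 GB of host RAM just for the raw training matrix.
n_rows, n_cols = 50_000_000, 100
rng = np.random.default_rng(0)
X = rng.random((n_rows, n_cols), dtype=np.float32)
y = rng.random(n_rows, dtype=np.float32)

peak_mib = 0

def poll_gpu_memory(stop_event):
    """Record the peak GPU memory reported by nvidia-smi while training runs."""
    global peak_mib
    while not stop_event.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        ).stdout
        values = [int(v) for v in out.split()]
        if values:
            peak_mib = max(peak_mib, max(values))
        time.sleep(0.5)

stop = threading.Event()
monitor = threading.Thread(target=poll_gpu_memory, args=(stop,), daemon=True)
monitor.start()

params = {
    "objective": "regression",
    "device_type": "cuda",  # change to "gpu" for the OpenCL backend
    "max_bin": 255,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)

stop.set()
monitor.join()
print(f"peak GPU memory: {peak_mib} MiB")
```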
The training dataset has 5,000,000,000 total floats, and when quantized these should be able to be stored in one byte, requiring 5 GB for the full training dataset.
I built LightGBM at version 4.6.0 with CUDA and OpenCL support.
Running this on an A100 GPU uses 11,580 MiB, but if I change `device_type` to `gpu`, it uses 7,812 MiB.
I tested the effect of the `max_bin` parameter. Memory usage is unchanged at `max_bin = 255` (the default), but if I reduce it to 15, OpenCL only requires 2,890 MiB while CUDA still needs 11,580 MiB. I also tested increasing it beyond 256, which raises an error with the OpenCL backend, but with the CUDA backend the GPU memory jumped to 20,538 MiB. I would have expected an increase of 5 GB if an extra byte were used for each training value, but it seems like there's nearly an additional 2 bytes per training value?
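For reference, here is the back-of-the-envelope arithmetic behind that last estimate, using the figures quoted above (a sketch, not output from the actual run):

```python
# Extra GPU memory per training value when going from max_bin = 255 to max_bin > 256 (CUDA backend).
n_values = 5_000_000_000
baseline_mib = 11_580    # CUDA, max_bin = 255
large_bin_mib = 20_538   # CUDA, max_bin > 256
extra_bytes = (large_bin_mib - baseline_mib) * 1024**2
print(extra_bytes / n_values)  # ~1.9 extra bytes per training value
```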
I wondered if the subsample gets materialised and stored separately on the GPU, but changing the subsample size doesn't seem to affect GPU memory usage.
I also noticed that the CUDA backend only supports double precision, whereas the OpenCL backend uses single-precision floats by default. However, changing OpenCL to use double precision with `gpu_use_dp` only slightly increased memory use.
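For completeness, this is the kind of parameter combination being compared in that check (a sketch; the OpenCL backend is selected with `device_type = "gpu"` and `gpu_use_dp` switches it to double precision):

```python
params = {
    "objective": "regression",
    "device_type": "gpu",  # OpenCL backend
    "gpu_use_dp": True,    # double precision instead of the default single precision
    "max_bin": 255,
}
```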
Is this expected behaviour and is there a known reason why the CUDA backend uses so much more memory?
It is true that more memory is used with `device_type=cuda`, because we adopt a more efficient way to compute the histograms. As a cost, we store a separate copy of the discretized data in GPU memory, which should roughly double the memory cost of discretized data storage.
We may consider adding a memory-efficient (but perhaps slightly slower) mode in the future.
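As a rough sanity check of that explanation against the numbers reported above (an estimate only; it assumes the factor of two from the reply and ignores gradients, histograms, and other working buffers):

```python
# Two copies of the 1-byte-per-value discretized data on the CUDA backend.
n_values = 5_000_000_000
bytes_per_value = 1                                  # max_bin <= 255
one_copy_gib = n_values * bytes_per_value / 1024**3  # ~4.66 GiB
print(2 * one_copy_gib)  # ~9.3 GiB, most of the ~11.3 GiB (11,580 MiB) observed with CUDA
```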