GPU documentation/status #54

Open
saraemoore opened this issue May 11, 2021 · 2 comments

Comments

@saraemoore

Is there documentation or an update available on a GPU-accelerated Rborist (previously referred to elsewhere as "Curborist")? The current Rborist docs state that it is "tuned for multicore and GPU hardware," but I wasn't able to find any details on usage options for GPUs specifically. Thanks in advance for any details you can share!

@suiji
Owner

suiji commented May 13, 2021

"Curborist" stood for CUda-enabled Rborist. Repartitioning of the observations was performed on the GPU using a stable partition. The multicore implementation was at least as fast, however, on data sets with fewer than roughly ten thousand observations. The most obvious bottlenecks arose from data movement and from the fact that splitting was still performed on the CPU. It is fairly clear how to perform splitting on the GPU using parallel prefix, even with the additional complication posed by the Random Forests algorithm's variable sampling. This work was not completed, however, as there remained other, more easily implemented opportunities to improve performance on the CPU alone.

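For readers unfamiliar with the two scan-based steps mentioned above, here is a minimal Python sketch. It is not Rborist or Curborist code: the function names (`exclusive_scan`, `stable_partition`, `best_cut`) are hypothetical, the sum-of-squares regression criterion is an assumed example, and the serial loops only stand in for what a GPU would do with a parallel prefix sum. Ties in predictor value and per-node variable sampling are ignored.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] == sum(values[:i])."""
    return list(accumulate(values, initial=0))[:-1]


def stable_partition(indices, goes_left):
    """Move left-bound observations ahead of right-bound ones, keeping order.

    The scans give every element its destination slot independently of the
    others, which is why the scatter step parallelizes well.
    """
    left_flags = [1 if g else 0 for g in goes_left]
    right_flags = [1 - f for f in left_flags]
    left_pos = exclusive_scan(left_flags)
    right_pos = exclusive_scan(right_flags)
    n_left = sum(left_flags)

    out = [None] * len(indices)
    for i, idx in enumerate(indices):
        dest = left_pos[i] if left_flags[i] else n_left + right_pos[i]
        out[dest] = idx
    return out, n_left


def best_cut(sorted_y):
    """Pick the cut maximizing sum_L^2/n_L + sum_R^2/n_R (assumed criterion).

    sorted_y holds responses ordered by one predictor. A single prefix sum
    supplies the left-hand sum for every candidate cut at once.
    Returns the size of the left side, or None if no cut is possible.
    """
    n = len(sorted_y)
    if n < 2:
        return None
    prefix = list(accumulate(sorted_y))
    total = prefix[-1]

    best_gain, best_i = float("-inf"), None
    for i in range(1, n):  # i observations go left, n - i go right
        left_sum = prefix[i - 1]
        right_sum = total - left_sum
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if gain > best_gain:
            best_gain, best_i = gain, i
    return best_i


if __name__ == "__main__":
    order, n_left = stable_partition(
        [10, 11, 12, 13, 14, 15],
        [False, True, True, False, True, False],
    )
    print(order, n_left)
    print(best_cut([1, 1, 1, 5, 5]))
```

Both steps reduce to prefix sums followed by independent per-element work, which is the property that makes them candidates for a coprocessor.
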
Having dealt with the difficulties of tracking CUDA across multiple platforms and releases, however, I now believe that a more maintainable approach will be to use the newer features of OpenMP for GPU parallelization, particularly those enabled by versions 5.0 and 5.1. The current plan is to offer a fat binary which will look for a GPU and, upon finding one or more, invoke both repartitioning and splitting on the coprocessor - when it makes sense to do so. Rborist version 0-3.0 has been extensively refactored to make this possible. Even given this reorganization, though, a truly GPU-capable implementation will not be available before 0-4.0. One reason is that compilers supporting the new standards do not appear to be generally available. Last I checked, only Cray and AMD offered support. That said, we envision that the only intervention required of the user will be to set the "enableCoprocessor" option to TRUE.

For completeness, I should also point out that there was a proprietary CUDA version developed seven or eight years ago. This was specialized for low-categoricity classification, especially genomic work. It scaled quite nicely with predictor count, with 50x speedup over a bespoke multicore equivalent. The algorithm did not scale beyond roughly one thousand observations, though, and the approach has been abandoned in the open-source versions to follow.

@saraemoore
Author

Thanks very much for the very thorough response! Kudos for the preparatory work to make this feature possible. I'll keep an eye out for it in future releases.
