
Feature request: `predict.Rborist` with nTree argument #39

Open
hadjipantelis opened this issue Mar 10, 2018 · 5 comments

Comments

@hadjipantelis

I have a feature request: would it be possible for predict.Rborist to accept an nTree argument, similar to the standard Rborist call? Given that a forest has N trees, we should be able to produce predictions using only N-M of them.

This functionality can be useful for seeing if/when additional trees lead to over-fitting.

@suiji
Owner

suiji commented Mar 10, 2018

This could easily be slipped into the upcoming version, but why not just retrain with n-m trees?

Prediction could be parametrized with a logical vector, for example, with n-m entries of TRUE and the remaining m entries FALSE. It seems, though, that these entries would need to be chosen at random, then applied iteratively, in order to get a good sense of over-fitting.
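(To make the logical-vector idea concrete: given per-tree predictions, a boolean mask selects which trees contribute to the ensemble average. The following is a hypothetical NumPy sketch of the concept, not Rborist's actual interface; `tree_preds` stands in for a trained forest's per-tree output.)

```python
import numpy as np

rng = np.random.default_rng(0)

n_trees, n_rows = 10, 5
# Hypothetical per-tree predictions: tree_preds[t, r] is tree t's
# prediction for row r (a stand-in for a trained regression forest).
tree_preds = rng.normal(size=(n_trees, n_rows))

def predict_subset(tree_preds, keep):
    """Average predictions over only the trees where keep[t] is True."""
    keep = np.asarray(keep, dtype=bool)
    return tree_preds[keep].mean(axis=0)

# Full forest: all entries TRUE.
full = predict_subset(tree_preds, np.ones(n_trees, dtype=bool))

# n-m random entries TRUE, the remaining m FALSE, as suggested above.
m = 3
keep = np.ones(n_trees, dtype=bool)
keep[rng.choice(n_trees, size=m, replace=False)] = False
subset = predict_subset(tree_preds, keep)

assert full.shape == subset.shape == (n_rows,)
```

Repeating the random masking over many draws of `keep` would give the iterative picture of over-fitting described above.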

@hadjipantelis
Author

Thank you for the speedy response.

Wouldn't retraining a new RF require additional time? (and space)

The use-case I am thinking of is that, just as one can select the number of iterations with a GBM, one could do something similar with an RF.
I appreciate that tree order is irrelevant in the case of an RF, so the logical-vector approach you describe is probably the ideal scenario; for a quick check, though, just having the first N-M entries set to TRUE and the final M to FALSE is probably fine. We are bootstrapping the original data anyway. Using only the first N-M trees will also be faster, because in a large forest we would not have to traverse all the trees and set them to zero.

@suiji
Owner

suiji commented Mar 12, 2018

Wouldn't retraining a new RF require additional time? (and space)

Assuming a "moderate" number of predictors, it takes (very roughly) about ten times as long to train as to predict. So, yes, resampling from the same forest will be faster than retraining each time. The results will not be identical to those obtained through retraining, but they may be suitable for your purposes. Unless forests are retained following prediction, though, there should not be a memory penalty.

just having the first N-M entries set to TRUE and the final M to FALSE is probably fine.

Yes, but a new feature like this should be sufficiently general to support a variety of use cases.

Just using the first N-M trees will also be faster because in a large forest we would not have to traverse all the trees and set them to zero.

I may be missing your point, but initializing an index vector seems like a two- or three-line operation at worst.

@hadjipantelis
Author

  1. Cool, we are in agreement on that.
  2. Sure thing! I am mostly thinking about what would be the most straightforward interface for the user.
  3. Agreed, but I was mostly thinking of the overhead of accessing the trees. I assume they are stored sequentially in memory, so we would have "unit-stride" access if we just used the first N-M trees rather than random access. Granted, it is a minor point!

@suiji
Owner

suiji commented Mar 14, 2018

I assume they are stored sequentially in memory so we will have "unit-stride" if we just used the first N-M trees rather than random access

Trees are stored sequentially but their sizes are not uniform so, in particular, stride is not fixed. In any case, bagging already introduces a precedent for ignoring a given tree at a given row. The feature you propose generalizes this a bit, when prediction is not bagged, by ignoring a given tree at all rows.
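(The bagging precedent can be sketched the same way: bagged prediction already skips a tree at the rows it trained on, which amounts to a per-(tree, row) inclusion matrix; ignoring a tree at all rows just clears that tree's entire slice of the matrix. A hypothetical NumPy sketch of the idea, not Rborist internals:)

```python
import numpy as np

rng = np.random.default_rng(1)
n_trees, n_rows = 6, 4

# Hypothetical per-tree predictions, as before.
tree_preds = rng.normal(size=(n_trees, n_rows))

# inbag[t, r] is True when row r was sampled for tree t; bagged
# (out-of-bag) prediction excludes those (tree, row) pairs.
inbag = rng.random((n_trees, n_rows)) < 0.5
include = ~inbag

def masked_mean(tree_preds, include):
    """Per-row average over only the included (tree, row) pairs."""
    counts = include.sum(axis=0)
    totals = np.where(include, tree_preds, 0.0).sum(axis=0)
    # Rows excluded by every tree get NaN rather than a bogus value.
    return np.where(counts > 0, totals / np.maximum(counts, 1), np.nan)

oob = masked_mean(tree_preds, include)

# Ignoring tree 0 at *all* rows generalizes the bag: clear its whole
# slice of the inclusion matrix.
include_no_t0 = include.copy()
include_no_t0[0, :] = False
oob_no_t0 = masked_mean(tree_preds, include_no_t0)

assert oob.shape == oob_no_t0.shape == (n_rows,)
```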
