
oob_prediction_ in RandomForestClassifier #267

Open
WZheng-94 opened this issue Mar 1, 2018 · 12 comments

@WZheng-94

WZheng-94 commented Mar 1, 2018

Does anyone know how to get error rates for each category of a binary variable in RandomForestClassifier?
I found that oob_prediction_ seems to be exclusive to RandomForestRegressor.

Thanks.

@khan1792

khan1792 commented Mar 2, 2018

1. Combine the y values and the OOB prediction values into a data frame.
2. Split it into two groups (prediction >= 0.5 or not).
3. Calculate the error rate for each of the two groups.
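
For concreteness, a minimal sketch of those three steps, assuming hypothetical arrays `y` (true 0/1 labels) and `oob` (continuous OOB predictions):

```python
import numpy as np
import pandas as pd

# hypothetical inputs for illustration only
y = np.array([0, 0, 1, 1, 1])
oob = np.array([0.2, 0.6, 0.8, 0.4, 0.9])

df = pd.DataFrame({"y": y, "oob": oob})

# split on the 0.5 cutoff to get predicted classes
df["pred"] = (df["oob"] >= 0.5).astype(int)

# error rate within each group: fraction of rows where prediction != truth
df["wrong"] = (df["y"] != df["pred"]).astype(int)
error_rates = df.groupby("pred")["wrong"].mean()
print(error_rates)
```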

@WZheng-94
Author

But the problem set asks us to use RandomForestClassifier?

@khan1792

khan1792 commented Mar 3, 2018

I mistakenly used the RF regressor, but the calculation method remains the same. I just took one more step to transform the numeric prediction values into 1 or 0, whereas the RF classifier would produce them directly.
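
A sketch of that regressor-based route, assuming synthetic 0/1 data; `oob_prediction_` is the attribute RandomForestRegressor exposes when `oob_score=True`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor

# hypothetical binary-labeled data for illustration
X, y = make_classification(n_samples=500, random_state=0)

reg = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
reg.fit(X, y)

# oob_prediction_ holds continuous OOB predictions; threshold them to 0/1
oob_pred = (reg.oob_prediction_ >= 0.5).astype(int)
```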

@WZheng-94
Author

My problem is that I think the RF classifier doesn't have an oob_prediction_ attribute. I hope it's just something I missed.

@khan1792

khan1792 commented Mar 3, 2018

Ahhhh, I see... haven’t tried it...

@khan1792

khan1792 commented Mar 3, 2018

I have checked the source code of the sklearn package. In RandomForestClassifier, we can use oob_decision_function_ to calculate the OOB prediction:

1. Transpose the matrix produced by oob_decision_function_.
2. Select the second row of the matrix.
3. Set a cutoff and transform all decimal values to 1 or 0 (>= 0.5 is 1, otherwise 0).

The list of values we finally get is the OOB prediction.
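
A minimal sketch of those steps on a small synthetic dataset (using column indexing rather than a literal transpose, which comes to the same thing):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# hypothetical binary-classification data for illustration
X, y = make_classification(n_samples=500, random_state=0)

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

# Column 1 of oob_decision_function_ is the OOB probability of class 1
# (the same as transposing and taking the second row).
proba_1 = clf.oob_decision_function_[:, 1]

# Apply the 0.5 cutoff to get 0/1 OOB predictions
oob_pred = (proba_1 >= 0.5).astype(int)
```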

@WZheng-94
Author

WZheng-94 commented Mar 3, 2018 via email

@AlexanderTyan

AlexanderTyan commented Mar 5, 2018

Cross-checking against the way sklearn calculates RandomForestClassifier().oob_score_, I believe the cutoff for step 3 should be (> 0.5 is 1 and <= 0.5 is 0). This may impact MSE calculations because there are edge cases where the probability is exactly 0.5. Sklearn seems to decide conservatively in favor of 0's.
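
A quick illustration of that edge case, assuming the tie-breaking comes from an argmax over the class probabilities (as discussed further below); np.argmax takes the first maximum on a tie:

```python
import numpy as np

# On an exact 0.5/0.5 tie, argmax returns the first index, i.e. class 0,
# which matches the "> 0.5 is 1, <= 0.5 is 0" reading of the cutoff.
print(np.argmax([0.5, 0.5]))  # -> 0
```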

@khan1792

khan1792 commented Mar 5, 2018

Forget the last message...
Sure, I didn't realize there could be an exact 0.5 among the OOB predictions. oob_score_ classifies 0.5 as 0, while I classified it as 1. This cutoff is tricky with small data, although the classification decision is arbitrary.

@AlexanderTyan

Right, I think 0.5 IS quite arbitrary on the part of sklearn. I just went along with it for the MSE calculations: if that's what sklearn uses to come up with predicted y values, it would be consistent to evaluate MSE according to the same threshold.

@AlexanderTyan

Looking into why the 0.5, I found:
https://books.google.com/books?id=bRpYDgAAQBAJ&pg=PT278&lpg=PT278&dq=sklearn+oob_decision_function_&source=bl&ots=h3VyS5rM9K&sig=Yw0RkHTA4whxi8E5teWMiF6bX6o&hl=en&sa=X&ved=0ahUKEwj-rfu519TZAhWl24MKHf-7D8YQ6AEIjQEwCA#v=onepage&q=sklearn%20oob_decision_function_&f=false

It seems that the way sklearn decides between category 0 and 1 is by comparing the relative probabilities that an observation is 0 or 1 in the oob_decision_function_ matrix. If the second value is larger than the first, it decides in favour of 1. Since the probabilities add up to one, this effectively means a 0.5 threshold.
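
A minimal sketch of that rule, assuming a hypothetical two-column probability matrix shaped like oob_decision_function_'s output:

```python
import numpy as np

# hypothetical OOB probability rows: [P(class 0), P(class 1)]
proba = np.array([[0.7, 0.3],
                  [0.2, 0.8],
                  [0.5, 0.5]])  # exact tie

# "second value larger than the first" decides class 1; argmax breaks
# the tie in favour of class 0 (the first maximum)
pred_compare = proba.argmax(axis=1)              # -> [0, 1, 0]

# because each row sums to 1, this is the same as a strict > 0.5
# threshold on the class-1 column
pred_thresh = (proba[:, 1] > 0.5).astype(int)    # -> [0, 1, 0]

assert (pred_compare == pred_thresh).all()
```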

@khan1792

khan1792 commented Mar 5, 2018

Most classifiers decide the value according to this kind of probability. Theoretically, the threshold point should not be classified as 1 or 0, because either choice makes the two sides unbalanced. But practically it's not a problem whether it is classified as 1 or 0: as long as the sample size is large enough and the number of trees in the model is also large, its effect on the result will decrease.
