
oob_prediction_ in RandomForestClassifier #267

Open
WZheng-94 opened this issue Mar 1, 2018 · 12 comments

@WZheng-94

WZheng-94 commented Mar 1, 2018

Does anyone know how to get error rates for each category of a binary variable in RandomForestClassifier?
I found that oob_prediction_ seems to be exclusive to RandomForestRegressor.

Thanks.

@khan1792

khan1792 commented Mar 2, 2018

1. Combine the y values and the OOB prediction values into a data frame.
2. Split it into two groups (prediction >= 0.5 or not).
3. Calculate the error rate for each of the two groups.
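
For concreteness, a minimal sketch of those three steps, assuming hypothetical arrays `y` (true 0/1 labels) and `oob` (continuous OOB predictions):

```python
import numpy as np
import pandas as pd

# hypothetical inputs for illustration only
y = np.array([0, 0, 1, 1, 1])
oob = np.array([0.2, 0.6, 0.8, 0.4, 0.9])

df = pd.DataFrame({"y": y, "oob": oob})

# split on the 0.5 cutoff to get predicted classes
df["pred"] = (df["oob"] >= 0.5).astype(int)

# error rate within each group: fraction of rows where prediction != truth
df["wrong"] = (df["y"] != df["pred"]).astype(int)
error_rates = df.groupby("pred")["wrong"].mean()
print(error_rates)
```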

@WZheng-94
Author

But the problem set asks us to use RandomForestClassifier?

@khan1792

khan1792 commented Mar 3, 2018

I mistakenly used the RF regressor, but the calculation method remains the same. I just took one more step to transform the numeric prediction values into 1 or 0, whereas the RF classifier would produce them directly.
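
A sketch of that regressor-based route, assuming synthetic 0/1 data; `oob_prediction_` is the attribute RandomForestRegressor exposes when `oob_score=True`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor

# hypothetical binary-labeled data for illustration
X, y = make_classification(n_samples=500, random_state=0)

reg = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
reg.fit(X, y)

# oob_prediction_ holds continuous OOB predictions; threshold them to 0/1
oob_pred = (reg.oob_prediction_ >= 0.5).astype(int)
```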

@WZheng-94
Author

My problem is that I think the RF classifier doesn't have an oob_prediction_ attribute. I hope it's just something I missed.

@khan1792

khan1792 commented Mar 3, 2018

Ahhhh, I see... haven’t tried it...

@khan1792

khan1792 commented Mar 3, 2018

I have checked the source code of the sklearn package. In RandomForestClassifier, we can use oob_decision_function_ to calculate the OOB prediction:

1. Transpose the matrix produced by oob_decision_function_.
2. Select the second row of the matrix.
3. Set a cutoff and transform all decimal values to 1 or 0 (>= 0.5 is 1, otherwise 0).

The list of values we finally get is the OOB prediction.
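
A minimal sketch of those steps on a small synthetic dataset (using column indexing rather than a literal transpose, which comes to the same thing):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# hypothetical binary-classification data for illustration
X, y = make_classification(n_samples=500, random_state=0)

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

# Column 1 of oob_decision_function_ is the OOB probability of class 1
# (the same as transposing and taking the second row).
proba_1 = clf.oob_decision_function_[:, 1]

# Apply the 0.5 cutoff to get 0/1 OOB predictions
oob_pred = (proba_1 >= 0.5).astype(int)
```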

@WZheng-94
Author

WZheng-94 commented Mar 3, 2018 via email

@AlexanderTyan

AlexanderTyan commented Mar 5, 2018

Cross-checking against the way sklearn calculates RandomForestClassifier().oob_score_, I believe the cutoff for step 3 should be (> 0.5 is 1 and <= 0.5 is 0). This may impact MSE calculations because there are edge cases where the probability is exactly 0.5. Sklearn seems to decide conservatively in favor of 0's.
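
A quick illustration of that edge case, assuming the tie-breaking comes from an argmax over the class probabilities (as discussed further below); np.argmax takes the first maximum on a tie:

```python
import numpy as np

# On an exact 0.5/0.5 tie, argmax returns the first index, i.e. class 0,
# which matches the "> 0.5 is 1, <= 0.5 is 0" reading of the cutoff.
print(np.argmax([0.5, 0.5]))  # -> 0
```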

@khan1792

khan1792 commented Mar 5, 2018

Forget the last message...
Sure, I didn't realize there could be an exact 0.5 among the OOB predictions. oob_score_ classifies 0.5 as 0, while I classified it as 1. This cutoff is tricky with small data, although the classification decision is arbitrary.

@AlexanderTyan

Right, I think 0.5 IS quite arbitrary on the part of sklearn. I just went along with it for the MSE calculations: if that's what sklearn uses to come up with predicted y values, it would be consistent to evaluate MSE according to the same threshold.

@AlexanderTyan

Looking into why the 0.5, I found:
https://books.google.com/books?id=bRpYDgAAQBAJ&pg=PT278&lpg=PT278&dq=sklearn+oob_decision_function_&source=bl&ots=h3VyS5rM9K&sig=Yw0RkHTA4whxi8E5teWMiF6bX6o&hl=en&sa=X&ved=0ahUKEwj-rfu519TZAhWl24MKHf-7D8YQ6AEIjQEwCA#v=onepage&q=sklearn%20oob_decision_function_&f=false

It seems that the way sklearn decides between category 0 and 1 is by comparing the relative probabilities that an observation is 0 or 1 in the oob_decision_function_ matrix. If the second value is larger than the first, it decides in favour of 1. Since the probabilities add up to one, this effectively means a 0.5 threshold.
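
A minimal sketch of that rule, assuming a hypothetical two-column probability matrix shaped like oob_decision_function_'s output:

```python
import numpy as np

# hypothetical OOB probability rows: [P(class 0), P(class 1)]
proba = np.array([[0.7, 0.3],
                  [0.2, 0.8],
                  [0.5, 0.5]])  # exact tie

# "second value larger than the first" decides class 1; argmax breaks
# the tie in favour of class 0 (the first maximum)
pred_compare = proba.argmax(axis=1)              # -> [0, 1, 0]

# because each row sums to 1, this is the same as a strict > 0.5
# threshold on the class-1 column
pred_thresh = (proba[:, 1] > 0.5).astype(int)    # -> [0, 1, 0]

assert (pred_compare == pred_thresh).all()
```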

@khan1792

khan1792 commented Mar 5, 2018

Most classifiers decide the value according to this kind of probability. Theoretically, the threshold point should not be classified as 1 or 0, because either choice makes the two sides unbalanced. But practically it's not a problem whether it is classified as 1 or 0: as long as the sample size is large enough and the number of trees in the model is also large, its effect on the result will decrease.
