PU learning is a method in which a binary classifier is learned in a semi-supervised way from only positive and unlabeled sample points. It is especially useful as a warm start for some applications.

After trying out multiple methods from Bing Liu's webpage, recent work by Masashi Sugiyama, and this blog post, I stumbled upon a forum post which led me to the paper by Liu, Bing, et al. "Partially supervised classification of text documents." ICML. Vol. 2. 2002. It contains the beautiful idea of spies, which is implemented here and outperforms most other methods while being much faster.
The general idea is to hide a random sample of the positive examples (the spies, S) among the unlabeled points and then follow these steps:
- Train a model with MS (mix + spies) as the negative class and P (positive) as the positive class.
- Use the trained model to predict likelihoods for MS and, based on the distribution of S, choose a threshold such that the spies are well separated. This splits MS into N (a likely negative group containing only a few spies) and U (an unlabeled group containing most of the spies).
- Use N as the negative group and P as the positive group to train a new classifier. Use it as your final model.
Note that spies uses two models along the way, and any classifier can be used for either of them. The experimental results presented below use XGBClassifier.
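The steps above can be sketched in a few lines. This is a minimal illustration, not the code in pu_learning.py: the function name `spies_fit` and the parameters `spy_ratio` and `spy_tolerance` are hypothetical, the threshold is assumed to be a low quantile of the spies' scores, and scikit-learn's `LogisticRegression` stands in for XGBClassifier so the example only needs NumPy and scikit-learn.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


def spies_fit(first_model, second_model, X, y, spy_ratio=0.15, spy_tolerance=0.05, seed=0):
    """Illustrative spy-based PU training; y == 1 is positive, y == 0 is unlabeled."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    unl_idx = np.flatnonzero(y == 0)

    # Pick the spies S: a random subset of P that is hidden among the unlabeled points.
    n_spies = max(1, int(round(spy_ratio * len(pos_idx))))
    spy_idx = rng.choice(pos_idx, size=n_spies, replace=False)
    rest_pos_idx = np.setdiff1d(pos_idx, spy_idx)

    # Step 1: MS (unlabeled mix + spies) is the negative class, the remaining P is positive.
    ms_idx = np.concatenate([unl_idx, spy_idx])
    X_step1 = np.vstack([X[rest_pos_idx], X[ms_idx]])
    y_step1 = np.concatenate([np.ones(len(rest_pos_idx), dtype=int),
                              np.zeros(len(ms_idx), dtype=int)])
    first = clone(first_model).fit(X_step1, y_step1)

    # Step 2: choose the threshold from the spies' scores (assumed rule: a low quantile),
    # then keep the unlabeled points scoring below it as the likely negatives N.
    spy_scores = first.predict_proba(X[spy_idx])[:, 1]
    threshold = np.quantile(spy_scores, spy_tolerance)
    unl_scores = first.predict_proba(X[unl_idx])[:, 1]
    likely_neg_idx = unl_idx[unl_scores < threshold]

    # Step 3: train the final model on all of P against N.
    X_final = np.vstack([X[pos_idx], X[likely_neg_idx]])
    y_final = np.concatenate([np.ones(len(pos_idx), dtype=int),
                              np.zeros(len(likely_neg_idx), dtype=int)])
    final = clone(second_model).fit(X_final, y_final)
    return final, likely_neg_idx


# Synthetic demo: two Gaussian blobs, only 60 of the 200 positives carry a label.
demo_rng = np.random.default_rng(42)
X_demo = np.vstack([demo_rng.normal(2.0, 1.0, size=(200, 2)),
                    demo_rng.normal(-2.0, 1.0, size=(200, 2))])
y_true = np.concatenate([np.ones(200, dtype=int), np.zeros(200, dtype=int)])
y_pu = np.zeros(400, dtype=int)
y_pu[:60] = 1

final_model, likely_neg_idx = spies_fit(LogisticRegression(), LogisticRegression(), X_demo, y_pu)
accuracy = (final_model.predict(X_demo) == y_true).mean()
purity = (y_true[likely_neg_idx] == 0).mean()
print(f"accuracy on true labels: {accuracy:.3f}, share of true negatives in N: {purity:.3f}")
```

The sketch clones both models so the caller's estimators are left unfitted. A real implementation would also need to handle the case where no unlabeled point falls below the threshold.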
All you need is the pu_learning.py file (everything else in this repo is just for explanation purposes).
To use the spies method, pass two models and then call fit and predict as follows:
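```python
from xgboost import XGBClassifier

from pu_learning import spies

# The first model separates positives from the mix with spies,
# the second one is trained as the final classifier.
model = spies(XGBClassifier(), XGBClassifier())
model.fit(X, y)
model.predict(X)
```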
Experiments were executed using the notebook from this blog post. I used baggingPU.py from this repo with DecisionTreeClassifier and SVC at its core, so that we have a benchmark to beat. Bagging is much slower than spies, and even though bagging with SVCs performs better than bagging with trees in my use case, it is simply infeasible to use because of its run time.
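For readers unfamiliar with the benchmark, the idea behind PU bagging can be sketched as follows. This is a generic illustration of the technique, not the contents of baggingPU.py: the function name `pu_bagging_scores` and its parameters are hypothetical, and the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def pu_bagging_scores(X, y, n_estimators=50, seed=0):
    """Average out-of-bag positive-class scores for the unlabeled points (y == 0)."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    unl_idx = np.flatnonzero(y == 0)
    score_sum = np.zeros(len(unl_idx))
    score_cnt = np.zeros(len(unl_idx))

    for _ in range(n_estimators):
        # Bootstrap as many unlabeled points as there are positives; treat them as negatives.
        in_bag = rng.choice(len(unl_idx), size=len(pos_idx), replace=True)
        X_train = np.vstack([X[pos_idx], X[unl_idx[in_bag]]])
        y_train = np.concatenate([np.ones(len(pos_idx), dtype=int),
                                  np.zeros(len(in_bag), dtype=int)])
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1_000_000)))
        tree.fit(X_train, y_train)

        # Score only the unlabeled points that were left out of this bag.
        oob = np.setdiff1d(np.arange(len(unl_idx)), in_bag)
        if len(oob) == 0:
            continue
        score_sum[oob] += tree.predict_proba(X[unl_idx[oob]])[:, 1]
        score_cnt[oob] += 1

    # Points that were never out of bag keep a score of 0.
    return unl_idx, score_sum / np.maximum(score_cnt, 1)


# Synthetic demo: two Gaussian blobs, only 60 of the 200 positives carry a label.
bag_rng = np.random.default_rng(7)
X_bag = np.vstack([bag_rng.normal(2.0, 1.0, size=(200, 2)),
                   bag_rng.normal(-2.0, 1.0, size=(200, 2))])
y_bag_true = np.concatenate([np.ones(200, dtype=int), np.zeros(200, dtype=int)])
y_bag = np.zeros(400, dtype=int)
y_bag[:60] = 1

bag_unl_idx, bag_scores = pu_bagging_scores(X_bag, y_bag)
hidden_pos_mean = bag_scores[y_bag_true[bag_unl_idx] == 1].mean()
true_neg_mean = bag_scores[y_bag_true[bag_unl_idx] == 0].mean()
print(f"mean score of hidden positives: {hidden_pos_mean:.3f}, of true negatives: {true_neg_mean:.3f}")
```

Every bagging round fits a fresh classifier, so the cost grows with the number of estimators. That is consistent with bagging being slower than spies, which fits only two models.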
Results achieved by PU spies.
When compared to other methods, PU spies clearly stands out.
Method | Run time |
---|---|
Standard classifier | 2.6 s |
PU bagging (tree) | 4.03 s |
PU bagging (SVC) | 55.6 s |
Spies | 0.2 s |
Spies using XGB is much faster than the other PU methods and just as accurate.