Feature Requests #54
It would be fantastic to have someone else help me with this package and grow some (possibly experimental) functionality. I'm committed to maintaining the package, but if you would like to join the team, that would be great! The things you're mentioning would definitely be very nice additions.
Here is the simplest best subset regression:

```julia
using Combinatorics  # powerset() generates all subsets

# g() trains an OLS model on every column subset and records its out-of-sample SSE
@inline function g(X, y, train, test, n_ops, subs, res)
    @inbounds for m = 1:n_ops
        b = X[train, subs[m]] \ y[train]   # OLS fit on the training rows
        ê = X[test, subs[m]]*b - y[test]   # oos residuals ê = y - ŷ
        res[m] = ê'ê                       # SSE of subset m
    end
    return res
end

# f() trains using all columns of X
@inline function f(X, y, train, test)
    b = X[train, :] \ y[train]
    ê = X[test, :]*b - y[test]             # oos residuals ê = y - ŷ
    return ê'ê                             # SSE
end

# Test drive
using MLJ
X, y = @load_boston
Xnames = keys(X)
X = hcat(X...)                             # convert the column table to a matrix
train, test = partition(eachindex(y), 0.7, rng=333)

p = size(X, 2)                             # p is the only hyperparameter
n_ops = 2^p - 1                            # number of non-empty subsets
res = fill(1e5, n_ops)                     # sentinel SSE values
subs = collect(powerset(1:p, 1))           # all non-empty subsets of 1:p

@time f(X, y, train, test)
@time g(X, y, train, test, n_ops, subs, res)
extrema(res)
subs[argmin(res)]                          # column indices of the best linear model
Xnames[subs[argmin(res)]]                  # names of those columns

# using BenchmarkTools
# @benchmark f(X, y, train, test)
# @benchmark g(X, y, train, test, n_ops, subs, res)
```

Of course this needs to be cross-validated, etc. (a K-fold sketch follows below). The other valuable lesson here is that exhaustively training all 2^p − 1 subsets quickly becomes infeasible as p grows.
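Since a single 70/30 split is noisy, here is a minimal K-fold cross-validation sketch for scoring one column subset `s`. The helper `cv_sse` and the interleaved fold construction are my own illustrative choices, not part of the snippet above.

```julia
# K-fold CV of the out-of-sample SSE for a single column subset `s`
# (hypothetical helper; folds are simple interleaved index sets).
function cv_sse(X, y, s; K = 5)
    n = length(y)
    sse = 0.0
    for i in 1:K
        te = i:K:n                 # every K-th observation forms fold i
        tr = setdiff(1:n, te)      # remaining rows form the training set
        b = X[tr, s] \ y[tr]       # OLS fit on the training folds
        ê = X[te, s]*b - y[te]     # held-out residuals
        sse += ê'ê
    end
    return sse
end

# e.g. score the single-split winner with CV instead:
# cv_sse(X, y, subs[argmin(res)])
```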
Thanks a lot for this! I'm a bit swamped at the moment but will try to look into it soon. If anything, it could make a nice tutorial too!
@tlienart
All Subset (Best K Subset):
I've got parsimonious code for all subset regression (shared above) which might have a nice home in this package.
While it's much faster than R's leaps package, I would include a strong warning to users:
this optimization problem is thoroughly non-convex, and with 2^p − 1 candidate models (already over a million at p = 20) it should be avoided for p > 20.
Approximate All Subset (MIO):
As you probably know,
Bertsimas et al. (2016) show how to use mixed-integer optimization (MIO) to approximately train all subset regression for p = 1000 in minutes;
Bertsimas et al. (2019) do this for p = 100k in minutes.
The corresponding Julia package is SubsetSelection.jl (a hedged usage sketch follows below).
Also see the forthcoming Julia package MIP-Boost.
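Going by my reading of the SubsetSelection.jl README, usage looks roughly like the sketch below; the `subsetSelection(loss, sparsity, Y, X)` signature and the `OLS`/`Constraint` types are assumptions to verify against the package docs.

```julia
using SubsetSelection

# Approximately select the best k features under an OLS loss
# (call signature per my reading of the README — verify before use).
k = 5
est = subsetSelection(OLS(), Constraint(k), y, X)
est.indices   # indices of the selected columns (assumed field)
est.w         # regression weights on those columns (assumed field)
```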
It would be really awesome if these linear models were featured in MLJ.
Relaxed Lasso:
Hastie et al. (2017) respond that the relaxed Lasso matches the Lasso in low-SNR scenarios and beats all subset in high-SNR scenarios.
glmnet now features the "relaxed Lasso".
Would this be difficult to add to MLJLinearModels? (Maybe we can work on it together? A sketch of the idea follows below.)
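For concreteness, a minimal sketch of the (simplified) relaxed Lasso: fit the Lasso, refit OLS on its active set, and blend the two fits with a parameter γ. I'm assuming MLJLinearModels' `fit(LassoRegression(λ), X, y)` returns the coefficient vector; `relaxed_lasso` is a hypothetical helper, not an existing API.

```julia
using MLJLinearModels

# Simplified relaxed Lasso (Hastie et al., 2017): γ = 1 recovers the Lasso fit,
# γ = 0 the least-squares refit restricted to the Lasso's active set.
function relaxed_lasso(X, y, λ, γ)
    β̂ = fit(LassoRegression(λ; fit_intercept = false), X, y)  # assumed API
    A = findall(!iszero, β̂)             # active set selected by the Lasso
    β_ls = zeros(length(β̂))
    β_ls[A] = X[:, A] \ y                # OLS refit on the support
    return γ .* β̂ .+ (1 - γ) .* β_ls
end
```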
Linear Interactions:
There are some cool new packages for interactions in linear models.
I find these competitive in out-of-sample fit with non-linear models (XGBoost etc., no joke).
R: sprintr & HierNet
Python: HierScale
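The naive version of the idea, minus the hierarchy constraints and screening those packages add, is just to augment X with all pairwise products and fit a linear model. `with_interactions` below is a hypothetical helper, not any package's API.

```julia
# Augment X with all pairwise interaction columns x_i .* x_j (i < j).
# The cited packages add hierarchy constraints and screening on top of this.
function with_interactions(X)
    p = size(X, 2)
    inter = [X[:, i] .* X[:, j] for i in 1:p for j in i+1:p]
    return hcat(X, inter...)
end

# b = with_interactions(X[train, :]) \ y[train]  # then proceed as before
```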
Nonconvex penalties (MCP, SCAD):
https://github.com/joshday/SparseRegression.jl/blob/master/README.md
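For reference, the MCP penalty itself is a simple piecewise function of each coefficient; a pointwise definition (my own illustration, not SparseRegression.jl's API):

```julia
# MCP penalty (Zhang, 2010) on one coefficient β, with level λ and concavity
# γ > 1: tapers the ℓ1 penalty quadratically, then goes flat beyond γλ.
mcp(β, λ, γ) = abs(β) ≤ γ*λ ? λ*abs(β) - β^2 / (2γ) : γ*λ^2 / 2
```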
Forward selection / stepwise regression:
(Stack Overflow answer with Julia code)
Textbook code
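A bare-bones greedy version in the same spirit as the snippet above, stopping when no remaining column improves the held-out SSE (my own sketch; `forward_select` is hypothetical):

```julia
# Greedy forward selection: at each step add the column that most reduces
# the out-of-sample SSE; stop when no candidate improves the current fit.
function forward_select(X, y, train, test)
    selected, best_sse = Int[], Inf
    while length(selected) < size(X, 2)
        remaining = setdiff(1:size(X, 2), selected)
        sses = map(remaining) do j
            cols = vcat(selected, j)
            b = X[train, cols] \ y[train]
            ê = X[test, cols]*b - y[test]
            ê'ê
        end
        sse, i = findmin(sses)
        sse ≥ best_sse && break     # no candidate helps; stop
        best_sse = sse
        push!(selected, remaining[i])
    end
    return selected, best_sse
end
```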
Ensembles of Penalized Linear Models:
Christidis et al. (2019): paper; old code: EnsembleEN; new code: SplitReg.