-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for mgcv multi-predictor models #216
Conversation
Oh, maybe it's easier to understand this PR with a small example: library(emmeans)
library(mgcv)
x <- sample(c(0, 1), 100, replace = TRUE)
y <- rnorm(100, x, exp(x))
x <- as.factor(x)
fit1 <- gam(y ~ x)
fit2 <- gam(list(y ~ x, ~ x), family = gaulss())
emmeans(fit1, ~ x)
emmeans(fit2, ~ x)
emmeans(fit1, ~ x, what = 1)
emmeans(fit1, ~ x, what = 2)
emmeans(fit2, ~ x, what = 1)
emmeans(fit2, ~ x, what = 2)
emmeans(fit2, ~ x, what = 2, type = "response")
pairs(emmeans(fit2, ~ x, what = 1))
pairs(emmeans(fit2, ~ x, what = 2, type = "response")) |
Thanks! This looks potentially quite useful. Before merging this in, I wonder if you could look into a couple of changes as I think they'd be fairly simple.
Let me know what you think of these suggestions. And again, thanks. |
Note -- I confirmed that the if (!is.null(sel)) {
bhat = bhat[sel]
X = X[, sel, drop = FALSE]
V = V[sel, sel]
} The problem with that example is that the default reference grid has only one row, so |
Additional note. I don't think replacing |
Thank you for your detailed answer! I will try to fix these things and report back as soon as I'm done.
There are also the
I know, that was just a quick hack for my own purposes. I will try to find a better solution here. 🙂 |
I've added support for
I haven't tested things thoroughly yet, but maybe you could already take a look and let me know if you like the approach? |
Hmmm. We have a ways to go yet, I'm afraid.
For the transformation, the following code in ...
nbasis = estimability::all.estble
# delete the definitions of object$family$*
misc = .std.link.labels(object$family$link[[what]], list())
if (misc$tran == "logb") {
misc$tran = object$family$linfo[[what]]
misc$tran$name = "logb"
}
... What I think was perhaps not clear is that For the partial matching, I don't think we need emm_basis.gam = function(object, trms, xlev, grid,
freq = FALSE, unconditional = FALSE,
what = c("location", "scale", "shape", "zi", "condmean"), ...) {
if (fam_name == "multinom" || fam_name == "mvn") {
if (!is.numeric(what)) {
stop("Family '", fam_name, "' requires a numeric argument 'what'")
}
} else
mode = match.arg(what)
if (fam_name == "ziplss") {
what = switch(what, location = 1, zi = 1, condmean = 2, 99)
} else {
what = switch(what, location = 1, scale = 2, shape = 3, 99)
}
if (what == 99)
stop("Illegal 'what' argument for this model")
... BTW, the way Again, look at how other models' methods are implemented and try to follow the same style. |
PS I'm not sure about |
Thank you for your further comments! This pull request is indeed trickier than expected. I made a few more changes, which I will try to explain here:
Done. I didn't know
Partial matching is implemented now, but the details of my implementation differ from your suggestion. See below.
I don't think this can work, because the
This snippet works, but it disallows numeric
I'm afraid the
I agree, the https://github.com/cran/mgcv/blob/master/R/gamlss.r#L1136 The highlighted line shows that the log-likelihood of the zero-observations depends on only one of the two parameters (the one given by the second linear predictor). So, to sum things up: The first linear predictor defines the rate of a zero-truncated Poisson distribution, and the second linear predictor defines the probability of a positive response value. For this reason, I think
I agree that the Best regards! |
Thanks, Hannes. I think this is at least very close, and I'm going to go ahead and merge this in. I am not totally convinced on the choices of Another thing I'll do after I merge is clean-up the commented-out code that you had to deal with. Ought to make it mode readable for sure. I will investigate further, and keep you copied in. Please do likewise on anything you hear from Simon. Thanks for the contribution! Russ |
Hannes, I added this to the DESCRIPTION file. Please confirm that this is how you want your contribution to be listed: person("Hannes", "Riebl", role = "ctb") (or perhaps more formally, is it Johannes?) A couple other things, briefly:
|
The way some of this stuff is implemented is really nasty. Consider: example(ziplss)
...
> rg0 = ref_grid(b, what="prob", at=list(x0=c(.25,.5,.75), x2 = c(.33,.67)))
> rgr = ref_grid(b, what="rate", at=list(x0=c(.25,.5,.75), x2 = c(.33,.67)))
> summary(rg0)
x2 x3 x0 x1 prediction SE df
0.33 0.497 0.25 0.506 -0.23 0.125 484
0.67 0.497 0.25 0.506 -0.23 0.125 484
0.33 0.497 0.50 0.506 0.38 0.122 484
0.67 0.497 0.50 0.506 0.38 0.122 484
0.33 0.497 0.75 0.506 0.25 0.126 484
0.67 0.497 0.75 0.506 0.25 0.126 484
> summary(rgr)
x2 x3 x0 x1 prediction SE df
0.33 0.497 0.25 0.506 2.13 0.0625 484
0.67 0.497 0.25 0.506 1.20 0.0917 484
0.33 0.497 0.50 0.506 2.13 0.0625 484
0.67 0.497 0.50 0.506 1.20 0.0917 484
0.33 0.497 0.75 0.506 2.13 0.0625 484
0.67 0.497 0.75 0.506 1.20 0.0917 484 This matches > predict(b, newdata = rg0@grid)
[,1] [,2]
1 2.133299 -0.2303537
2 1.201992 -0.2303537
3 2.133299 0.3803800
4 1.201992 0.3803800
5 2.133299 0.2504356
6 1.201992 0.2504356 The default links are both > summary(update(rg0, tran = "logit"), type = "response")
x2 x3 x0 x1 response SE df
0.33 0.497 0.25 0.506 0.443 0.0308 484
0.67 0.497 0.25 0.506 0.443 0.0308 484
0.33 0.497 0.50 0.506 0.594 0.0294 484
0.67 0.497 0.50 0.506 0.594 0.0294 484
0.33 0.497 0.75 0.506 0.562 0.0311 484
0.67 0.497 0.75 0.506 0.562 0.0311 484
> summary(update(rg0, tran = "cloglog"), type = "response")
x2 x3 x0 x1 response SE df
0.33 0.497 0.25 0.506 0.548 0.0448 484
0.67 0.497 0.25 0.506 0.548 0.0448 484
0.33 0.497 0.50 0.506 0.768 0.0413 484
0.67 0.497 0.50 0.506 0.768 0.0413 484
0.33 0.497 0.75 0.506 0.723 0.0450 484
0.67 0.497 0.75 0.506 0.723 0.0450 484
> p = as.data.frame(.Last.value)$response
> summary(update(rgr, tran = "log"), type = "response")
x2 x3 x0 x1 response SE df
0.33 0.497 0.25 0.506 8.44 0.527 484
0.67 0.497 0.25 0.506 3.33 0.305 484
0.33 0.497 0.50 0.506 8.44 0.527 484
0.67 0.497 0.50 0.506 3.33 0.305 484
0.33 0.497 0.75 0.506 8.44 0.527 484
0.67 0.497 0.75 0.506 3.33 0.305 484
> r = as.data.frame(.Last.value)$response Meanwhile, > predict(b, newdata = rg0@grid, type = "response")
1 2 3 4 5 6
4.628267 1.891237 6.488937 2.651557 6.107349 2.495630
> p*r
[1] 4.627270 1.823322 6.487539 2.556340 6.106033 2.406011 These results are nearly but not quite equal. As some might say, "WTF?" Clearly, what they say they are doing is not the same as what they are actually doing. [But this does indeed look like a zero-inflated, not a hurdle model.] |
Thank you, Russell, for putting more thought into this!
This is exactly right. Hannes is not a short form of Johannes (at least not in my case). And thank you for adding me, I appreciate it!
(I get asked that a lot but usually by Germans. Now I'm curious: Do you speak German?)
Maybe it makes sense to offer both options for these models, marginal marginal means and joint marginal means in a sense? I can definitely think of applied modelling problems, where one would use a multivariate normal model (because the correlation parameters are of interest and maybe to stabilize the estimation) but still be interested in the covariate effects on the means of each margin. In fact, I started to work on this pull request because of a project where this would have been the case (we were looking at height and diameter growth of different tree species but switched to a univariate model later for the sake of simplicity). Anyway, let me know if you can use any help with the implementation of this feature!
Use And yes, it really is a mess. The way things are currently implemented (and named/documented) in mgcv, emmeans cannot do things "right". A potential user will always be confused, no matter how we call the linear predictors. That being said, I will report back as soon as I have an answer from Simon. |
I have a colleague named Johannes who goes by Hannes. My name is German in origin, and I can order some food and say bitte and danke, but do not speak German. On the Duh... You're right about the truncation calculations and the hurdle model. Tempted to remove my last comment so my idiocy isn't so exposed. But guess I'll have to leave it. I think I'm good with what we have for now, but will work toward some improvements and implementing the multivariate stuff fairly soon. Look forward to learning anything you hear from Simon. |
I just pushed another update that makes some changes. Specifically it tries to fix the cases where the claimed link is |
Your latest changes look good to me, and
Of course, these things are not really important, but they came to my mind while reading the latest code. Best regards! |
Hannes,
You are right on both of these. I actually looked at ,.std.link.labels, but didn't look hard enough, obviously. Must be my pandemic-addled brain. I'll make those changes
Russ
|
PS -- To get |
The
mgcv::gam()
function supports a couple of multi-predictor models such as Gaussion location-scale regression models, see?mgcv::gaulss
. This PR implements this kind of models with emmeans. It seems to work for me, but I'm not sure if it is 100 % correct. Please review and feel free to merge or ask for adjustments. Cheers!