Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for mgcv multi-predictor models #216

Merged
merged 3 commits into from
Jul 27, 2020
Merged

Add support for mgcv multi-predictor models #216

merged 3 commits into from
Jul 27, 2020

Conversation

hriebl
Copy link
Contributor

@hriebl hriebl commented Jul 23, 2020

The mgcv::gam() function supports a couple of multi-predictor models such as Gaussion location-scale regression models, see ?mgcv::gaulss. This PR implements this kind of models with emmeans. It seems to work for me, but I'm not sure if it is 100 % correct. Please review and feel free to merge or ask for adjustments. Cheers!

@hriebl
Copy link
Contributor Author

hriebl commented Jul 23, 2020

Oh, maybe it's easier to understand this PR with a small example:

library(emmeans)
library(mgcv)

x <- sample(c(0, 1), 100, replace = TRUE)
y <- rnorm(100, x, exp(x))
x <- as.factor(x)

fit1 <- gam(y ~ x)
fit2 <- gam(list(y ~ x, ~ x), family = gaulss())

emmeans(fit1, ~ x)
emmeans(fit2, ~ x)

emmeans(fit1, ~ x, what = 1)
emmeans(fit1, ~ x, what = 2)
emmeans(fit2, ~ x, what = 1)
emmeans(fit2, ~ x, what = 2)
emmeans(fit2, ~ x, what = 2, type = "response")

pairs(emmeans(fit2, ~ x, what = 1))
pairs(emmeans(fit2, ~ x, what = 2, type = "response"))

@rvlenth
Copy link
Owner

rvlenth commented Jul 24, 2020

Thanks! This looks potentially quite useful.

Before merging this in, I wonder if you could look into a couple of changes as I think they'd be fairly simple.

  1. I wonder if this could be made more robust by way of supporting other models that have multiple formulas (as I understand it, those are the families gaulss, gevlss, and ziplss). I tried doing emmeans() with the first example for gevlss and it failed -- apparently because X is not a list but rather a single vector. (Perhaps there is a missing drop = FALSE in the underlying code? If so, it could potentially cause trouble with gaulss too.)

  2. To be consistent with the style of emmeans support for other models, I think what should be replaced by, say, mode, and have character choices. For gaulss and gevlss, the natural choices would be c("location", "scale", "shape"); adding in ziplss, I guess the additions would be zi and count. It'd take only a couple lines of code to match those internally with your what parameter.

  3. I don't like that one gets results for fit1, what = 2). It's often nice to be forgiving, but in that particular case I think I'd rather see an error or warning.

  4. Could you update DESCRIPTION listing you as a contributor, and NEWS.md thanking you for the contribution?

Let me know what you think of these suggestions. And again, thanks.

@rvlenth
Copy link
Owner

rvlenth commented Jul 25, 2020

Note -- I confirmed that the gevlss example works if you add drop = FALSE in line 93:

    if (!is.null(sel)) {
        bhat = bhat[sel]
        X = X[, sel, drop = FALSE]
        V = V[sel, sel]
    }

The problem with that example is that the default reference grid has only one row, so X[, sel] becomes a vector instead of a matrix.

@rvlenth
Copy link
Owner

rvlenth commented Jul 25, 2020

Additional note. I don't think replacing logb by log is right. However, object$family$linfo[[what]] returns a list that looks like it will work correctly as the misc$tran slot. It would be helpful to add misc$tran$name = "logb", then it will be labeled as such in the summaries.

@hriebl
Copy link
Contributor Author

hriebl commented Jul 25, 2020

Thank you for your detailed answer! I will try to fix these things and report back as soon as I'm done.

I wonder if this could be made more robust by way of supporting other models that have multiple formulas (as I understand it, those are the families gaulss, gevlss, and ziplss). I tried doing emmeans() with the first example for gevlss and it failed -- apparently because X is not a list but rather a single vector. (Perhaps there is a missing drop = FALSE in the underlying code? If so, it could potentially cause trouble with gaulss too.)

There are also the mvn and multinom families. Should we try to cover those as well?

Additional note. I don't think replacing logb by log is right. However, object$family$linfo[[what]] returns a list that looks like it will work correctly as the misc$tran slot. It would be helpful to add misc$tran$name = "logb", then it will be labeled as such in the summaries.

I know, that was just a quick hack for my own purposes. I will try to find a better solution here. 🙂

@hriebl
Copy link
Contributor Author

hriebl commented Jul 26, 2020

I've added support for

  • strings as a what argument, and
  • the logb link function.

I haven't tested things thoroughly yet, but maybe you could already take a look and let me know if you like the approach?

@rvlenth
Copy link
Owner

rvlenth commented Jul 26, 2020

Hmmm. We have a ways to go yet, I'm afraid.

  • I don't want the changes you made in transformations.R, because the "logb" thing is specific to this package.
    I need to ask that you revert transformations.R to what it was before.
  • There is no partial matching of character arguments.

For the transformation, the following code in emm_basis.gam is more like what I'm thinking:

        ...
        nbasis = estimability::all.estble
        # delete the definitions of object$family$*
        misc = .std.link.labels(object$family$link[[what]], list())
        if (misc$tran == "logb") {
            misc$tran = object$family$linfo[[what]]
            misc$tran$name = "logb"
        }
        ...

What I think was perhaps not clear is that tran can be character or a list with the needed elements, and that linfo[[what]] has everything that is needed, except for the name.

For the partial matching, I don't think we need what_arg in addition to what. The function beginning should look something like the following. Note that all the available string options are included in the header and matched as needed.

emm_basis.gam = function(object, trms, xlev, grid,
                         freq = FALSE, unconditional = FALSE,
                         what = c("location", "scale", "shape", "zi", "condmean"), ...) {
   
    if (fam_name == "multinom" || fam_name == "mvn") {
        if (!is.numeric(what)) {
            stop("Family '", fam_name, "' requires a numeric argument 'what'")
        }
    } else
        mode = match.arg(what)
    if (fam_name == "ziplss") {
        what = switch(what, location = 1, zi = 1, condmean = 2,   99)
    } else {
        what = switch(what, location = 1, scale = 2, shape = 3,   99)
    }
    if (what == 99)
        stop("Illegal 'what' argument for this model")
    ...

BTW, the way ziplss is documented is confusing, but "presence" occurs in both parts so I think it's not a good choice to distinguish them. I changed it to condmean and added zi. Both are kind of borrowed from the choices made in the glmmTMB package (see https://cran.r-project.org/web/packages/glmmTMB/glmmTMB.pdf, p. 3)

Again, look at how other models' methods are implemented and try to follow the same style.

@rvlenth
Copy link
Owner

rvlenth commented Jul 26, 2020

PS I'm not sure about multinom and mvn yet. I would think that a user would want to look at all dimensions together, as it is a multivariate model. But having different formulas and potentially different predictors involved makes this pretty complicated.

@hriebl
Copy link
Contributor Author

hriebl commented Jul 27, 2020

Thank you for your further comments! This pull request is indeed trickier than expected. I made a few more changes, which I will try to explain here:

  • I don't want the changes you made in transformations.R, because the "logb" thing is specific to this package.
    I need to ask that you revert transformations.R to what it was before.

Done. I didn't know tran could be a list, but now I followed your suggestion on this point.

  • There is no partial matching of character arguments.

Partial matching is implemented now, but the details of my implementation differ from your suggestion. See below.

...
# delete the definitions of object$family$*
misc = .std.link.labels(object$family$link[[what]], list())
...

I don't think this can work, because the .std.link.labels() function doesn't operate on strings.

...
if (fam_name == "multinom" || fam_name == "mvn") {
    if (!is.numeric(what)) {
        stop("Family '", fam_name, "' requires a numeric argument 'what'")
    }
} else
    mode = match.arg(what)
...

This snippet works, but it disallows numeric what arguments for anything but the multinom and the mvn families. If this is the behaviour you want, I'm happy to implement it, but I feel numeric what arguments would be convenient and consistent with the way the models are specified in the gam() calls (i.e. as unnamed lists).

...
if (fam_name == "ziplss") {
    what = switch(what, location = 1, zi = 1, condmean = 2,   99)
} else {
    what = switch(what, location = 1, scale = 2, shape = 3,   99)
}
if (what == 99)
    stop("Illegal 'what' argument for this model")
...

I'm afraid the what == 99 condition would not catch all illegal arguments here, i.e. if someone used what = "scale" with a simple single-predictor gaussian family, what would be 2 (but still illegal). For this reason, I stuck with the checks I had implemented previously (at least for now).

BTW, the way ziplss is documented is confusing, but "presence" occurs in both parts so I think it's not a good choice to distinguish them. I changed it to condmean and added zi. Both are kind of borrowed from the choices made in the glmmTMB package (see https://cran.r-project.org/web/packages/glmmTMB/glmmTMB.pdf, p. 3)

I agree, the ziplss family is very confusing. In fact, I found out that it is what is called a hurdle model in most other contexts rather than a zero-inflated model. This is not documented, however, and to see this, one needs to take a look at the definition of the ziplss log-likelihood:

https://github.com/cran/mgcv/blob/master/R/gamlss.r#L1136

The highlighted line shows that the log-likelihood of the zero-observations depends on only one of the two parameters (the one given by the second linear predictor). So, to sum things up: The first linear predictor defines the rate of a zero-truncated Poisson distribution, and the second linear predictor defines the probability of a positive response value. For this reason, I think condmean and zi are not ideal parameter names, but I can implement them if you still like them best. I'm also thinking about sending Simon Wood a message about a documentation update for the ziplss family...

PS I'm not sure about multinom and mvn yet. I would think that a user would want to look at all dimensions together, as it is a multivariate model. But having different formulas and potentially different predictors involved makes this pretty complicated.

I agree that the multinom and mvn families are conceptually different from the others, but treating the different predictors as if they were one is probably not an ideal solution either, I guess. Maybe we could do the latter but print a warning about it? It's up to you, just let me know about your decision on these two families...

Best regards!

@rvlenth
Copy link
Owner

rvlenth commented Jul 27, 2020

Thanks, Hannes. I think this is at least very close, and I'm going to go ahead and merge this in. I am not totally convinced on the choices of "rate" and "prob.gt.0", and most particularly wonder if they are not in reverse order to what is documented. Reading this again I do think it is likely the hurdle model not the ZI model, in which case "ziplss" is a really poor naming choice.

Another thing I'll do after I merge is clean-up the commented-out code that you had to deal with. Ought to make it mode readable for sure.

I will investigate further, and keep you copied in. Please do likewise on anything you hear from Simon.

Thanks for the contribution!

Russ

@rvlenth rvlenth merged commit 9a2e155 into rvlenth:master Jul 27, 2020
@rvlenth
Copy link
Owner

rvlenth commented Jul 27, 2020

Hannes,

I added this to the DESCRIPTION file. Please confirm that this is how you want your contribution to be listed:

person("Hannes", "Riebl", role = "ctb")

(or perhaps more formally, is it Johannes?)

A couple other things, briefly:

  • I'm now more convinced that you have the order of things right in the ziplss case
  • On multinom and mvn, the "right" thing to do is probably to obtain the union of all the variables involved in the various formulas, and use those to define the reference grid; and add a pseudo-factor (say, dim) to define which dimension. That's what's done with multivariate models, but it's easier because we can just create the reference grid as a Kronecker product. And similar is true of nnet::multinom` models, but again we are helped by them all being the same formula. This isn't insurmountable, just tedious.

@rvlenth
Copy link
Owner

rvlenth commented Jul 27, 2020

The way some of this stuff is implemented is really nasty. Consider:

example(ziplss)
...
> rg0 = ref_grid(b, what="prob", at=list(x0=c(.25,.5,.75), x2 = c(.33,.67)))
> rgr = ref_grid(b, what="rate", at=list(x0=c(.25,.5,.75), x2 = c(.33,.67)))

> summary(rg0)
   x2    x3   x0    x1 prediction    SE  df
 0.33 0.497 0.25 0.506      -0.23 0.125 484
 0.67 0.497 0.25 0.506      -0.23 0.125 484
 0.33 0.497 0.50 0.506       0.38 0.122 484
 0.67 0.497 0.50 0.506       0.38 0.122 484
 0.33 0.497 0.75 0.506       0.25 0.126 484
 0.67 0.497 0.75 0.506       0.25 0.126 484

> summary(rgr)
   x2    x3   x0    x1 prediction     SE  df
 0.33 0.497 0.25 0.506       2.13 0.0625 484
 0.67 0.497 0.25 0.506       1.20 0.0917 484
 0.33 0.497 0.50 0.506       2.13 0.0625 484
 0.67 0.497 0.50 0.506       1.20 0.0917 484
 0.33 0.497 0.75 0.506       2.13 0.0625 484
 0.67 0.497 0.75 0.506       1.20 0.0917 484

This matches

> predict(b, newdata = rg0@grid)
      [,1]       [,2]
1 2.133299 -0.2303537
2 1.201992 -0.2303537
3 2.133299  0.3803800
4 1.201992  0.3803800
5 2.133299  0.2504356
6 1.201992  0.2504356

The default links are both "identity", but the documentation says that the actual models used use "log" and ("logit" or "cloglog", depending on where you read) links, respectively.

> summary(update(rg0, tran = "logit"), type = "response")
   x2    x3   x0    x1 response     SE  df
 0.33 0.497 0.25 0.506    0.443 0.0308 484
 0.67 0.497 0.25 0.506    0.443 0.0308 484
 0.33 0.497 0.50 0.506    0.594 0.0294 484
 0.67 0.497 0.50 0.506    0.594 0.0294 484
 0.33 0.497 0.75 0.506    0.562 0.0311 484
 0.67 0.497 0.75 0.506    0.562 0.0311 484

> summary(update(rg0, tran = "cloglog"), type = "response")
   x2    x3   x0    x1 response     SE  df
 0.33 0.497 0.25 0.506    0.548 0.0448 484
 0.67 0.497 0.25 0.506    0.548 0.0448 484
 0.33 0.497 0.50 0.506    0.768 0.0413 484
 0.67 0.497 0.50 0.506    0.768 0.0413 484
 0.33 0.497 0.75 0.506    0.723 0.0450 484
 0.67 0.497 0.75 0.506    0.723 0.0450 484
> p = as.data.frame(.Last.value)$response

> summary(update(rgr, tran = "log"), type = "response")
   x2    x3   x0    x1 response    SE  df
 0.33 0.497 0.25 0.506     8.44 0.527 484
 0.67 0.497 0.25 0.506     3.33 0.305 484
 0.33 0.497 0.50 0.506     8.44 0.527 484
 0.67 0.497 0.50 0.506     3.33 0.305 484
 0.33 0.497 0.75 0.506     8.44 0.527 484
 0.67 0.497 0.75 0.506     3.33 0.305 484
> r = as.data.frame(.Last.value)$response

Meanwhile,

> predict(b, newdata = rg0@grid, type = "response")
       1        2        3        4        5        6 
4.628267 1.891237 6.488937 2.651557 6.107349 2.495630 

> p*r
[1] 4.627270 1.823322 6.487539 2.556340 6.106033 2.406011

These results are nearly but not quite equal. As some might say, "WTF?" Clearly, what they say they are doing is not the same as what they are actually doing. [But this does indeed look like a zero-inflated, not a hurdle model.]

@hriebl
Copy link
Contributor Author

hriebl commented Jul 28, 2020

Thank you, Russell, for putting more thought into this!

person("Hannes", "Riebl", role = "ctb")

This is exactly right. Hannes is not a short form of Johannes (at least not in my case). And thank you for adding me, I appreciate it!

(or perhaps more formally, is it Johannes?)

(I get asked that a lot but usually by Germans. Now I'm curious: Do you speak German?)

  • On multinom and mvn, the "right" thing to do is probably to obtain the union of all the variables involved in the various formulas, and use those to define the reference grid; and add a pseudo-factor (say, dim) to define which dimension. That's what's done with multivariate models, but it's easier because we can just create the reference grid as a Kronecker product. And similar is true of nnet::multinom models, but again we are helped by them all being the same formula. This isn't insurmountable, just tedious.

Maybe it makes sense to offer both options for these models, marginal marginal means and joint marginal means in a sense?

I can definitely think of applied modelling problems, where one would use a multivariate normal model (because the correlation parameters are of interest and maybe to stabilize the estimation) but still be interested in the covariate effects on the means of each margin. In fact, I started to work on this pull request because of a project where this would have been the case (we were looking at height and diameter growth of different tree species but switched to a univariate model later for the sake of simplicity).

Anyway, let me know if you can use any help with the implementation of this feature!

> p*r
[1] 4.627270 1.823322 6.487539 2.556340 6.106033 2.406011

These results are nearly but not quite equal. As some might say, "WTF?" Clearly, what they say they are doing is not the same as what they are actually doing. [But this does indeed look like a zero-inflated, not a hurdle model.]

Use p*r/(1-exp(-r)) (i.e. p times the expectation of the zero-truncated Poisson distribution) and you'll get exactly the same results. I'm pretty confident it actually is a hurdle model, as I spent a fair amount of time deciphering the mgcv source code yesterday.

And yes, it really is a mess. The way things are currently implemented (and named/documented) in mgcv, emmeans cannot do things "right". A potential user will always be confused, no matter how we call the linear predictors. That being said, I will report back as soon as I have an answer from Simon.

@rvlenth
Copy link
Owner

rvlenth commented Jul 28, 2020

I have a colleague named Johannes who goes by Hannes. My name is German in origin, and I can order some food and say bitte and danke, but do not speak German.

On the mvn model, if we have a complete reference grid for all the dimensions, we can always select a dimension or average over them using standard emmeans() capabilities, so we don't really need a separate option to do that.

Duh... You're right about the truncation calculations and the hurdle model. Tempted to remove my last comment so my idiocy isn't so exposed. But guess I'll have to leave it.

I think I'm good with what we have for now, but will work toward some improvements and implementing the multivariate stuff fairly soon. Look forward to learning anything you hear from Simon.

rvlenth added a commit that referenced this pull request Jul 28, 2020
@rvlenth
Copy link
Owner

rvlenth commented Jul 29, 2020

I just pushed another update that makes some changes. Specifically it tries to fix the cases where the claimed link is "identity" but actually it's something else. I also decided that the logb link info in linfo is not correct; I replaced it with make.tran("genlog", -0.01). But who knows, really?

@rvlenth rvlenth mentioned this pull request Jul 29, 2020
@hriebl
Copy link
Contributor Author

hriebl commented Jul 30, 2020

Your latest changes look good to me, and make.tran("genlog", -0.01) gives much more reasonable results for emmeans(fit2, ~ x, what = 2, type = "response") in the example in #216 (comment). Just two very minor comments:

  • Maybe it makes sense to change
    misc = .std.link.labels(list(link = link), list())
    to
    misc = .std.link.labels(list(link = link, family = fam_name), list())
    here:
    https://github.com/rvlenth/emmeans/blob/master/R/gam-support.R#L126.
    Otherwise, .std.link.labels() won't do some of the things it's supposed to do.
  • The b in the logb link is configurable. You could use
    1/object$family$linfo[[what_num]]$linkinv(-Inf)
    to retrieve its value instead of hard-coding its default.

Of course, these things are not really important, but they came to my mind while reading the latest code.

Best regards!

@rvlenth
Copy link
Owner

rvlenth commented Jul 30, 2020 via email

@rvlenth
Copy link
Owner

rvlenth commented Jul 30, 2020

PS -- To get b, I'm using - environment(object$family$linfo[[what_num]]$linkfun)$b. Took some digging to find this...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants