Allowing two kinds of missing values (kept and not in regressions) and handling labelled variables both as numeric and character #770

Open
bixiou opened this issue Feb 28, 2025 · 2 comments

Comments

@bixiou

bixiou commented Feb 28, 2025

This is a feature request, cf. here.

@gorcha
Member

gorcha commented Mar 2, 2025

Hi @bixiou, can you please explain in detail what feature you are requesting, including why it would be helpful and examples of the desired behaviour? This should all be included in the issue rather than linking to external discussion.

Thanks!

@bixiou
Author

bixiou commented Mar 2, 2025

Sure, let me explain.

When analyzing surveys, I'd like to treat some missing values as special without coding them as NA. For example, take a variable "vote" taking the values "Left", "Center/Right", "Far right" and (the missing value) "PNR" (People Not Responding). NA would be reserved for a lack of answer (e.g. the question was not asked in this survey branch) rather than for a PNR answer. Below is an example where missing values could be treated differently, in a plot of answers:

[Image: plot of answers in which missing values are treated differently]

I also have to deal with data that is sometimes best handled as numeric and sometimes as character. For example, take a 5-point Likert scale variable (named "scale") ranging from "Strongly oppose" to "Strongly agree", which we can recode as -2 to +2. It is handier to write the condition "scale > 0" than "scale %in% c('Somewhat agree', 'Strongly agree')", especially given that the scale labels can change depending on the question. At the same time, I'd like both ways of writing this condition to work.
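To make the two ways of writing the condition concrete, here is a minimal base R sketch. It is only an illustration: the names `likert_labels`, `scale_codes` and `scale_chr`, as well as the middle label "Indifferent", are assumptions and not part of any package.

```r
# Hypothetical mapping from labels to numeric codes for a 5-point Likert item.
likert_labels <- c("Strongly oppose" = -2, "Somewhat oppose" = -1,
                   "Indifferent" = 0, "Somewhat agree" = 1,
                   "Strongly agree" = 2)

# Example answers, stored as numeric codes.
scale_codes <- c(2, -1, 0, 1)

# Numeric way of writing the condition.
agree_numeric <- scale_codes > 0

# Label-based way: translate the codes to their labels first.
scale_chr <- names(likert_labels)[match(scale_codes, likert_labels)]
agree_label <- scale_chr %in% c("Somewhat agree", "Strongly agree")

# Both conditions select the same answers.
identical(agree_numeric, agree_label)
```

The request is for a single labelled vector on which both conditions work directly, without the explicit translation step.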

Finally, in regressions, I would like to keep missing values by default. For example, lm((scale > 0) ~ vote, data = df) should include vote: PNR as a category. If a labelled variable like "vote" has numerical values attached (say from -1 for Left to +1 for Far right), I think it should be treated as categorical by default in regressions.
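For reference, this regression behaviour can be approximated in base R by storing "vote" as a factor in which "PNR" is an ordinary level, while true NAs are still dropped by lm's default NA handling. The data below are made up purely for illustration.

```r
# Made-up survey data: "PNR" is a special missing answer, NA is a question not asked.
df <- data.frame(
  vote  = c("Left", "Center/Right", "Far right", "PNR", NA, "Left", "PNR", "Far right"),
  scale = c(2, 1, -2, 0, 1, 1, -1, -1)
)

# Treat vote as categorical, with "PNR" as a regular level.
df$vote <- factor(df$vote, levels = c("Left", "Center/Right", "Far right", "PNR"))

# The row with vote = NA is dropped; the two PNR rows are kept.
fit <- lm((scale > 0) ~ vote, data = df)

names(coef(fit))  # includes "votePNR"
nobs(fit)         # number of rows actually used in the fit
```

The drawback of this workaround is that the numeric codes are lost once the variable is a factor, which is why a labelled type supporting both views is requested.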

Ideally, missing values should be treated as character values while not preventing meaningful numerical comparison. For example, if scale[1] holds the missing value "PNR", we'd ideally have both scale[1] < 0 and scale[1] >= 0 returning FALSE. However, I suspect this is not possible. We'd then have to assign a numerical value to missing values (e.g. -0.1) and be careful when writing conditions, e.g. remember to use scale < 0 & !is.missing(scale) instead of scale < 0.
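The pitfall with a numeric placeholder can be shown with a small base R sketch. The helper `is_missing()` and its `missing_codes` argument are hypothetical stand-ins written for this example; they are not the `is.missing()` of any package.

```r
# Hypothetical helper: TRUE for NA and for any code declared as missing.
is_missing <- function(x, missing_codes = -0.1) {
  is.na(x) | x %in% missing_codes
}

# -0.1 stands for "PNR"; NA stands for a question that was not asked.
scale_codes <- c(-0.1, -2, 1, NA)

# Naive condition: the PNR answer is wrongly counted as negative.
naive <- scale_codes < 0

# Guarded condition: PNR and NA are both excluded.
guarded <- scale_codes < 0 & !is_missing(scale_codes)

naive    # TRUE  TRUE FALSE    NA
guarded  # FALSE TRUE FALSE FALSE
```

Having to remember the guard in every condition is exactly the burden the requested feature would remove.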

To sum up, here are the behaviors I'd like:

test <- c(1, NA, -1)
test <- ideal_label(test, 
                    labels = structure(c(0, 1, -1), names = c("No", "Yes", "PNR")), 
                    missing.values = c(NA, -1))
  
as.character(test) # "Yes" NA "PNR"
as.numeric(test) # 1 NA -1 (or NaN or NA for test[3])
test %in% 1 # TRUE FALSE FALSE
test == 1 # TRUE NA FALSE
test < 1 # FALSE NA TRUE (or NA for test[3])
test %in% "Yes" # TRUE FALSE FALSE
test == "Yes" # TRUE NA FALSE
is.na(test) # FALSE TRUE FALSE
is.missing(test) # FALSE TRUE TRUE 
lm(c(T, T, T) ~ test)$rank # 2 (i.e., keeps missing values that are not NA)
df <- data.frame(test = test, true = c(T, T, T))
lm(true ~ test, data = df)$rank

The package memisc respects almost all these behaviors, but not the last one. It would be great to find a package that respects all of them.
