Allowing two kinds of missing values (kept and not in regressions) and handling labelled variables both as numeric and character #770

bixiou · 2025-02-28T17:13:12Z

This is a feature request, cf. here.

gorcha · 2025-03-02T07:29:55Z

Hi @bixiou, can you please explain in detail what feature you are requesting, including why it would be helpful and examples of the desired behaviour? This should all be included in the issue rather than linking to an external discussion.

Thanks!

bixiou · 2025-03-02T15:48:10Z

Sure, let me explain.

When analyzing surveys, I'd like to be able to treat some missing values as special but not as NA. For example, take a variable "vote" taking the values "Left", "Center/Right", "Far right" and (the missing value) "PNR" (People Not Responding). NA would be reserved for a lack of answer (e.g. the question was not asked in this survey branch) rather than for a PNR answer. The original comment illustrated this with a plot of answers in which missing values could be treated differently (the image is not preserved here).

I also have to deal with data that is sometimes best handled as numeric and sometimes as character. For example, take a 5-point Likert scale variable (named "scale") ranging from "Strongly oppose" to "Strongly agree", which we can recode from -2 to +2. It's handier to write the condition "scale > 0" than "scale %in% c('Somewhat agree', 'Strongly agree')", especially given that the scale labels can change depending on the question. At the same time, I'd like both ways of writing this condition to work.
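To make the two ways of writing the condition concrete, here is a minimal base-R sketch that keeps the numeric codes and derives the labels by lookup. It is only an illustration of the equivalence, not a proposed implementation; the labels "Somewhat oppose" and "Indifferent" and the sample values are assumptions.

```r
# Assumed label set for the 5-point scale (only the two "agree" labels and
# the end points are named in the text above; the others are made up).
labels <- c("Strongly oppose" = -2, "Somewhat oppose" = -1, "Indifferent" = 0,
            "Somewhat agree" = 1, "Strongly agree" = 2)

# Made-up answers stored as numeric codes.
scale <- c(2, -1, 0, 1, -2)

# Character view of the same answers, obtained by matching codes to labels.
scale_chr <- names(labels)[match(scale, labels)]

# The same condition written numerically and with labels.
by_number <- scale > 0
by_label  <- scale_chr %in% c("Somewhat agree", "Strongly agree")

identical(by_number, by_label)  # TRUE
```

The request is for a single object on which both conditions work directly, without carrying the two vectors around separately.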

Finally, in regressions, I would like to keep missing values by default. For example, lm((scale > 0) ~ vote, data = df) should include vote: PNR as a category. If a labelled variable like "vote" is attached numerical values (say from -1 for Left to +1 for Far right), I think it should be treated as categorical by default in regressions.
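As a point of comparison, the regression behavior described above can be approximated in base R by storing "PNR" as an ordinary factor level and reserving NA for questions that were not asked. This is a sketch under that assumption; the data values and the column name `agree` are made up.

```r
# Made-up survey data: "PNR" is a regular level, NA means "question not asked".
df <- data.frame(
  scale = c(2, -1, 1, -2, 1, 0, 2, -1, 1),
  vote  = c("Left", "Left", "Center/Right", "Center/Right",
            "Far right", "Far right", "PNR", "PNR", NA)
)
df$vote  <- factor(df$vote, levels = c("Left", "Center/Right", "Far right", "PNR"))
df$agree <- as.numeric(df$scale > 0)

# With the default na.action (na.omit), the NA row is dropped but PNR is kept.
fit <- lm(agree ~ vote, data = df)

coef_names <- names(coef(fit))  # includes "votePNR"
n_used     <- nobs(fit)         # 8 of the 9 rows
```

The drawback of this workaround is that the variable is no longer numeric, so a condition such as `vote > 0` stops working, which is why a dedicated labelled type would be useful.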

Ideally, missing values should be treated as a character while not preventing meaningful numerical comparison. For example, if there is the missing value "PNR" in scale[1], we'd ideally have both scale[1] < 0 and scale[1] >= 0 returning FALSE. However, I suspect this is not possible. We'd then have to set a numerical value to missing values (e.g. -0.1) and be careful when manipulating conditions, e.g. remember to use scale < 0 & !is.missing(scale) instead of scale < 0.
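The fallback described above can be sketched in base R as follows. The helper `is_missing_code()` is hypothetical (it stands in for an `is.missing()`-style function), and the code -0.1 for "PNR" is the example value from the paragraph above.

```r
# Example numeric code assigned to the "PNR" missing value.
pnr_code <- -0.1

# Made-up answers: element 2 is PNR, element 4 is a true NA.
scale <- c(-2, pnr_code, 1, NA, 2)

# Hypothetical helper: TRUE for NA and for any declared missing code.
is_missing_code <- function(x, codes = pnr_code) {
  is.na(x) | x %in% codes
}

# Naive condition: PNR is wrongly counted as a negative answer.
naive <- scale < 0                            # TRUE TRUE FALSE NA FALSE

# Guarded condition: missing values of either kind are excluded.
safe <- scale < 0 & !is_missing_code(scale)   # TRUE FALSE FALSE FALSE FALSE
```

The guard has to be repeated in every condition, which is the error-prone part that built-in support would remove.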

To sum up, here are the behaviors I'd wish:

# ideal_label() is the wished-for constructor; it does not exist yet.
test <- c(1, NA, -1)
test <- ideal_label(test,
                    labels = structure(c(0, 1, -1), names = c("No", "Yes", "PNR")),
                    missing.values = c(NA, -1))

as.character(test) # "Yes" NA "PNR"
as.numeric(test)   # 1 NA -1 (or NaN or NA for test[3])
test %in% 1        # TRUE FALSE FALSE
test == 1          # TRUE NA FALSE
test < 1           # FALSE NA TRUE (or NA for test[3])
test %in% "Yes"    # TRUE FALSE FALSE
test == "Yes"      # TRUE NA FALSE
is.na(test)        # FALSE TRUE FALSE
is.missing(test)   # FALSE TRUE TRUE
lm(c(TRUE, TRUE, TRUE) ~ test)$rank # 2 (i.e., keeps missing values that are not NA)
df <- data.frame(test = test, true = c(TRUE, TRUE, TRUE))
lm(true ~ test, data = df)$rank

The package memisc respects almost all of these behaviors, but not the last one. It would be great to find a package that respects all of them.
