(TRUE or FALSE) dummy variables in interactions when using lm()

Question

When I estimate a model that has an interaction between two variables that are not included in the model as stand-alone variables, and one of these variables is a dummy (class "logical") variable, R "reverses the sense" of the dummy variable. That is, it reports the coefficient estimate on the interaction term for when the dummy is FALSE, not for when it is TRUE. Here is an example:

    data(trees)
    trees$dHeight <- trees$Height > 76
    trees$cGirth  <- trees$Girth - mean(trees$Girth)
    lm(Volume ~ Girth + Girth:dHeight, data = trees)   # estimate is for Girth:dHeightTRUE
    lm(Volume ~ Girth + cGirth:dHeight, data = trees)  # estimate is for cGirth:dHeightFALSE

Why does the regression on the last line give an estimate for the interaction in which dHeight is FALSE and not TRUE? (I would like R to report the estimate for when dHeight is TRUE.)

This is not a big problem, but I would like to better understand why R does what it does. I know about relevel() and contrasts(), but I don't see how they would make a difference here.

r

user697473 Jul 19 '13 at 2:23

2 answers

R is not flipping the sign of the dummy variable as such. When you fit ~ Girth + cGirth:dHeight, the cGirth terms are aliased with the intercept term: the two interaction columns add up to cGirth, which is just Girth minus a constant. You can see what happens by removing the intercept:

    > lm(Volume ~ -1 + Girth + cGirth:dHeight, data = trees)

    Call:
    lm(formula = Volume ~ -1 + Girth + cGirth:dHeight, data = trees)

    Coefficients:
                  Girth  cGirth:dHeightFALSE   cGirth:dHeightTRUE
                  2.199                2.053                3.339
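The aliasing can also be checked directly on the design matrix. The following is a minimal sketch that rebuilds the question's variables and inspects the matrix with base R's model.matrix() and qr(); the object names X and dep are only illustrative.

```r
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)

# Design matrix for the formula from the question (intercept included)
X <- model.matrix(Volume ~ Girth + cGirth:dHeight, data = trees)
colnames(X)

# Number of columns versus numerical rank: a lower rank means that
# one column is a linear combination of the others
ncol(X)
qr(X)$rank

# The two interaction columns add up to cGirth, i.e. to
# Girth - mean(Girth) * (Intercept); dep should be zero up to rounding
dep <- X[, "cGirth:dHeightFALSE"] + X[, "cGirth:dHeightTRUE"] -
  (X[, "Girth"] - mean(trees$Girth) * X[, "(Intercept)"])
max(abs(dep))
```

Because of this exact linear dependency, one of the four coefficients cannot be estimated once the intercept is in the model.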

Hong ooi Jul 19 '13 at 2:35

mnel · Accepted Answer · 2013-07-19T03:35:12+0000

dHeight is a logical. Inside the model it is coerced to a factor, and the levels are sorted lexicographically (i.e., F before T).

As noted in @Hongooi's answer, you cannot estimate all 4 parameters, so R fits the terms in the order they appear (FALSE before TRUE).

If you want to force R to report TRUE, you can fit the model with !dHeight:
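The level ordering is easy to inspect. The sketch below also tries one way of changing that order: a factor column, here called fHeight (a name made up for this illustration), with TRUE listed as the first level. With that ordering the TRUE interaction column comes first, so it should be the one that gets estimated.

```r
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)

# A logical coerced to a factor has its levels sorted: "FALSE" then "TRUE"
levels(factor(trees$dHeight))

# Hypothetical helper column: same information, but TRUE is the first level
trees$fHeight <- factor(trees$dHeight, levels = c(TRUE, FALSE))

# The interaction columns now appear in the order TRUE, FALSE,
# so the aliased (NA) coefficient is the FALSE one
fit_f <- lm(Volume ~ Girth + cGirth:fHeight, data = trees)
coef(fit_f)
```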

 lm(formula = Volume ~ Girth + cGirth:!dHeight, data = trees)

Note that !dHeightFALSE is equivalent to dHeightTRUE.

You will also notice that in this simple case you are simply switching the sign of the coefficient, so it doesn't really matter which model you fit.

EDIT: A BETTER APPROACH

R can recognize that cGirth and Girth are collinear, so we can fit the following, remembering that a/b expands to a + a:b:
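The sign switch can be verified numerically. This sketch uses a precomputed column notHeight (a hypothetical name, equal to !dHeight) instead of writing ! inside the formula, and compares coefficients by position so that it does not depend on how R labels them.

```r
data(trees)
trees$dHeight   <- trees$Height > 76
trees$cGirth    <- trees$Girth - mean(trees$Girth)
trees$notHeight <- !trees$dHeight   # hypothetical helper, same as !dHeight

fit_false <- lm(Volume ~ Girth + cGirth:dHeight,   data = trees)
fit_true  <- lm(Volume ~ Girth + cGirth:notHeight, data = trees)

# Third coefficient in each fit is the estimated interaction term
# (the fourth is the aliased one and is NA)
b_false <- unname(coef(fit_false)[3])
b_true  <- unname(coef(fit_true)[3])

# Same magnitude, opposite sign
all.equal(b_false, -b_true)

# Both parameterizations describe the same fitted model
all.equal(unname(fitted(fit_false)), unname(fitted(fit_true)))
```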

    lm(formula = Volume ~ Girth + cGirth/dHeight, data = trees)

    Coefficients:
           (Intercept)               Girth              cGirth  cGirth:dHeightTRUE
               -27.198               4.251                  NA               1.286

This gives coefficients with easily interpretable names, and R sensibly does not return a coefficient for cGirth.

R can tell that Girth and cGirth are collinear when they are both "main effect" (stand-alone) terms.

There is no way R could tell, when fitting Girth + cGirth:dHeight, that cGirth and Girth are collinear and that, given dHeight is logical, we want cGirth:dHeightTRUE to be the reported coefficient (you could write your own parser to do that if you wanted).

Another approach that is consistent with the desired model, and has no collinear terms, is to use

 lm(formula = Volume ~ Girth + I(cGirth*dHeight), data = trees)

which coerces dHeight to numeric (TRUE becomes 1).

Edit, to address the OP's point:
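As a quick check, the two formulations can be compared side by side. This is a small sketch using the variables defined in the question; fit_nested and fit_I are illustrative names.

```r
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)

fit_nested <- lm(Volume ~ Girth + cGirth/dHeight,      data = trees)
fit_I      <- lm(Volume ~ Girth + I(cGirth * dHeight), data = trees)

coef(fit_nested)   # includes an NA entry for cGirth
coef(fit_I)        # three coefficients, none aliased

# The interaction estimate is the same under both formulations
all.equal(unname(coef(fit_nested)["cGirth:dHeightTRUE"]),
          unname(coef(fit_I)[3]))
```

The I() version avoids the NA row entirely, at the cost of a less readable coefficient name.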

When you fit ~Girth + Girth:dHeight

what you are saying is that there is a main effect for Girth plus an adjustment for dHeight. R treats the first level of a factor as the reference level. The slope for dHeight == FALSE is just the coefficient for Girth, and then you have the adjustment for dHeight == TRUE (Girth:dHeightTRUE).

When you fit ~Girth + cGirth:dHeight, R does not have a mind-reading parser that can tell that, because cGirth and Girth are collinear, you want the interaction coded relative to a reference level, with only the dHeightTRUE term reported.

Imagine you had a variable that was completely unrelated to Girth, e.g.,
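One way to see this reading of the coefficients is to compare against a model that estimates one Girth slope per level of dHeight. The sketch below assumes the question's variables; fit_adj and fit_sep are illustrative names.

```r
data(trees)
trees$dHeight <- trees$Height > 76

# Reference-level coding: slope for FALSE plus an adjustment for TRUE
fit_adj <- lm(Volume ~ Girth + Girth:dHeight, data = trees)

# Separate slopes: one Girth slope for each level of dHeight
fit_sep <- lm(Volume ~ Girth:dHeight, data = trees)

b_adj <- coef(fit_adj)
b_sep <- coef(fit_sep)

# Slope when dHeight is FALSE is the plain Girth coefficient
all.equal(unname(b_adj["Girth"]), unname(b_sep["Girth:dHeightFALSE"]))

# Slope when dHeight is TRUE is Girth plus the adjustment
all.equal(unname(b_adj["Girth"] + b_adj["Girth:dHeightTRUE"]),
          unname(b_sep["Girth:dHeightTRUE"]))
```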

    set.seed(1)
    trees$cG <- runif(nrow(trees))

Then, when you fit Girth + cG:dHeight, you get 4 estimated parameters:

    lm(formula = Volume ~ Girth + cG:dHeight, data = trees)

    Call:
    lm(formula = Volume ~ Girth + cG:dHeight, data = trees)

    Coefficients:
        (Intercept)            Girth  cG:dHeightFALSE   cG:dHeightTRUE
          -31.79645          4.79435         -5.92168          0.09578

This is sensible.

When R processes Girth + cGirth:dHeight, it expands the formula (starting from the first level of the factor) to 1 + Girth + cGirth:dHeightFALSE + cGirth:dHeightTRUE, works out that it cannot estimate all 4 parameters, and estimates the first 3.
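The dropped parameter shows up as NA in the coefficient vector. A minimal sketch, again assuming the variables from the question:

```r
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)

fit <- lm(Volume ~ Girth + cGirth:dHeight, data = trees)

# Four coefficient names, in the order the terms were expanded
names(coef(fit))

# Only the first three are estimated; the last one is NA
is.na(coef(fit))

# lm() also records the numerical rank of the design matrix
fit$rank
```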