Forcing a reference category in a logistic model in R

Question

Using R, I am fitting a logistic model and need to include an interaction term as follows, where A is categorical and B is continuous.

 Y ~ A + B + normalize(B):A

My problem is that when I do this, the reference category is not the same as in

 Y ~ A + B + A:B

which makes it difficult to compare models. I'm sure there is a way to keep the reference category the same every time, but I can't seem to find a direct answer.

To illustrate, my data looks like this:

 income                     ndvi  sga
 30,000$ - 49,999$  -0,141177617  0
 30,000$ - 49,999$  -0,170513257  0
 >80,000$           -0,054939323  1
 >80,000$            -0,14724104  0
 >80,000$           -0,207678157  0
 missing            -0,229890869  1
 50,000$ - 79,999$   0,245063253  0
 50,000$ - 79,999$   0,127565529  0
 15,000$ - 29,999$  -0,145778357  0
 15,000$ - 29,999$  -0,170944338  0
 30,000$ - 49,999$  -0,121060635  0
 30,000$ - 49,999$  -0,245407291  0
 missing            -0,156427532  0
 >80,000$            0,033541238  0

The outputs are reproduced below. The first set of results is for the model Y ~ A * B, and the second for Y ~ A + B + A:normalize(B).

                                Estimate Std. Error z value Pr(>|z|)
 (Intercept)                    -2.72175    0.29806  -9.132   <2e-16 ***
 ndvi                            2.78106    2.16531   1.284   0.1990
 income15,000$ - 29,999$        -0.53539    0.46211  -1.159   0.2466
 income30,000$ - 49,999$        -0.68254    0.39479  -1.729   0.0838 .
 income50,000$ - 79,999$        -0.13429    0.33097  -0.406   0.6849
 income>80,000$                 -0.56692    0.35144  -1.613   0.1067
 incomemissing                  -0.85257    0.47230  -1.805   0.0711 .
 ndvi:income15,000$ - 29,999$   -2.27703    3.25433  -0.700   0.4841
 ndvi:income30,000$ - 49,999$   -3.76892    2.86099  -1.317   0.1877
 ndvi:income50,000$ - 79,999$   -0.07278    2.46483  -0.030   0.9764
 ndvi:income>80,000$            -3.32489    2.62000  -1.269   0.2044
 ndvi:incomemissing             -3.98098    3.35447  -1.187   0.2353

                                           Estimate Std. Error z value Pr(>|z|)
 (Intercept)                               -3.07421    0.30680 -10.020   <2e-16 ***
 ndvi                                      -1.19992    2.56201  -0.468    0.640
 income15,000$ - 29,999$                   -0.33379    0.29920  -1.116    0.265
 income30,000$ - 49,999$                   -0.34885    0.26666  -1.308    0.191
 income50,000$ - 79,999$                   -0.12784    0.25124  -0.509    0.611
 income>80,000$                            -0.27255    0.27288  -0.999    0.318
 incomemissing                             -0.50010    0.31299  -1.598    0.110
 income<15,000$:normalize(ndvi)             0.40515    0.34139   1.187    0.235
 income15,000$ - 29,999$:normalize(ndvi)    0.17341    0.35933   0.483    0.629
 income30,000$ - 49,999$:normalize(ndvi)    0.02158    0.32280   0.067    0.947
 income50,000$ - 79,999$:normalize(ndvi)    0.39774    0.28697   1.386    0.166
 income>80,000$:normalize(ndvi)             0.06677    0.30087   0.222    0.824
 incomemissing:normalize(ndvi)                   NA         NA      NA       NA

So, in the first model, the category "income <15,000$" is the reference category, while in the second model something else happens that I still do not quite understand.

r interaction

Dominic Comtois Sep 17 '12 at 14:28

1 answer

Dj · Answer 1 · 2014-06-08T13:08:33+0000

Say we would like to fit a regression in which a continuous variable is demeaned before it is interacted with a dummy variable.

We tried to implement it using model.matrix, but that takes some manual intervention, as the results below illustrate. Is there a better way to implement it? To be more specific, let's say that X_1 is a continuous variable and X_2 is a dummy.

Basically, the interpretation of the interaction term stays the same, except that the main effect of X_2 is evaluated at the mean of X_1 (see Early draft of this article).
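A short derivation makes this concrete. It is only a sketch under the assumption of a linear predictor with one continuous X_1 and one dummy X_2; the coefficient symbols \beta and \gamma are introduced here for illustration and do not come from the original post.

```latex
\begin{align*}
\text{uncentered:}\quad
  y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon \\
\text{centered interaction:}\quad
  y &= \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2
       + \gamma_3 (X_1 - \bar{X}_1) X_2 + \varepsilon \\[4pt]
\text{since}\quad
  X_1 X_2 &= (X_1 - \bar{X}_1) X_2 + \bar{X}_1 X_2 , \\
\text{it follows that}\quad
  \gamma_0 &= \beta_0, \qquad
  \gamma_1 = \beta_1, \qquad
  \gamma_3 = \beta_3, \qquad
  \gamma_2 = \beta_2 + \beta_3 \bar{X}_1 .
\end{align*}
```

Under these assumptions only the coefficient of X_2 changes: it becomes the effect of X_2 at X_1 = \bar{X}_1, while the intercept, the slope of X_1 and the interaction coefficient stay the same.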

Here is some data to illustrate my point (this is not a glm, but the same method applies to glm):

 library(car)
 str(Prestige)

 # some data cleaning
 Prestige <- Prestige[!is.na(Prestige$type), ]

 # interaction the usual way
 lm1 <- lm(income ~ education + type + education:type, data = Prestige)
 summary(lm1)

 # interacting with demeaned education
 Prestige$education_ <- Prestige$education - mean(Prestige$education)

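With the regular formula method, things don't work out the way we want. Because education_ is not in the model as a main effect (education is), the formula does not set a reference level for type inside the interaction: every level gets its own column, and one of them then comes out as redundant.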
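A minimal sketch of this behaviour, using the built-in iris data as a stand-in. The names A, B and Bc are assumptions for illustration and are not from the original post. When the variable inside the interaction differs from the one used as the main effect, there is no lower-order term to anchor a reference level, so one interaction column is created per factor level:

```r
# Stand-in data: iris replaces the data used in the post (an assumption).
d <- iris
d$A  <- d$Species                 # categorical predictor with 3 levels
d$B  <- d$Petal.Length            # continuous predictor
d$Bc <- d$B - mean(d$B)           # centered copy of B

# B is also a main effect, so A gets a reference level inside A:B.
cols_plain <- colnames(model.matrix(~ A + B + A:B, data = d))

# Bc is not a main effect, so A is expanded to one column per level in A:Bc.
cols_centered <- colnames(model.matrix(~ A + B + A:Bc, data = d))

print(cols_plain)
print(cols_centered)

# The extra column is a linear combination of the others,
# so lm() reports NA for one of the interaction coefficients.
fit <- lm(Sepal.Length ~ A + B + A:Bc, data = d)
print(coef(fit))
```

The NA for the last level in this sketch has the same shape as the NA rows in the outputs shown in this thread.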

 lm2 <- lm(income ~ education + type + education_:type, data = Prestige)
 summary(lm2)

 # using model.matrix to shape the interaction
 # ([, -1] drops the first column, i.e. the first level of type)
 cusInt <- model.matrix(~ -1 + education_:type, data = Prestige)[, -1]
 colnames(cusInt)
 lm3 <- lm(income ~ education + type + cusInt, data = Prestige)
 summary(lm3)

 compareCoefs(lm1, lm3, lm2)

The results are here (columns 1, 2 and 3 correspond to lm1, lm3 and lm2, in the order passed to compareCoefs):

                             Est. 1   SE 1  Est. 2   SE 2  Est. 3   SE 3
 (Intercept)                  -1865   3682   -1865   3682    4280   8392
 education                      866    436     866    436     297    770
 typeprof                     -3068   7192    -542   1950    -542   1950
 typewc                        3646   9274   -2498   1377   -2498   1377
 education:typeprof             234    617
 education:typewc              -569    885
 cusInteducation_:typeprof                     234    617
 cusInteducation_:typewc                      -569    885
 typebc:education_                                            569    885
 typeprof:education_                                          803    885
 typewc:education_

So basically, when using model.matrix, we need to intervene manually to set the reference level. In addition, the prefix cusInt appears before the variable name, so formatting the results is quite tedious when there are many tables to compare.
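One possible alternative, sketched on the built-in iris data rather than Prestige (the names x and xc are assumptions for illustration): use the centered variable for the main effect as well as the interaction. The formula interface then picks the reference level itself and adds no prefix. Unlike lm3, this also moves the intercept, because the main effect is centered too.

```r
# Stand-in data: iris replaces Prestige so the sketch runs in base R.
d <- iris
d$x  <- d$Petal.Length          # continuous predictor (plays the role of education)
d$xc <- d$x - mean(d$x)         # centered copy (plays the role of education_)

# Uncentered interaction, the usual way.
fit_raw <- lm(Sepal.Length ~ x * Species, data = d)

# Centered variable used for both the main effect and the interaction.
fit_ctr <- lm(Sepal.Length ~ xc * Species, data = d)

# Interaction slopes are unchanged by centering.
int_raw <- coef(fit_raw)[c("x:Speciesversicolor", "x:Speciesvirginica")]
int_ctr <- coef(fit_ctr)[c("xc:Speciesversicolor", "xc:Speciesvirginica")]
print(cbind(raw = unname(int_raw), centered = unname(int_ctr)))

# Factor main effects now describe group differences at the mean of x.
shifted <- coef(fit_raw)["Speciesversicolor"] +
  coef(fit_raw)["x:Speciesversicolor"] * mean(d$x)
print(c(from_raw = unname(shifted),
        centered = unname(coef(fit_ctr)["Speciesversicolor"])))
```

Whether this fits depends on whether a shifted intercept is acceptable. If the uncentered main effect must stay in the model, the model.matrix route shown earlier remains the option described in this answer.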
