Consider the following code:
library(ISLR)
row_list <- structure(list(`1` = 1:40,   `2` = 41:79,   `3` = 80:118,  `4` = 119:157,
                           `5` = 158:196, `6` = 197:235, `7` = 236:274, `8` = 275:313,
                           `9` = 314:352, `10` = 353:392),
                      .Names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
test <- row_list[[1]]
train <- setdiff(unlist(row_list), row_list[[1]])
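(For what it's worth, the same split can be built more compactly, and the sizes line up with the 351 total degrees of freedom reported by glm() below:)

```r
# Equivalent, compact construction of the same 10 folds of row indices
folds <- split(1:392, rep(1:10, times = c(40, rep(39, 8), 40)))
test  <- folds[[1]]                    # rows 1..40, same as row_list[[1]]
train <- setdiff(unlist(folds), test)  # rows 41..392
c(length(test), length(train))         # 40 held out, 352 used for fitting
```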
Output 1:
> glm(mpg ~ poly(horsepower, 1), data = Auto, subset = train)
Call: glm(formula = mpg ~ poly(horsepower, 1), data = Auto, subset = train)
Coefficients:
        (Intercept)  poly(horsepower, 1)
              23.37              -133.05
Degrees of Freedom: 351 Total (i.e. Null); 350 Residual
Null Deviance: 21460
Residual Deviance: 8421 AIC: 2122
Output 2:
> glm(mpg ~ poly(horsepower, 1), data = Auto[train,])
Call: glm(formula = mpg ~ poly(horsepower, 1), data = Auto[train, ])
Coefficients:
        (Intercept)  poly(horsepower, 1)
              24.05              -114.19
Degrees of Freedom: 351 Total (i.e. Null); 350 Residual
Null Deviance: 21460
Residual Deviance: 8421 AIC: 2122
As you can see above, the values of (Intercept) and poly(horsepower, 1) differ between the two outputs. Why is this?

At least for lm(), Introduction to Statistical Learning suggests (see page 191) that row indices can be passed in the subset argument. Does this also apply to glm(), or is subset just not being used correctly here?
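In case it narrows things down: with a raw predictor instead of poly(), the two calls appear to agree. Here is a stand-in check using only the built-in mtcars data (mtcars and the arbitrary 1:20 split are just placeholders for Auto and train):

```r
# Stand-in check with built-in data: does subset = <row indices>
# match manual row indexing for glm() when no poly() is involved?
train_rows <- 1:20                                    # arbitrary split
fit_subset <- glm(mpg ~ hp, data = mtcars, subset = train_rows)
fit_index  <- glm(mpg ~ hp, data = mtcars[train_rows, ])
all.equal(coef(fit_subset), coef(fit_index))          # the two fits agree
```

So the coefficient difference above seems tied to how poly() interacts with subset rather than to glm() itself, though I may be misreading it.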