I fitted a linear regression in R using a categorical attribute and noticed that I do not get a coefficient for each level of the factor.
Please see my code below: I have 5 factor levels for states, but I see only 4 coefficient values.
> states = c("WA","TE","GE","LA","SF")
> population = c(0.5,0.2,0.6,0.7,0.9)
> df = data.frame(states,population)
> df
  states population
1     WA        0.5
2     TE        0.2
3     GE        0.6
4     LA        0.7
5     SF        0.9
> states=NULL
> population=NULL
> lm(formula=population~states,data=df)

Call:
lm(formula = population ~ states, data = df)

Coefficients:
(Intercept)     statesLA     statesSF     statesTE     statesWA
        0.6          0.1          0.3         -0.4         -0.1
I also tried with a larger dataset, built by repeatedly doubling the rows as follows, but I still see the same behavior:
for(i in 1:10) { df = rbind(df,df) }
EDIT: Thanks to the answers from eipi10, MrFlick and the economy. Now I understand that one of the levels is used as the control (reference) level. But when I get new test data whose state value is "GE", what do I substitute into the equation y = m1x1 + m2x2 + ... + c?
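To make the substitution concrete, here is a minimal sketch that rebuilds the example data and compares a by-hand calculation with `predict()`. It assumes R's default treatment contrasts, under which the alphabetically first level ("GE") is the reference; the object names `fit`, `b`, and `new_data` are only illustrative.

```r
# Rebuild the example data from the question.
df <- data.frame(
  states     = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

fit <- lm(population ~ states, data = df)
b   <- coef(fit)

# "GE" is the reference level, so every dummy variable is 0 and the
# prediction is the intercept alone.
pred_GE_manual <- unname(b["(Intercept)"])

# For any other level, exactly one dummy is 1, so add that level's
# coefficient to the intercept.
pred_WA_manual <- unname(b["(Intercept)"] + b["statesWA"])

# predict() performs the same substitution automatically.
new_data <- data.frame(
  states = factor(c("GE", "WA"), levels = levels(df$states))
)
pred_auto <- unname(predict(fit, newdata = new_data))

print(c(GE = pred_GE_manual, WA = pred_WA_manual))
print(pred_auto)
```

With this data the prediction for "GE" is the intercept, 0.6, and the prediction for "WA" is 0.6 + (-0.1) = 0.5, which matches the original population values.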
I also tried to expand the data so that each factor level gets its own 0/1 column, but then one of the columns gets NA as its coefficient. If I have new test data whose state is "WA", how can I get a population value? What do I use as its coefficient?
> df1
  population GE MI TE WA
1          1  0  0  0  1
2          2  1  0  0  0
3          2  0  0  1  0
4          1  0  1  0  0
> lm(formula = population ~ (GE + MI + TE + WA), data = df1)

Call:
lm(formula = population ~ (GE + MI + TE + WA), data = df1)

Coefficients:
(Intercept)           GE           MI           TE           WA
          1            1            0            1           NA
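A likely reason for the NA is that the four indicator columns sum to 1 in every row, which duplicates the intercept column, so `lm()` cannot estimate all five terms. The sketch below reproduces this and also shows one possible workaround, dropping the intercept with `0 +`, so that each state gets its own coefficient. The object names (`fit_with_intercept`, `fit_no_intercept`, `new_wa`) are only illustrative.

```r
# Indicator-column data from the question (one 0/1 column per state).
df1 <- data.frame(
  population = c(1, 2, 2, 1),
  GE = c(0, 1, 0, 0),
  MI = c(0, 0, 0, 1),
  TE = c(0, 0, 1, 0),
  WA = c(1, 0, 0, 0)
)

# With an intercept, GE + MI + TE + WA equals the intercept column,
# so the last indicator is redundant and lm() reports NA for it.
fit_with_intercept <- lm(population ~ GE + MI + TE + WA, data = df1)

# Dropping the intercept removes the redundancy: each coefficient is
# then the fitted value for that state.
fit_no_intercept <- lm(population ~ 0 + GE + MI + TE + WA, data = df1)

print(coef(fit_with_intercept))
print(coef(fit_no_intercept))

# Prediction for a new "WA" row. The NA coefficient is treated as an
# omitted term, so the result is the intercept alone. Depending on the
# R version, predict() may warn about the rank-deficient fit.
new_wa  <- data.frame(GE = 0, MI = 0, TE = 0, WA = 1)
pred_wa <- unname(predict(fit_with_intercept, newdata = new_wa))
print(pred_wa)
```

In both fits the prediction for "WA" is 1: with the intercept, "WA" acts as the baseline absorbed into the intercept, and without it, "WA" has its own coefficient of 1.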