All factor levels in the model matrix in R

I have a data.frame consisting of numeric and factor variables, as shown below.

 # stringsAsFactors=TRUE keeps Fourth and Fifth as factors on R >= 4.0
 testFrame <- data.frame(First=sample(1:10, 20, replace=T),
                         Second=sample(1:20, 20, replace=T),
                         Third=sample(1:10, 20, replace=T),
                         Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
                         Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4),
                         stringsAsFactors=TRUE)

I want to build a matrix that assigns dummy variables to the factors and leaves the numeric variables alone.

 model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame) 

As expected when running lm, this leaves out one level of each factor as the reference level. However, I want to build a matrix with a dummy/indicator variable for every level of all the factors. I am building this matrix for glmnet, so I am not worried about multicollinearity.

Is there a way to have model.matrix create a dummy for every level of a factor?

+60
matrix r model
Dec 30 '10 at 6:18
10 answers

You need to reset the contrasts for the factor variables:

 model.matrix(~ Fourth + Fifth, data=testFrame,
     contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=F),
                        Fifth=contrasts(testFrame$Fifth, contrasts=F)))

or, with a little less typing and without the proper level names:

 model.matrix(~ Fourth + Fifth, data=testFrame, contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)), Fifth=diag(nlevels(testFrame$Fifth)))) 
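
The level names that the diag() version loses can be put back by attaching the factor levels as dimnames. The following is a minimal sketch, assuming the question's testFrame with Fourth and Fifth stored as factors (stringsAsFactors = TRUE is an assumption for R >= 4.0); named_diag is a hypothetical helper, not part of any package.

```r
# Recreate the question's data; the seed is only there for reproducibility.
set.seed(1)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: needed on R >= 4.0
)

# Hypothetical helper: identity contrast matrix labelled with the factor's levels.
named_diag <- function(f) {
  m <- diag(nlevels(f))
  dimnames(m) <- list(levels(f), levels(f))
  m
}

X_named <- model.matrix(
  ~ Fourth + Fifth, data = testFrame,
  contrasts.arg = list(Fourth = named_diag(testFrame$Fourth),
                       Fifth  = named_diag(testFrame$Fifth))
)
colnames(X_named)
```

With the dimnames in place, the columns should be labelled by level (FourthAlice, FifthEdward, and so on) rather than by position.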
+48
Dec 30 '10 at 9:38

(Trying to redeem myself...) In response to Jared's comment on @Fabians answer about automating it, note that all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrast matrix from it. So we can use lapply() to run contrasts() on each factor in our data set, e.g. for the testFrame example provided:

 > lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
 $Fourth
         Alice Bob Charlie David
 Alice       1   0       0     0
 Bob         0   1       0     0
 Charlie     0   0       1     0
 David       0   0       0     1

 $Fifth
         Edward Frank Georgia Hank Isaac
 Edward       1     0       0    0     0
 Frank        0     1       0    0     0
 Georgia      0     0       1    0     0
 Hank         0     0       0    1     0
 Isaac        0     0       0    0     1

Which slots nicely into @fabians answer:

 model.matrix(~ ., data=testFrame, contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE)) 
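
To avoid hard-coding the column positions 4:5, the factor columns can be detected first. This is a sketch under the assumption that every factor in the data frame should get full indicator coding; is_fac, full_contrasts and X_full are illustrative names, and stringsAsFactors = TRUE is assumed so that the example also runs on R >= 4.0.

```r
# Recreate the question's data; the seed is only there for reproducibility.
set.seed(42)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: needed on R >= 4.0
)

# Logical vector marking which columns are factors.
is_fac <- vapply(testFrame, is.factor, logical(1))

# Named list with one full (identity) contrast matrix per factor column.
full_contrasts <- lapply(testFrame[, is_fac, drop = FALSE],
                         contrasts, contrasts = FALSE)

# Intercept + 3 numeric columns + 4 levels of Fourth + 5 levels of Fifth.
X_full <- model.matrix(~ ., data = testFrame, contrasts.arg = full_contrasts)
colnames(X_full)
```

The drop = FALSE keeps the subset a data frame even when there is only one factor column, so lapply() still returns a named list.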
+61
Dec 31 '10 at 9:26

caret implements a nice function, dummyVars, that achieves this in two lines:

 library(caret)
 dmy <- dummyVars(" ~ .", data = testFrame)
 testFrame2 <- data.frame(predict(dmy, newdata = testFrame))

Checking the final columns:

 colnames(testFrame2)
 "First" "Second" "Third" "Fourth.Alice" "Fourth.Bob" "Fourth.Charlie" "Fourth.David" "Fifth.Edward" "Fifth.Frank" "Fifth.Georgia" "Fifth.Hank" "Fifth.Isaac"

The nice part is that you get the original data frame back with the dummy variables added and the original factor columns used for the transformation removed.

Additional information: http://amunategui.imtqy.com/dummyVar-Walkthrough/

+14
Dec 28 '16 at 18:08

dummyVars from caret can also be used. http://caret.r-forge.r-project.org/preprocess.html

+11
Mar 14 '13 at 2:29

OK. Just reading the above and putting it all together. Suppose you want a matrix, e.g. "X.factors", that is multiplied by your coefficient vector to get your linear predictor. There are still a couple of extra steps:

 X.factors = model.matrix( ~ ., data=X, contrasts.arg = lapply(data.frame(X[,sapply(data.frame(X), is.factor)]), contrasts, contrasts = FALSE)) 

(Note that you need to turn X[*] back into a data frame in case you have only one factor column.)

Then say you get something like this:

 attr(X.factors,"assign")
 [1] 0 1 **2** 2 **3** 3 3 **4** 4 4 5 6 7 8 9 10 #emphasis added

We want to get rid of the **'d reference levels of each factor:

 att = attr(X.factors,"assign")
 factor.columns = unique(att[duplicated(att)])
 unwanted.columns = match(factor.columns,att)
 X.factors = X.factors[,-unwanted.columns]
 X.factors = (data.matrix(X.factors))
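
The same steps can be tried on the question's testFrame. This is a minimal sketch, assuming Fourth and Fifth are factors (stringsAsFactors = TRUE is an assumption for R >= 4.0); X.reduced is an illustrative name. The duplicated() trick treats any term that spans more than one column as a factor, which holds for this data but is an assumption in general.

```r
# Recreate the question's data; the seed is only there for reproducibility.
set.seed(7)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: needed on R >= 4.0
)

# Full indicator coding for both factors.
full_contrasts <- lapply(testFrame[, 4:5], contrasts, contrasts = FALSE)
X.factors <- model.matrix(~ ., data = testFrame, contrasts.arg = full_contrasts)

# "assign" maps each column to its term: 0 is the intercept, 1..5 the variables.
att <- attr(X.factors, "assign")

# Terms that occupy more than one column (here: the two factors).
factor.columns <- unique(att[duplicated(att)])

# First column of each such term, i.e. the first level of each factor.
unwanted.columns <- match(factor.columns, att)

X.reduced <- X.factors[, -unwanted.columns]
colnames(X.reduced)
```

For this data the remaining columns match those of the default model.matrix(~ ., data = testFrame), since the dropped columns are exactly the reference levels.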
+2
Jul 24 '14 at 18:05

Using the CatEncoders R Package

 library(CatEncoders)
 testFrame <- data.frame(First=sample(1:10, 20, replace=T),
                         Second=sample(1:20, 20, replace=T),
                         Third=sample(1:10, 20, replace=T),
                         Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
                         Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))
 fit <- OneHotEncoder.fit(testFrame)
 z <- transform(fit,testFrame,sparse=TRUE)  # give the sparse output
 z <- transform(fit,testFrame,sparse=FALSE) # give the dense output
+2
Sep 14 '16 at 1:56

I am currently learning the Lasso with glmnet::cv.glmnet(), model.matrix() and Matrix::sparse.model.matrix() (for a large matrix, model.matrix() can take a long time, as the author of glmnet suggests).

Just sharing a tidy way to get the same result as the answers from @fabians and @Gavin. Meanwhile, @asdf123 also introduced another package, library('CatEncoders').

 > require('useful')
 > # always use all levels
 > build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
 >
 > # just use all levels for Fourth
 > build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))

Source: R for Everyone: Advanced Analytics and Graphics (p. 273)

+2
Jan 15 '17 at 17:59

tidyverse answer:

 library(dplyr)
 library(tidyr)
 result <- testFrame %>%
     mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>%
     mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")

This gives the desired result (the same as @Gavin Simpson's answer):

 > head(result, 6)
   First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
 1     1      5     4           0         0             1           0           0          1            0         0          0
 2     1     14    10           0         0             0           1           0          0            1         0          0
 3     2      2     9           0         1             0           0           1          0            0         0          0
 4     2      5     4           0         0             0           1           0          1            0         0          0
 5     2     13     5           0         0             1           0           1          0            0         0          0
 6     2     15     7           1         0             0           0           1          0            0         0          0
+1
Feb 16 '19 at 9:43
 model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame) 

or

 model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame) 

should be the easiest way.

0
Sep 04 '15 at 8:05

A stats package answer:

 new_tr <- model.matrix(~.+0,data = testFrame) 

Adding +0 (or -1) to a model formula (e.g., in lm()) in R suppresses the intercept.
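
It may be worth checking what removing the intercept does when there is more than one factor. In the sketch below (assuming the question's testFrame with Fourth and Fifth stored as factors; X_noint is an illustrative name), only the first factor in the formula appears to receive a column for every level, while later factors still lose their reference level.

```r
# Recreate the question's data; the seed is only there for reproducibility.
set.seed(3)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: needed on R >= 4.0
)

# No intercept: Fourth gets all 4 levels, Fifth keeps only 4 of its 5.
X_noint <- model.matrix(~ . + 0, data = testFrame)
colnames(X_noint)

# Levels of Fifth that have no column of their own.
setdiff(paste0("Fifth", levels(testFrame$Fifth)), colnames(X_noint))
```

So for a matrix with an indicator for every level of every factor, the contrasts.arg approach from the earlier answers is still needed.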

Have a look, please

0
Jul 27 '19 at 18:42


