Directly creating a dummy variable set in a sparse matrix in R

Question

Suppose you have a data frame with a large number of columns (1000 factors, each with 15 levels). You want to create a dummy variable data set, but since it would be very sparse, you would like to keep the dummies in a sparse matrix format.

My data set is quite large, so the fewer steps the better. I know how to do this in two steps (build the dummies, then convert them to a sparse matrix), but I could not work out how to create the sparse matrix directly from the original data set, i.e. in one step instead of two. Any ideas?

EDIT: Some comments asked for further elaboration, so here it is:

Here X is my original data set with 1000 columns and 50,000 records, and each column has 15 levels.

Step 1: Create dummy variables from the source dataset using code:

# Number of levels per column (15 here, as stated above)
l <- 15
# Creating dummy data set with empty values
dummified <- matrix(NA, nrow(X), l * ncol(X))
# Adding values to this data set for each column and each level within columns
for (i in 1:ncol(X)) {
  colFactr <- factor(X[, i], exclude = NULL)
  for (j in 1:l) {
    lvl <- levels(colFactr)[j]
    indx <- ((i - 1) * l) + j
    dummified[, indx] <- ifelse(colFactr == lvl, 1, 0)
  }
}

Step 2: transform this huge matrix into a sparse matrix with this code:

# Convert the dense matrix to a sparse one
sparse.dummified <- Matrix(dummified, sparse = TRUE)

But this approach still creates the large intermediate matrix, which takes a lot of time and memory, so I am asking for a direct method (if there is one).
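For scale, here is a rough back-of-the-envelope sketch of the sizes involved. It assumes the dense matrix ends up holding 8-byte doubles, and that the sparse result is a dgCMatrix storing one double plus one 4-byte row index per non-zero entry, plus one 4-byte pointer per column. The variable names are purely illustrative, and real objects carry some extra overhead.

```r
# Rough memory estimate (illustrative names; assumes doubles and dgCMatrix storage)
n_rows   <- 50000
n_cols   <- 1000
n_levels <- 15

# Dense dummy matrix: one 8-byte double per cell
dense_bytes <- n_rows * (n_cols * n_levels) * 8

# Sparse version: one non-zero per record and original column
nnz <- n_rows * n_cols
sparse_bytes <- nnz * (8 + 4) + (n_cols * n_levels + 1) * 4

dense_bytes / 1e9   # about 6 GB
sparse_bytes / 1e9  # about 0.6 GB
```

So under these assumptions the intermediate dense matrix is roughly ten times larger than the sparse result it is converted into.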

+4

matrix r sparse-matrix r-factor

agondiken Apr 12 '14 at 20:50

3 answers

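This can be done with Matrix::sparse.model.matrix, although keeping a column for every level of every variable takes a little extra work.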

Generate input:

set.seed(123)
n <- 6
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"),      n, TRUE))

If you do not need a column for every level of every variable, you can just do:

library(Matrix)
sparse.model.matrix(~.-1,data=df)

If you need all columns:

fList <- lapply(names(df),reformulate,intercept=FALSE)
mList <- lapply(fList,sparse.model.matrix,data=df)
# Note: cBind() is deprecated in recent versions of Matrix, where plain cbind() works
do.call(cBind,mList)
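To make the difference between the two calls concrete, here is a minimal sketch on a made-up six-row data frame. It assumes a recent R and Matrix where cbind() handles sparse matrices (R >= 3.2.0), and it passes stringsAsFactors = TRUE so the columns are factors regardless of R version. The names reduced and full are only for illustration.

```r
library(Matrix)

# Toy data (hypothetical); stringsAsFactors = TRUE forces factor columns
df <- data.frame(x = c("A", "C", "B", "C", "C", "A"),
                 y = c("E", "E", "E", "D", "E", "D"),
                 stringsAsFactors = TRUE)

# One formula for everything: only the first factor keeps all of its levels
reduced <- sparse.model.matrix(~ . - 1, data = df)

# One formula per variable: every factor keeps all of its levels
fList <- lapply(names(df), reformulate, intercept = FALSE)
mList <- lapply(fList, sparse.model.matrix, data = df)
full  <- do.call(cbind, mList)  # assumes cbind() handles sparse matrices

ncol(reduced)  # 4: xA, xB, xC, yE
ncol(full)     # 5: xA, xB, xC, yD, yE
```

With the single formula, y loses its first level (D) because it is coded with the default treatment contrasts; the per-variable formulas avoid that.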

+3

Ben Bolker Apr 13 '14 at 14:53

Adding my comment as an answer, as it seems a little faster and more scalable (at least on my PC: Ubuntu, R 3.1.0).

Matrix(model.matrix(~ -1 + . , data=df, 
         contrasts.arg = lapply(df, contrasts, contrasts=FALSE)),sparse=TRUE)
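As a small illustration of what the contrasts.arg part does (a sketch on a made-up three-level factor f, assuming R's default treatment contrasts for unordered factors): the default coding drops the first level, whereas contrasts = FALSE returns an identity matrix with one column per level.

```r
# Hypothetical three-level factor, for illustration only
f <- factor(c("a", "b", "c"))

default_coding <- contrasts(f)                     # 3 x 2: level "a" is the baseline
full_coding    <- contrasts(f, contrasts = FALSE)  # 3 x 3 identity: one column per level

dim(default_coding)
dim(full_coding)
```

Passing the full coding for every column is what makes model.matrix emit a dummy for every level of every factor.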

Big data test

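library(Matrix)
library(microbenchmark)

set.seed(123)
# stringsAsFactors = TRUE keeps the columns as factors on R >= 4.0.0
df <- data.frame(replicate(200, sample(letters[1:15], 100, TRUE)),
                 stringsAsFactors = TRUE)

# Ben Bolker's approach: one sparse model matrix per column
ben <- function() {
  fList <- lapply(names(df), reformulate, intercept = FALSE)
  do.call(cBind, lapply(fList, sparse.model.matrix, data = df))
}

# flodel's first approach: one sparse matrix per column
flodel <- function() {
  do.call(cBind, lapply(df, function(j) sparseMatrix(i = seq_along(j),
                                                     j = as.integer(j), x = 1)))
}

# flodel's updated approach (from the accepted answer): a single sparseMatrix call
flodel2 <- function() {
  n <- nrow(df)
  nlevels <- sapply(df, nlevels)
  i <- rep(seq_len(n), ncol(df))
  j <- unlist(lapply(df, as.integer)) +
       rep(cumsum(c(0, head(nlevels, -1))), each = n)
  sparseMatrix(i = i, j = j, x = 1)
}

# This answer's approach: dense model matrix converted to sparse
user <- function() {
  Matrix(model.matrix(~ -1 + . , data = df,
                      contrasts.arg = lapply(df, contrasts, contrasts = FALSE)),
         sparse = TRUE)
}

microbenchmark(flodel(), flodel2(), ben(), user(), times = 10)
# Unit: milliseconds
#       expr        min         lq    median         uq        max neval
#   flodel() 1002.79714 1005.70631 1100.1874 1179.84403 1192.56583    10
#  flodel2()   16.62579   17.37707   18.5620   18.72137   19.19888    10
#      ben() 1602.80193 1612.45177 1616.6684 1703.16246 1709.90557    10
#     user()   96.80575   97.37132  101.9881  104.00750  195.87784    10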

Edit: added flodel's update (flodel2) to the benchmark. It is the clear winner, very nice.

+3

user20650 Apr 13 '14 at 15:54

flodel · Accepted Answer · 2014-04-13T11:45:54+0000

Thanks for clarifying your question. Try this.

Here is sample data with two columns having three and two levels respectively:

set.seed(123)
n <- 6
# stringsAsFactors = TRUE keeps x and y as factors on R >= 4.0.0
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"),      n, TRUE),
                 stringsAsFactors = TRUE)
#   x y
# 1 A E
# 2 C E
# 3 B E
# 4 C D
# 5 C E
# 6 A D

library(Matrix)
spm <- lapply(df, function(j)sparseMatrix(i = seq_along(j),
                                          j = as.integer(j), x = 1))
# Note: cBind() is deprecated in recent versions of Matrix, where plain cbind() works
do.call(cBind, spm)
# 6 x 5 sparse Matrix of class "dgCMatrix"
#               
# [1,] 1 . . . 1
# [2,] . . 1 . 1
# [3,] . 1 . . 1
# [4,] . . 1 1 .
# [5,] . . 1 . 1
# [6,] 1 . . 1 .

Edit: @user20650 pointed out that do.call(cBind, ...) was sluggish or not working with big data. So here is a more complex but faster and more efficient approach:

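n <- nrow(df)
nlevels <- sapply(df, nlevels)            # number of levels per column
i <- rep(seq_len(n), ncol(df))            # row index, repeated once per column
# Column index: level code within each factor, shifted by the total number
# of levels in all preceding columns
j <- unlist(lapply(df, as.integer)) +
     rep(cumsum(c(0, head(nlevels, -1))), each = n)
x <- 1
sparseMatrix(i = i, j = j, x = x)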
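As a worked illustration of the offset logic, here is a minimal sketch on the six-row sample data, written out by hand. The explicit dims argument and the "variable.level" column names are my own additions, not part of the approach above; dims guards against a trailing level that never occurs in the data, and the names nlev, offset and col_names are only for illustration.

```r
library(Matrix)

# Sample data written out by hand; stringsAsFactors = TRUE forces factor columns
df <- data.frame(x = c("A", "C", "B", "C", "C", "A"),
                 y = c("E", "E", "E", "D", "E", "D"),
                 stringsAsFactors = TRUE)

n      <- nrow(df)
nlev   <- sapply(df, nlevels)               # x has 3 levels, y has 2
offset <- cumsum(c(0, head(nlev, -1)))      # x block starts at 0, y block at 3

i <- rep(seq_len(n), ncol(df))
j <- unlist(lapply(df, as.integer), use.names = FALSE) + rep(offset, each = n)

# Column labels such as "x.A" (the naming scheme is an assumption)
col_names <- unlist(lapply(names(df),
                           function(v) paste(v, levels(df[[v]]), sep = ".")))

m <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(n, sum(nlev)),
                  dimnames = list(NULL, col_names))
m
```

Each row has exactly one non-zero entry per original column, so every row of m sums to ncol(df).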
