How to create a binary inventory matrix for each row? (R)

I have a 9-column data frame that is an inventory of factors. A row can use all 9 columns (i.e. that row contains 9 "things"), but most rows do not (most contain 3-4). The column positions also carry no meaning: item 200 appearing in column 1 or in column 3 means the same thing. I would like to turn this into a binary matrix, with one row per original row and one column per factor.

Example (reduced to 4 columns to illustrate the point):

R1 3 4 5  8
R2 4 6 7  NA
R3 1 5 NA NA
R4 2 6 8  9

which needs to become:

   1 2 3 4 5 6 7 8 9
r1 0 0 1 1 1 0 0 1 0
r2 0 0 0 1 0 1 1 0 0
r3 1 0 0 0 1 0 0 0 0
r4 0 1 0 0 0 1 0 1 1

I have looked at writeBin/readBin, k-means clustering (which is what I would ultimately like to do, but I need to get rid of the NAs first), fuzzy clustering, and tag clustering. I'm just somewhat lost about which direction to go.

I tried writing two loops that walk the matrix by row and column and store 0s and 1s in a new matrix, but I think I ran into indexing problems.

You guys are the best. Thanks!

+6
3 answers

Here's a base R solution:

# Read in the data and convert to matrix form
df <- read.table(text = "
3 4 5 8
4 6 7 NA
1 5 NA NA
2 6 8 9", header = FALSE)
m <- as.matrix(df)

# Create a two-column matrix containing the row/column indices of the
# cells to be filled with ones
id <- cbind(rowid = as.vector(t(row(m))),
            colid = as.vector(t(m)))
id <- id[complete.cases(id), ]

# Create the output matrix
out <- matrix(0, nrow = nrow(m), ncol = max(m, na.rm = TRUE))
out[id] <- 1

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,]    0    0    1    1    1    0    0    1    0
# [2,]    0    0    0    1    0    1    1    0    0
# [3,]    1    0    0    0    1    0    0    0    0
# [4,]    0    1    0    0    0    1    0    1    1
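A small follow-up, not part of the original answer: out[id] <- 1 works because indexing a matrix with a two-column matrix selects cells by (row, column) pairs. If labelled output like in the question is wanted, dimnames can be added; the label format below is just an assumption.

# Optional labelling (assumed format) so the result matches the question's layout
dimnames(out) <- list(paste0("r", seq_len(nrow(out))), seq_len(ncol(out)))
out
#    1 2 3 4 5 6 7 8 9
# r1 0 0 1 1 1 0 0 1 0
# r2 0 0 0 1 0 1 1 0 0
# r3 1 0 0 0 1 0 0 0 0
# r4 0 1 0 0 0 1 0 1 1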
+5

This should do the trick:

# The Incantation
options(stringsAsFactors = FALSE)
library(reshape2)

# Your example data
dat <- data.frame(id   = c("R1", "R2", "R3", "R4"),
                  col1 = c(3, 4, 1, 2),
                  col2 = c(4, 6, 5, 6),
                  col3 = c(5, 7, NA, 7),
                  col4 = c(8, NA, NA, 9))

# Melt it down
dat.melt <- melt(dat, id.var = "id")

# Cast it back out, with the row IDs remaining the row IDs and the values
# of the columns becoming the columns themselves. dcast() defaults to
# length to aggregate records, which means the values in this data frame
# are a count of how many times each value occurs in each row's columns
# (which, based on this data, is capped at just once).
dat.cast <- dcast(dat.melt, id ~ value)

Result:

dat.cast
#   id 1 2 3 4 5 6 7 8 9 NA
# 1 R1 0 0 1 1 1 0 0 1 0  0
# 2 R2 0 0 0 1 0 1 1 0 0  1
# 3 R3 1 0 0 0 1 0 0 0 0  2
# 4 R4 0 1 0 0 0 1 1 0 1  0
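One possible tweak, not part of the original answer: the cast above keeps an extra NA column that counts the missing cells. Dropping the NA rows from the molten data first, and requesting length explicitly, gives a pure 0/1 table; this reuses dat.melt from the code above, and the result name is arbitrary.

# Drop NA values before casting so no "NA" column appears; fun.aggregate
# must now be given explicitly, because the filtered data has no duplicate
# id/value pairs and dcast() would otherwise cast the raw values instead.
dat.cast01 <- dcast(dat.melt[!is.na(dat.melt$value), ], id ~ value,
                    fun.aggregate = length)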
+3

These are all great answers. I thought I would post my initial solution as well — I wrote it, and a friend of mine modified it to actually work.

# Assumed set-up (the original answer set these parameters beforehand):
# x is the input matrix of item codes (with NAs), y is a zero matrix
# with one column per possible item.
y <- matrix(0, nrow = nrow(x), ncol = max(x, na.rm = TRUE))

for (i in seq(nrow(x)))
  for (j in seq(ncol(x)))
    if (!is.na(x[i, j])) {
      y[i, x[i, j]] <- 1
    }

The two for loops do the job once x and y have been set up beforehand, but they are incredibly slow. The other solutions here seem much faster!
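The question mentions that k-means clustering is the eventual goal once the NAs are out of the way. As a minimal, hedged sketch (not part of any of the answers), the binary matrix can be passed straight to kmeans(); out here is the matrix from the base R answer above, and centers = 2 is an arbitrary placeholder.

set.seed(1)                       # for reproducibility
fit <- kmeans(out, centers = 2)   # centers = 2 is a placeholder, not a recommendation
fit$cluster                       # cluster assignment for each row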

+1

Source: https://habr.com/ru/post/906723/

