Transcode data based on a single column

Question

Transcode data based on a single column

I have a data frame 5845 * 1095 (rows * columns) that looks like this:

9 286593 CC/CC/AA/A 9 334337 AA/AG/AA/A 9 390512 CC/CC/CC/C c <- c("9", "286593", "C", "C/C", "C/A", "A/A") d <- c("9", "334337", "A", "A/A", "G/A", "A/A") e <- c("9", "390512", "C", "C/C", "C/C", "C/C") dat <- data.frame(rbind(c,d,e))

I want the values in the third column to be used to change the columns to the right, so if (for each row 1) column 3 is “C”, then column 4 rotates from “C / C” to “0” because it has one and the same letter. One letter match is "1" (may be the first or second letter), and the match of the letters is not equal to "2".

 9 286593 C 0 1 2 9 334337 A 0 1 0 9 390512 C 0 0 0 c <- c("9", "286593", "C", "0", "1", "2") d <- c("9", "334337", "A", "0", " 1", "0") e <- c("9", "390512", "C", "0", "0", "0") dat <- data.frame(rbind(c,d,e))

I am interested to see the best way to do this, as I want to give up the habit of using nested For loops in R.

+6

r dataframe

cianius Jun 24 '13 at 14:52

source share

6 answers

Here is one approach:

 FUN <- function(x) { a <- strsplit(as.character(unlist(x[-1])), "/") b <- sapply(a, function(y) sum(y %in% as.character(unlist(x[1])))) 2 - b } dat[4:6] <- t(apply(dat[, 3:6], 1, FUN)) ## > dat ## X1 X2 X3 X4 X5 X6 ## c 9 286593 C 0 1 2 ## d 9 334337 A 0 1 0 ## e 9 390512 C 0 0 0

+4

Tyler rinker Jun 24 '13 at 15:25

source share

Here is one way: apply :

 out <- apply(dat[, -(1:2)], 1, function(x) 2 - grepl(x[1], x[-1]) - x[-1] %in% paste(x[1], x[1], sep="/")) cbind(dat[, (1:3)], t(out))

+4

Arun Jun 24 '13 at 15:31

source share

This solution is not very efficient:

 dat <- cbind(dat[,-(4:6)], t(sapply(seq_len(nrow(dat)),function(i){ res <- dat[i,] res[,4:6] <- lapply(res[,4:6],function(x) 2-sum(gregexpr(res[,3],x)[[1]]>0)) }))) # X1 X2 X3 X4 X5 X6 #c 9 286593 C 0 1 2 #d 9 334337 A 0 1 0 #e 9 390512 C 0 0 0

+3

Rolling Jun 24 '13 at 15:26

source share

Ugly, but it works!

 fff<-apply(dat[,4:ncol(dat)],2,substr,1,1)!=dat[,3] ggg<-apply(dat[,4:ncol(dat)],2,substr,3,3)!=dat[,3] final<-fff+ggg cbind(dat,final) X1 X2 X3 X4 X5 X6 X4 X5 X6 c 9 286593 CC/CC/AA/A 0 1 2 d 9 334337 AA/AG/AA/A 0 1 0 e 9 390512 CC/CC/CC/C 0 0 0

+2

Fabio marroni Jun 24 '13 at 15:28

source share

Another contribution to R-golf:

 cbind(dat[, 1:3], apply(dat[, -(1:3)], 2, function(x) { 2 - (dat[[3]] == gsub('..$', '', x)) - (dat[[3]] == gsub('^..', '', x)) }))

+2

eddi Jun 24 '13 at 20:21

source share

Thilo · Accepted Answer · 2013-06-24T15:24:24+0000

First your data:

 c <- c("9", "286593", "C", "C/C", "C/A", "A/A") # Note: In your original data, you had a space in "G/A", which I did remove. # If this was no mistake, we would also have to deal with the space. d <- c("9", "334337", "A", "A/A", "G/A", "A/A") e <- c("9", "390512", "C", "C/C", "C/C", "C/C") dat <- data.frame(rbind(c,d,e))

Now we will generate a vector containing all possible letters.

 values <- c("A", "C", "G", "T") dat$X3 <- factor(dat$X3, levels=values) # This way we just ensure that it will later on be possible to compare the reference values to our generated data. # Generate all possible combinations of two letters combinations <- expand.grid(f=values, s=values) combinations <- cbind(combinations, v=with(combinations, paste(f, s, sep='/')))

The main function finds the correct columns of each combination of each column, and then compares it with reference column 3.

 compare <- function(col, val) { m <- match(col, combinations$v) 2 - (combinations$f[m] == val) - (combinations$s[m] == val) }

Finally, we use apply to run the function for all columns that need to be changed. You probably want to change 6 to the actual number of columns.

 dat[,4:6] <- apply(dat[,4:6], 2, compare, val=dat[,3])

Please note that this solution, compared to other solutions, still does not use string comparison, but an approach based solely on factors. It would be interesting to see which one works best.

Edit

I just did benchmarking:

  test replications elapsed relative user.self sys.self user.child sys.child 1 arun 1000000 2.881 1.116 2.864 0.024 0 0 2 fabio 1000000 2.593 1.005 2.558 0.030 0 0 3 roland 1000000 2.727 1.057 2.687 0.048 0 0 5 thilo 1000000 2.581 1.000 2.540 0.036 0 0 4 tyler 1000000 2.663 1.032 2.626 0.042 0 0

which leaves my version a little faster. However, the difference means almost nothing, so you are probably well versed in each approach. And honestly, I did not compare the part where I add additional levels of factors. Doing this will also probably print my version.

Transcode data based on a single column

Edit

More articles: