Transcode data based on a single column

I have a data frame 5845 * 1095 (rows * columns) that looks like this:

9 286593 CC/CC/AA/A 9 334337 AA/AG/AA/A 9 390512 CC/CC/CC/C c <- c("9", "286593", "C", "C/C", "C/A", "A/A") d <- c("9", "334337", "A", "A/A", "G/A", "A/A") e <- c("9", "390512", "C", "C/C", "C/C", "C/C") dat <- data.frame(rbind(c,d,e)) 

I want the values ​​in the third column to be used to change the columns to the right, so if (for each row 1) column 3 is β€œC”, then column 4 rotates from β€œC / C” to β€œ0” because it has one and the same letter. One letter match is "1" (may be the first or second letter), and the match of the letters is not equal to "2".

 9 286593 C 0 1 2 9 334337 A 0 1 0 9 390512 C 0 0 0 c <- c("9", "286593", "C", "0", "1", "2") d <- c("9", "334337", "A", "0", " 1", "0") e <- c("9", "390512", "C", "0", "0", "0") dat <- data.frame(rbind(c,d,e)) 

I am interested to see the best way to do this, as I want to give up the habit of using nested For loops in R.

+6
source share
6 answers

First your data:

 c <- c("9", "286593", "C", "C/C", "C/A", "A/A") # Note: In your original data, you had a space in "G/A", which I did remove. # If this was no mistake, we would also have to deal with the space. d <- c("9", "334337", "A", "A/A", "G/A", "A/A") e <- c("9", "390512", "C", "C/C", "C/C", "C/C") dat <- data.frame(rbind(c,d,e)) 

Now we will generate a vector containing all possible letters.

 values <- c("A", "C", "G", "T") dat$X3 <- factor(dat$X3, levels=values) # This way we just ensure that it will later on be possible to compare the reference values to our generated data. # Generate all possible combinations of two letters combinations <- expand.grid(f=values, s=values) combinations <- cbind(combinations, v=with(combinations, paste(f, s, sep='/'))) 

The main function finds the correct columns of each combination of each column, and then compares it with reference column 3.

 compare <- function(col, val) { m <- match(col, combinations$v) 2 - (combinations$f[m] == val) - (combinations$s[m] == val) } 

Finally, we use apply to run the function for all columns that need to be changed. You probably want to change 6 to the actual number of columns.

 dat[,4:6] <- apply(dat[,4:6], 2, compare, val=dat[,3]) 

Please note that this solution, compared to other solutions, still does not use string comparison, but an approach based solely on factors. It would be interesting to see which one works best.

Edit

I just did benchmarking:

  test replications elapsed relative user.self sys.self user.child sys.child 1 arun 1000000 2.881 1.116 2.864 0.024 0 0 2 fabio 1000000 2.593 1.005 2.558 0.030 0 0 3 roland 1000000 2.727 1.057 2.687 0.048 0 0 5 thilo 1000000 2.581 1.000 2.540 0.036 0 0 4 tyler 1000000 2.663 1.032 2.626 0.042 0 0 

which leaves my version a little faster. However, the difference means almost nothing, so you are probably well versed in each approach. And honestly, I did not compare the part where I add additional levels of factors. Doing this will also probably print my version.

+5
source

Here is one approach:

 FUN <- function(x) { a <- strsplit(as.character(unlist(x[-1])), "/") b <- sapply(a, function(y) sum(y %in% as.character(unlist(x[1])))) 2 - b } dat[4:6] <- t(apply(dat[, 3:6], 1, FUN)) ## > dat ## X1 X2 X3 X4 X5 X6 ## c 9 286593 C 0 1 2 ## d 9 334337 A 0 1 0 ## e 9 390512 C 0 0 0 
+4
source

Here is one way: apply :

 out <- apply(dat[, -(1:2)], 1, function(x) 2 - grepl(x[1], x[-1]) - x[-1] %in% paste(x[1], x[1], sep="/")) cbind(dat[, (1:3)], t(out)) 
+4
source

This solution is not very efficient:

 dat <- cbind(dat[,-(4:6)], t(sapply(seq_len(nrow(dat)),function(i){ res <- dat[i,] res[,4:6] <- lapply(res[,4:6],function(x) 2-sum(gregexpr(res[,3],x)[[1]]>0)) }))) # X1 X2 X3 X4 X5 X6 #c 9 286593 C 0 1 2 #d 9 334337 A 0 1 0 #e 9 390512 C 0 0 0 
+3
source

Ugly, but it works!

 fff<-apply(dat[,4:ncol(dat)],2,substr,1,1)!=dat[,3] ggg<-apply(dat[,4:ncol(dat)],2,substr,3,3)!=dat[,3] final<-fff+ggg cbind(dat,final) X1 X2 X3 X4 X5 X6 X4 X5 X6 c 9 286593 CC/CC/AA/A 0 1 2 d 9 334337 AA/AG/AA/A 0 1 0 e 9 390512 CC/CC/CC/C 0 0 0 
+2
source

Another contribution to R-golf:

 cbind(dat[, 1:3], apply(dat[, -(1:3)], 2, function(x) { 2 - (dat[[3]] == gsub('..$', '', x)) - (dat[[3]] == gsub('^..', '', x)) })) 
+2
source

Source: https://habr.com/ru/post/947996/


All Articles