Get the value of the last non-empty column for each row

Take the data from this sample:

    df <- data.frame(a_1 = c("Apple", "Grapes", "Melon", "Peach"),
                     a_2 = c("Nuts", "Kiwi", "Lime", "Honey"),
                     a_3 = c("Plum", "Apple", NA, NA),
                     a_4 = c("Cucumber", NA, NA, NA))
    df
    #      a_1   a_2   a_3      a_4
    # 1  Apple  Nuts  Plum Cucumber
    # 2 Grapes  Kiwi Apple     <NA>
    # 3  Melon  Lime  <NA>     <NA>
    # 4  Peach Honey  <NA>     <NA>

Basically, I want to run grep on the last non-NA value in each row. So my x in grep("pattern", x) should be:

 Cucumber Apple Lime Honey 

I have an integer vector that tells me, for each row, which a_N is the last:

 numcol <- rowSums(!is.na(df[,grep("(^a_)\\d", colnames(df))])) 

So far I have tried something like the following, in combination with ave(), apply() and dplyr:

 grepl("pattern",df[,sprintf("a_%i",numcol)]) 

However, I cannot get it to work. Keep in mind that my dataset is very large, so I was hoping for a vectorized solution or maybe dplyr. Help would be greatly appreciated.

Edit: Thank you, this is a really good solution. My thinking was too complicated. (The regex is tied to my more specific data.)

3 answers

There is no need for a regular expression. Just use apply + tail + na.omit:

    > apply(mydf, 1, function(x) tail(na.omit(x), 1))
    [1] "Cucumber" "Apple"    "Lime"     "Honey"
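To tie this back to the question, the resulting vector can be passed straight to grepl. A minimal sketch, assuming the columns are stored as character (stringsAsFactors = FALSE) and using "Apple" as a placeholder pattern that is not from the original post:

```r
# Sample data from the question, stored as character columns (assumption).
mydf <- data.frame(
  a_1 = c("Apple", "Grapes", "Melon", "Peach"),
  a_2 = c("Nuts", "Kiwi", "Lime", "Honey"),
  a_3 = c("Plum", "Apple", NA, NA),
  a_4 = c("Cucumber", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Last non-NA value in each row; unname() drops any names apply() attaches.
last_val <- unname(apply(mydf, 1, function(x) tail(na.omit(x), 1)))

# "Apple" is a placeholder pattern used only for illustration.
hits <- grepl("Apple", last_val)
```

Note that a row consisting only of NA makes the inner function return a zero-length result, in which case apply returns a list rather than a character vector, so such rows would need separate handling.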

I don't know how it compares in terms of speed, but you can also use a combination of "data.table" and "reshape2", for example:

    library(data.table)
    library(reshape2)
    na.omit(melt(as.data.table(mydf, keep.rownames = TRUE),
                 id.vars = "rn"))[, value[.N], by = rn]
    #    rn       V1
    # 1:  1 Cucumber
    # 2:  2    Apple
    # 3:  3     Lime
    # 4:  4    Honey

Or even better:

    melt(as.data.table(mydf, keep.rownames = TRUE),
         id.vars = "rn", na.rm = TRUE)[, value[.N], by = rn]
    #    rn       V1
    # 1:  1 Cucumber
    # 2:  2    Apple
    # 3:  3     Lime
    # 4:  4    Honey

This will be much faster. On a dataset of 800 thousand rows, the apply approach took ~50 seconds, and the data.table approach took about 2.5 seconds.


Another alternative that can be pretty quick:

    DF[cbind(seq_len(nrow(DF)), max.col(!is.na(DF), "last"))]
    #[1] "Cucumber" "Apple"    "Lime"     "Honey"

Where "DF" is:

    DF = structure(list(
        a_1 = structure(1:4, .Label = c("Apple", "Grapes", "Melon", "Peach"), class = "factor"),
        a_2 = structure(c(4L, 2L, 3L, 1L), .Label = c("Honey", "Kiwi", "Lime", "Nuts"), class = "factor"),
        a_3 = structure(c(2L, 1L, NA, NA), .Label = c("Apple", "Plum"), class = "factor"),
        a_4 = structure(c(1L, NA, NA, NA), .Label = "Cucumber", class = "factor")),
        .Names = c("a_1", "a_2", "a_3", "a_4"),
        row.names = c(NA, -4L), class = "data.frame")
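The one-liner above packs three steps together. A step-by-step sketch, assuming character columns rather than factors; the intermediate names not_na, last_col and last_val are introduced here for illustration only:

```r
# Same sample data, stored as character columns (assumption).
DF <- data.frame(
  a_1 = c("Apple", "Grapes", "Melon", "Peach"),
  a_2 = c("Nuts", "Kiwi", "Lime", "Honey"),
  a_3 = c("Plum", "Apple", NA, NA),
  a_4 = c("Cucumber", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a cell holds a value.
not_na <- !is.na(DF)

# Per row, the column index of the maximum. TRUE counts as larger than
# FALSE, and ties.method = "last" picks the right-most TRUE.
last_col <- max.col(not_na, ties.method = "last")

# Indexing with a two-column (row, column) matrix extracts one cell per row.
last_val <- as.matrix(DF)[cbind(seq_len(nrow(DF)), last_col)]
```

Because everything here operates on whole matrices, no per-row R function is called. For a row that contains only NA, all entries tie, so "last" should select the final column and the extracted value is NA.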

You can also try the following (df1 is the dataset):

    indx <- which(!is.na(df1), arr.ind = TRUE)
    df1[cbind(1:nrow(df1), tapply(indx[, 2], indx[, 1], FUN = max))]
    #[1] "Cucumber" "Apple"    "Lime"     "Honey"

Source: https://habr.com/ru/post/1202710/

