How to filter rows between two specific values

I need help filtering the following data frame (this is a simple example):

    mx = as.data.frame(cbind(c("-", "-", "-", "-", "mutation", "+", "+", "+", "+"),
                             c(F, T, F, F, F, F, T, F, T)))
    colnames(mx) = c("mutation", "distance")
    mx
      mutation distance
    1        -    FALSE
    2        -     TRUE
    3        -    FALSE
    4        -    FALSE
    5 mutation    FALSE
    6        +    FALSE
    7        +     TRUE
    8        +    FALSE
    9        +     TRUE

I need to filter based on the second column (distance) so that it looks like this:

      mutation distance
    3        -    FALSE
    4        -    FALSE
    5 mutation    FALSE
    6        +    FALSE

I need to drop every row up to and including the last TRUE that comes before the row where mx$mutation == "mutation" (here, rows 1 and 2), and every row from the first TRUE that comes after that row onwards (here, row 7 onwards).

3 answers

We can create a grouping variable by taking the cumulative sum of the logical column ("distance"), and then filter within each group, keeping the non-TRUE rows of the group that contains "mutation":

    library(dplyr)
    mx %>%
      group_by(grp = cumsum(distance)) %>%
      filter(any(mutation == "mutation") & !distance) %>%
      ungroup %>%
      select(-grp)
    # A tibble: 4 x 2
    #   mutation distance
    #   <fctr>   <lgl>
    # 1 -        F
    # 2 -        F
    # 3 mutation F
    # 4 +        F
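To see why the grouping works, it may help to look at the group ids that cumsum() produces. The following is a minimal base-R sketch of the same idea, assuming `distance` is a logical column; the names `target` and `keep` are introduced here for illustration only and are not part of the answer above.

```r
mx <- data.frame(
  mutation = c("-", "-", "-", "-", "mutation", "+", "+", "+", "+"),
  distance = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Each TRUE starts a new group, so the rows between two TRUEs share an id.
grp <- cumsum(mx$distance)
grp
# 0 1 1 1 1 1 2 2 3

# Group id(s) of the row(s) where mutation == "mutation".
target <- grp[mx$mutation == "mutation"]

# Keep rows in that group, dropping the TRUE row that opens it.
keep <- grp %in% target & !mx$distance
mx[keep, ]
```

Row 2 opens group 1 with a TRUE, so it belongs to the same group as the mutation row but is excluded by the `!mx$distance` condition, which leaves rows 3 to 6.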

NOTE: We can create the data frame directly with data.frame. There is no need for cbind, and using it harms the column types: cbind converts its arguments to a matrix, and a matrix can hold only one type.
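The type coercion mentioned in the note can be checked with a short sketch. The variable names `m`, `df_cbind` and `df_direct` are made up for this illustration.

```r
# cbind() on a character vector and a logical vector builds a character matrix.
m <- cbind(c("-", "mutation", "+"), c(FALSE, FALSE, TRUE))
typeof(m)
# "character"

# Converting that matrix to a data frame does not bring the logicals back.
df_cbind <- as.data.frame(m)
is.logical(df_cbind$V2)
# FALSE

# data.frame() keeps each column's own type.
df_direct <- data.frame(
  mutation = c("-", "mutation", "+"),
  distance = c(FALSE, FALSE, TRUE)
)
is.logical(df_direct$distance)
# TRUE
```

This is why cumsum(distance) works on a data frame built with data.frame but not on the one built through cbind.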

data

    mx = data.frame(mutation = c("-", "-", "-", "-", "mutation", "+", "+", "+", "+"),
                    distance = c(F, T, F, F, F, F, T, F, T))

Hope this helps!

    # sample data (note that I have added a few extra rows at the end)
    mx = data.frame(mutation = c("-", "-", "-", "-", "mutation", "+", "+", "+", "+", "-", "mutation", "+", "+"),
                    distance = c(F, T, F, F, F, F, T, F, T, F, F, F, T))

    mutation_idx <- which(mx$mutation == "mutation")
    distance_T_idx <- which(mx$distance == T)
    interval_idx <- findInterval(mutation_idx, distance_T_idx)
    rows <- lapply(interval_idx, function(x) ((distance_T_idx[x] + 1):(distance_T_idx[x + 1] - 1)))
    mx[unlist(rows), ]

Output:

       mutation distance
    3         -    FALSE
    4         -    FALSE
    5  mutation    FALSE
    6         +    FALSE
    10        -    FALSE
    11 mutation    FALSE
    12        +    FALSE
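The key step in this answer is findInterval(), which for each mutation row returns how many TRUE positions lie at or before it, that is, the index of the nearest preceding TRUE. A small sketch using the same index values as the sample data above:

```r
mutation_idx   <- c(5, 11)        # rows where mutation == "mutation"
distance_T_idx <- c(2, 7, 9, 13)  # rows where distance is TRUE

interval_idx <- findInterval(mutation_idx, distance_T_idx)
interval_idx
# 1 3

# Last TRUE before each mutation row.
distance_T_idx[interval_idx]
# 2 9

# First TRUE after each mutation row.
distance_T_idx[interval_idx + 1]
# 7 13
```

Note that findInterval() returns 0 when a mutation row comes before the first TRUE, so the answer's code would need an extra guard for that case.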

You can use which() to identify the correct rows:

    # get row number of the last TRUE before mx$mutation == "mutation"
    last_true_before_mutation <- max(which(mx$distance == 'TRUE')[which(mx$distance == 'TRUE') < which(mx$mutation == 'mutation')])

    # get row number of the first TRUE after mx$mutation == "mutation"
    first_true_after_mutation <- min(which(mx$distance == 'TRUE')[which(mx$distance == 'TRUE') > which(mx$mutation == 'mutation')])

    # all rows to remove
    rem_rows <- c(seq_len(last_true_before_mutation), seq(first_true_after_mutation, nrow(mx)))

    # remove the appropriate rows
    mx[-rem_rows, ]

Here is a general-purpose function that you can use:

    before_after_mutation <- function(df) {
      last_true_before_mutation <- max(which(df$distance == 'TRUE')[which(df$distance == 'TRUE') < which(df$mutation == 'mutation')])
      first_true_after_mutation <- min(which(df$distance == 'TRUE')[which(df$distance == 'TRUE') > which(df$mutation == 'mutation')])
      rem_rows <- c(seq_len(last_true_before_mutation), seq(first_true_after_mutation, nrow(df)))
      res <- df[-rem_rows, ]
      return(res)
    }

Usage:

 before_after_mutation(mx) 
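The function above assumes exactly one "mutation" row with at least one TRUE on each side of it. As a hedged sketch of how those assumptions could be made explicit, here is a variant; the name `before_after_mutation_safe` is hypothetical and not part of the original answer.

```r
before_after_mutation_safe <- function(df) {
  mut <- which(df$mutation == "mutation")
  # This sketch handles a single mutation row only.
  stopifnot(length(mut) == 1)

  # Comparing with "TRUE" works for logical and character columns alike.
  true_idx <- which(df$distance == "TRUE")
  before <- true_idx[true_idx < mut]
  after  <- true_idx[true_idx > mut]

  # Fall back to the first or last row when there is no TRUE on that side.
  first_keep <- if (length(before) > 0) max(before) + 1 else 1
  last_keep  <- if (length(after) > 0) min(after) - 1 else nrow(df)

  df[first_keep:last_keep, , drop = FALSE]
}

mx <- data.frame(
  mutation = c("-", "-", "-", "-", "mutation", "+", "+", "+", "+"),
  distance = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
)
res <- before_after_mutation_safe(mx)
rownames(res)
# "3" "4" "5" "6"
```

Falling back to the table boundaries is one possible design choice; stopping with an error instead may suit data where a missing TRUE signals a problem.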


Source: https://habr.com/ru/post/1274721/

