Find which interval row in a data frame each element of a vector belongs to

I have a vector of numeric elements and a data frame with two columns that define the start and end points of intervals. Each row in the data frame is one interval. I want to find out which interval each element in the vector belongs to.

Here are some sample data:

    # Find which interval that each element of the vector belongs in
    library(tidyverse)
    elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
    intervals <- frame_data(~phase, ~start, ~end,
                            "a",    0,      0.5,
                            "b",    1,      1.9,
                            "c",    2,      2.5)

The same sample data for those who prefer not to use the tidyverse:

    elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
    intervals <- structure(list(phase = c("a", "b", "c"),
                                start = c(0, 1, 2),
                                end = c(0.5, 1.9, 2.5)),
                           .Names = c("phase", "start", "end"),
                           row.names = c(NA, -3L),
                           class = "data.frame")

Here is one way to do this:

    library(intrval)
    phases_for_elements <-
      map(elements, ~.x %[]% data.frame(intervals[, c('start', 'end')])) %>%
      map(., ~unlist(intervals[.x, 'phase']))

Here's the output:

    [[1]]
    phase
      "a"

    [[2]]
    phase
      "a"

    [[3]]
    phase
      "a"

    [[4]]
    character(0)

    [[5]]
    phase
      "b"

    [[6]]
    phase
      "b"

    [[7]]
    phase
      "c"

But I'm looking for a simpler method that takes fewer characters to type. I've seen findInterval in related questions, but I'm not sure how I can use it in this situation.

7 answers

Here is a possible solution using the new non-equi joins in data.table (v >= 1.9.8). Although I doubt you will like the syntax, it should be a very efficient solution.

Also, regarding findInterval: this function assumes that your intervals are contiguous, which is not the case here, so I doubt there is a straightforward solution using it.

    library(data.table) # v1.10.0
    setDT(intervals)[data.table(elements), on = .(start <= elements, end >= elements)]
    #    phase start end
    # 1:     a   0.1 0.1
    # 2:     a   0.2 0.2
    # 3:     a   0.5 0.5
    # 4:    NA   0.9 0.9
    # 5:     b   1.1 1.1
    # 6:     b   1.9 1.9
    # 7:     c   2.1 2.1

Regarding the code above, I find it pretty self-explanatory: join intervals and elements by the condition specified in the on argument. That's pretty much it.

There is one caveat, though: start, end and elements should all be of the same type, so if one of them is integer, it should be converted to numeric first.
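To illustrate the findInterval remark above, here is a minimal base R sketch. It assumes only the start column is used as the break points; the names starts and idx are illustrative and not part of the question.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

# Assumption: use only the interval starts as break points
starts <- c(0, 1, 2)
phase  <- c("a", "b", "c")

# findInterval() treats the breaks as contiguous bins:
# [0, 1), [1, 2), [2, Inf)
idx <- findInterval(elements, starts)
idx
# [1] 1 1 1 1 2 2 3

phase[idx]
# [1] "a" "a" "a" "a" "b" "b" "c"
```

Note that 0.9 is assigned to "a" even though it lies in the gap between 0.5 and 1, because the end points are never consulted.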


cut may be useful here.

    out <- cut(elements, t(intervals[c("start","end")]))
    levels(out)[c(FALSE,TRUE)] <- NA
    intervals$phase[out]
    #[1] "a" "a" "a" NA  "b" "b" "c"
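The cut approach is compact, so a step-by-step version may help. This is only a sketch of what I understand the code above to be doing, using a plain data frame; the names breaks, n_bins and phases are illustrative.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- data.frame(
  phase = c("a", "b", "c"),
  start = c(0, 1, 2),
  end   = c(0.5, 1.9, 2.5),
  stringsAsFactors = FALSE
)

# Step 1: interleave start and end points into one vector of breaks:
# 0, 0.5, 1, 1.9, 2, 2.5
breaks <- as.vector(t(as.matrix(intervals[c("start", "end")])))

# Step 2: cut() bins the elements into the consecutive bins between breaks.
# Odd-numbered bins are real intervals, even-numbered bins are the gaps.
out <- cut(elements, breaks)
n_bins <- nlevels(out)

# Step 3: set every second level (the gaps) to NA, which drops those levels
# and turns the elements that fell into them into NA.
levels(out)[c(FALSE, TRUE)] <- NA

# Step 4: the remaining integer codes 1, 2, 3 line up with the rows of intervals.
phases <- intervals$phase[as.integer(out)]
phases
# [1] "a" "a" "a" NA  "b" "b" "c"
```

This relies on the intervals being sorted and non-overlapping, which holds for the sample data.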

David Arenburg's mention of non-equi joins was very helpful for understanding what general kind of problem this is (thanks!). I can see now that they are not implemented in dplyr. Thanks to this answer, I see that there is a fuzzyjoin package that can do it in the same idiom. But it is barely any simpler than my map solution above (though more readable, in my view), and it doesn't hold a candle to the cut answer for brevity.

For my example above, the fuzzyjoin solution would be:

    library(fuzzyjoin)
    library(tidyverse)
    fuzzy_left_join(data.frame(elements), intervals,
                    by = c("elements" = "start", "elements" = "end"),
                    match_fun = list(`>=`, `<=`)) %>%
      distinct()

Which gives:

      elements phase start end
    1      0.1     a     0 0.5
    2      0.2     a     0 0.5
    3      0.5     a     0 0.5
    4      0.9  <NA>    NA  NA
    5      1.1     b     1 1.9
    6      1.9     b     1 1.9
    7      2.1     c     2 2.5

Inspired by @thelatemail's cut solution, here is one using findInterval, which still requires a lot of typing:

    out <- findInterval(elements, t(intervals[c("start","end")]), left.open = TRUE)
    out[!(out %% 2)] <- NA
    intervals$phase[out %/% 2L + 1L]
    #[1] "a" "a" "a" NA  "b" "b" "c"

Caveat: cut and findInterval use left-open intervals. Therefore, the solutions using cut and findInterval are not equivalent to Ben's using intrval, David's non-equi join using data.table, or the other solution using foverlaps.
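The difference only shows up for values that sit exactly on a start point. A minimal base R sketch, assuming a few hypothetical boundary values in x (they are not part of the question's data):

```r
intervals <- data.frame(
  phase = c("a", "b", "c"),
  start = c(0, 1, 2),
  end   = c(0.5, 1.9, 2.5),
  stringsAsFactors = FALSE
)

# Hypothetical test values, each exactly on a start or an end point
x <- c(0, 0.5, 1, 1.9)

# Closed intervals [start, end], as in the intrval, non-equi join
# and foverlaps solutions
closed <- vapply(x, function(v) {
  hit <- intervals$phase[v >= intervals$start & v <= intervals$end]
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))
closed
# [1] "a" "a" "b" "b"

breaks <- as.vector(t(as.matrix(intervals[c("start", "end")])))

# Left-open intervals (start, end] via cut()
by_cut <- cut(x, breaks)
levels(by_cut)[c(FALSE, TRUE)] <- NA
left_open_cut <- intervals$phase[as.integer(by_cut)]
left_open_cut
# [1] NA  "a" NA  "b"

# Left-open intervals (start, end] via findInterval()
fi <- findInterval(x, breaks, left.open = TRUE)
fi[fi %% 2 == 0] <- NA   # even indices (and 0) are gaps
left_open_fi <- intervals$phase[fi %/% 2L + 1L]
left_open_fi
# [1] NA  "a" NA  "b"
```

So a value equal to a start point (0 or 1 here) is matched by the closed-interval solutions but comes back as NA from cut and findInterval.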


Just lapply works:

    l <- lapply(elements, function(x){
        intervals$phase[x >= intervals$start & x <= intervals$end]
    })

    str(l)
    ## List of 7
    ##  $ : chr "a"
    ##  $ : chr "a"
    ##  $ : chr "a"
    ##  $ : chr(0)
    ##  $ : chr "b"
    ##  $ : chr "b"
    ##  $ : chr "c"

or in purrr, if you purrrfer,

    elements %>%
        map(~intervals$phase[.x >= intervals$start & .x <= intervals$end]) %>%
        # Clean up a bit. Shorter, but less readable: map_chr(~.x[1] %||% NA)
        map_chr(~ifelse(length(.x) == 0, NA, .x))
    ## [1] "a" "a" "a" NA  "b" "b" "c"
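The same closed-interval comparison can also be done for all elements at once, without an explicit loop. A minimal base R sketch, assuming the intervals do not overlap so that each element matches at most one row; the names hit and idx are illustrative.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- data.frame(
  phase = c("a", "b", "c"),
  start = c(0, 1, 2),
  end   = c(0.5, 1.9, 2.5),
  stringsAsFactors = FALSE
)

# hit[i, j] is TRUE when elements[i] lies in the closed interval of row j
hit <- outer(elements, intervals$start, `>=`) &
       outer(elements, intervals$end, `<=`)

# Index of the first matching interval per element, NA when there is none
idx <- apply(hit, 1, function(row) which(row)[1])

phases <- intervals$phase[idx]
phases
# [1] "a" "a" "a" NA  "b" "b" "c"
```

The hit matrix has one cell per element and interval pair, so this is only sensible for small inputs; for large ones the join-based answers scale better.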

Here is a kind of "one-liner" that (mis-)uses foverlaps from the data.table package, although David's non-equi join is still more concise:

    library(data.table) # v1.10.0
    foverlaps(data.table(start = elements, end = elements),
              setDT(intervals, key = c("start", "end")))
    #   phase start end i.start i.end
    #1:     a     0 0.5     0.1   0.1
    #2:     a     0 0.5     0.2   0.2
    #3:     a     0 0.5     0.5   0.5
    #4:    NA    NA  NA     0.9   0.9
    #5:     b     1 1.9     1.1   1.1
    #6:     b     1 1.9     1.9   1.9
    #7:     c     2 2.5     2.1   2.1

For the sake of completeness, here is another way, using the intervals package:

    library(tidyverse)
    elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

    intervalsDF <-
      frame_data(  ~phase, ~start, ~end,
                   "a",    0,      0.5,
                   "b",    1,      1.9,
                   "c",    2,      2.5
      )

    library(intervals)
    library(rlist)
    interval_overlap(
      Intervals(intervalsDF %>% select(-phase) %>% as.matrix, closed = c(TRUE, TRUE)),
      Intervals(data_frame(start = elements, end = elements), closed = c(TRUE, TRUE))
    ) %>%
      list.map(data_frame(interval_index = .i, element_index = .)) %>%
      do.call(what = bind_rows)
    # A tibble: 6 × 2
    #  interval_index element_index
    #           <int>         <int>
    #1              1             1
    #2              1             2
    #3              1             3
    #4              2             5
    #5              2             6
    #6              3             7

Source: https://habr.com/ru/post/1261227/

