Find which interval row in a data frame each element of a vector belongs to

Question


I have a vector of numeric elements and a data frame with two columns that define the start and end points of intervals. Each row in the data frame is one interval. I want to find out which interval each element in the vector belongs to.

Here are some sample data:

    # Find which interval that each element of the vector belongs in
    library(tidyverse)
    elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
    intervals <- frame_data(~phase, ~start, ~end,
                            "a", 0, 0.5,
                            "b", 1, 1.9,
                            "c", 2, 2.5)

The same example data for those who object to the tidyverse:

    elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
    intervals <- structure(list(phase = c("a", "b", "c"),
                                start = c(0, 1, 2),
                                end = c(0.5, 1.9, 2.5)),
                           .Names = c("phase", "start", "end"),
                           row.names = c(NA, -3L),
                           class = "data.frame")

Here is one way to do this:

    library(intrval)
    phases_for_elements <-
      map(elements, ~.x %[]% data.frame(intervals[, c('start', 'end')])) %>%
      map(., ~unlist(intervals[.x, 'phase']))

Here's the output:

    [[1]]
    phase
      "a"

    [[2]]
    phase
      "a"

    [[3]]
    phase
      "a"

    [[4]]
    character(0)

    [[5]]
    phase
      "b"

    [[6]]
    phase
      "b"

    [[7]]
    phase
      "c"

But I'm looking for a simpler method with less typing. I've seen findInterval in related questions, but I'm not sure how I can use it in this situation.

+5

r dataframe intervals

Ben Dec 13 '16 at 23:03


7 answers

cut may be useful here.

    out <- cut(elements, t(intervals[c("start","end")]))
    levels(out)[c(FALSE,TRUE)] <- NA
    intervals$phase[out]
    #[1] "a" "a" "a" NA  "b" "b" "c"

+4

thelatemail Dec 14 '16 at 2:43


David Arenburg's mention of non-equi joins was very helpful for understanding what general kind of problem this is (thanks!). I can see now that they are not implemented for dplyr. Thanks to this answer, I see that there is a fuzzyjoin package that can do it in the same idiom. But it's barely any simpler than my map solution above (though more readable, in my view), and it doesn't hold a candle to the cut answer for brevity.

In my example above, the fuzzyjoin solution would be

    library(fuzzyjoin)
    library(tidyverse)

    fuzzy_left_join(data.frame(elements), intervals,
                    by = c("elements" = "start", "elements" = "end"),
                    match_fun = list(`>=`, `<=`)) %>%
      distinct()

Which gives:

      elements phase start end
    1      0.1     a     0 0.5
    2      0.2     a     0 0.5
    3      0.5     a     0 0.5
    4      0.9  <NA>    NA  NA
    5      1.1     b     1 1.9
    6      1.9     b     1 1.9
    7      2.1     c     2 2.5

+4

Ben Dec 14 '16 at 6:57


Inspired by @thelatemail's cut solution, here is one using findInterval, which still requires a lot of typing:

    out <- findInterval(elements, t(intervals[c("start","end")]), left.open = TRUE)
    out[!(out %% 2)] <- NA
    intervals$phase[out %/% 2L + 1L]
    #[1] "a" "a" "a" NA  "b" "b" "c"

Caveat: cut and findInterval (as used here) work with left-open intervals, so an element that equals a start point is not matched. The solutions using cut and findInterval are therefore not equivalent to Ben's using intrval, David's non-equi join using data.table, or my other solution using foverlaps, all of which treat the intervals as closed.
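As a small illustration of that caveat, here is a sketch in base R that compares the cut approach with a plain closed-interval test. The probe values (0, 0.5 and 1) and the names via_cut and via_closed are made up for this example; the values are chosen to sit exactly on interval boundaries.

```r
# Same intervals as in the question; probe values lie on the boundaries
elements <- c(0, 0.5, 1)
intervals <- data.frame(phase = c("a", "b", "c"),
                        start = c(0, 1, 2),
                        end   = c(0.5, 1.9, 2.5),
                        stringsAsFactors = FALSE)

# cut approach: breaks are 0, 0.5, 1, 1.9, 2, 2.5 and bins are (lower, upper]
out <- cut(elements, t(intervals[c("start", "end")]))
levels(out)[c(FALSE, TRUE)] <- NA   # drop the gaps between intervals
via_cut <- intervals$phase[out]

# closed-interval test: start <= x <= end
via_closed <- vapply(elements, function(x) {
  hit <- intervals$phase[x >= intervals$start & x <= intervals$end]
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))

print(via_cut)     # NA  "a" NA
print(via_closed)  # "a" "a" "b"
```

The start points 0 and 1 are matched by the closed test but fall outside the left-open bins used by cut, so they come back as NA there. The example data in the question never hits a start point exactly, which is why all answers agree on it.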

+3

Uwe Dec 14 '16 at 6:42


Just lapply works:

    l <- lapply(elements, function(x){
        intervals$phase[x >= intervals$start & x <= intervals$end]
    })

    str(l)
    ## List of 7
    ##  $ : chr "a"
    ##  $ : chr "a"
    ##  $ : chr "a"
    ##  $ : chr(0)
    ##  $ : chr "b"
    ##  $ : chr "b"
    ##  $ : chr "c"

or in purrr, if you purrrfer,

    library(purrr)

    elements %>%
        map(~intervals$phase[.x >= intervals$start & .x <= intervals$end]) %>%
        # Clean up a bit. Shorter, but less readable: map_chr(~.x[1] %||% NA)
        # NA_character_ keeps the result type-stable for map_chr
        map_chr(~ifelse(length(.x) == 0, NA_character_, .x))
    ## [1] "a" "a" "a" NA  "b" "b" "c"

+3

alistaire Dec 19 '16 at 9:38


Here is a kind of "one-liner" that (mis-)uses foverlaps from the data.table package, although David's non-equi join is even more concise:

    library(data.table) #v1.10.0
    foverlaps(data.table(start = elements, end = elements),
              setDT(intervals, key = c("start", "end")))
    #   phase start end i.start i.end
    #1:     a     0 0.5     0.1   0.1
    #2:     a     0 0.5     0.2   0.2
    #3:     a     0 0.5     0.5   0.5
    #4:    NA    NA  NA     0.9   0.9
    #5:     b     1 1.9     1.1   1.1
    #6:     b     1 1.9     1.9   1.9
    #7:     c     2 2.5     2.1   2.1

+2

Uwe Dec 14 '16 at 7:05


For the sake of completeness, here is another way, using the intervals package:

    library(tidyverse)
    elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

    intervalsDF <-
      frame_data(  ~phase, ~start, ~end,
                   "a",     0,      0.5,
                   "b",     1,      1.9,
                   "c",     2,      2.5
      )

    library(intervals)
    library(rlist)
    interval_overlap(
      Intervals(intervalsDF %>% select(-phase) %>% as.matrix, closed = c(TRUE, TRUE)),
      Intervals(data_frame(start = elements, end = elements), closed = c(TRUE, TRUE))
    ) %>%
      list.map(data_frame(interval_index = .i, element_index = .)) %>%
      do.call(what = bind_rows)

    # A tibble: 6 × 2
    #  interval_index element_index
    #           <int>         <int>
    #1              1             1
    #2              1             2
    #3              1             3
    #4              2             5
    #5              2             6
    #6              3             7
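The result above is a table of index pairs rather than one phase per element. A minimal base R sketch of one way to turn those pairs into a phase vector follows; the pairs are copied by hand from the tibble above, and the names pairs, phases and phase_for_element are only illustrative.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
phases <- c("a", "b", "c")

# Index pairs copied from the tibble printed above (element 4 has no match)
pairs <- data.frame(interval_index = c(1L, 1L, 1L, 2L, 2L, 3L),
                    element_index  = c(1L, 2L, 3L, 5L, 6L, 7L))

# Start with NA everywhere, then fill in the elements that had a match
phase_for_element <- rep(NA_character_, length(elements))
phase_for_element[pairs$element_index] <- phases[pairs$interval_index]

print(phase_for_element)  # "a" "a" "a" NA "b" "b" "c"
```

This assumes each element matches at most one interval; with overlapping intervals a later pair would overwrite an earlier one.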

+2

Sainath adapa Dec 14 '16 at 8:33


David Arenburg · Accepted Answer · 2016-12-13T23:32:10+0000

Here's a possible solution using the new "non-equi" joins in data.table (v >= 1.9.8). While I doubt you'll like the syntax, it should be a very efficient solution.

Also, regarding findInterval: this function assumes continuity in your intervals, which is not the case here, so I doubt there is a straightforward solution using it.

    library(data.table) #v1.10.0
    setDT(intervals)[data.table(elements), on = .(start <= elements, end >= elements)]
    #    phase start end
    # 1:     a   0.1 0.1
    # 2:     a   0.2 0.2
    # 3:     a   0.5 0.5
    # 4:    NA   0.9 0.9
    # 5:     b   1.1 1.1
    # 6:     b   1.9 1.9
    # 7:     c   2.1 2.1

Regarding the above code, I find it pretty self-explanatory: join intervals and elements by the condition specified in the on statement. That's pretty much it.

There is a certain caveat here, though: start, end and elements should all be of the same type, so if one of them is integer, it should be converted to numeric first.
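To illustrate the findInterval remark earlier in this answer, here is a base R sketch (not part of the data.table solution) showing what goes wrong when only the start points are used as breaks, and one possible extra check to repair it. It assumes the intervals are sorted by start and do not overlap; the names idx, naive, inside and phase are illustrative.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- data.frame(phase = c("a", "b", "c"),
                        start = c(0, 1, 2),
                        end   = c(0.5, 1.9, 2.5),
                        stringsAsFactors = FALSE)

# Index of the last start point that is <= each element (0 if there is none)
idx <- findInterval(elements, intervals$start)
idx[idx == 0] <- NA

# Taken alone, this assigns 0.9 to "a", because the gap before "b" is ignored
naive <- intervals$phase[idx]

# Extra step: keep a match only if the element is also <= that row's end
inside <- !is.na(idx) & elements <= intervals$end[idx]
phase <- ifelse(inside, intervals$phase[idx], NA_character_)

print(naive)  # "a" "a" "a" "a" "b" "b" "c"
print(phase)  # "a" "a" "a" NA  "b" "b" "c"
```

With the end check added, the intervals behave as closed on both sides, which matches the join condition used above.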
