How to find the index of a second, non-sequential occurrence of a value in a vector using R?

Question


I need to find the index of the second, non-sequential occurrence of a value in a vector, that is, the position where the value shows up again after its first run has ended.

Some examples of vectors:

Example a) 1 1 1 2 3 4 1 1 1 2 3 4

Example b) 1 2 3 1 1 1 3 5

Please note that the vectors can contain a different number of occurrences of each value and can be very large (more than 100,000 entries).

So, if the value in question is 1, example a) should return the 7th position and example b) the 4th.

Thanks in advance for any help or advice you can provide.

Code Examples:

exampleA<-c(1, 1, 1, 2, 3, 4, 1, 1, 1, 2, 3, 4)
exampleB<-c(1, 2, 3, 1, 1, 1, 3, 5)

+4

r

Marcus morrisey Feb 26 '14 at 17:38


5 answers

You can use which and diff (here a and b are the question's exampleA and exampleB):

x <- which(a == 1)
x[which(diff(x) != 1)[1] + 1]
# [1] 7
y <- which(b == 1)
y[which(diff(y) != 1)[1] + 1]
# [1] 4
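
To see why this works, the intermediate values for the first example can be printed step by step. This is only an illustrative sketch; the names gaps, jump and second_start are not part of the answer above and are introduced here for readability.

```r
a <- c(1, 1, 1, 2, 3, 4, 1, 1, 1, 2, 3, 4)  # the question's exampleA

x <- which(a == 1)              # positions holding the value: 1 2 3 7 8 9
gaps <- diff(x)                 # distance between neighbouring positions: 1 1 4 1 1
jump <- which(gaps != 1)        # a gap other than 1 means a run has ended: 3
second_start <- x[jump[1] + 1]  # first position after the first gap: 7
```

Only the first element of jump is used, so later runs of the value do not affect the result.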

Wrapped in a function:

findFirst <- function(invec, value, event) {
  x <- which(invec == value)
  if (event == 1) out <- x[1]
  else out <- x[which(diff(x) != 1)[event-1] + 1]
  out
}

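invec - the input vector.
value - the value to look for.
event - which occurrence (run) of the value to return (first, second, third, and so on).

Usage: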

findFirst(a, 1, 2)   ## event is the occurrence you want to get
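
As a quick check of the event argument, here is a small sketch using the question's exampleB. The function is repeated so that the snippet runs on its own; the result names first_run, second_run and third_run are made up for this illustration.

```r
findFirst <- function(invec, value, event) {
  x <- which(invec == value)
  if (event == 1) out <- x[1]
  else out <- x[which(diff(x) != 1)[event - 1] + 1]
  out
}

b <- c(1, 2, 3, 1, 1, 1, 3, 5)       # the question's exampleB

first_run  <- findFirst(b, 1, 1)     # start of the first run: 1
second_run <- findFirst(b, 1, 2)     # start of the second run: 4
third_run  <- findFirst(b, 1, 3)     # there is no third run, so this is NA
```

Asking for a run that does not exist indexes past the end of the gap vector, which is why the result is NA rather than an error.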

Benchmarks:

set.seed(1)
a <- sample(25, 1e7, replace = TRUE)
findFirst(a, 10, 2)
# [1] 14
find.index(a, 10)
# [1] 14
op(a, 10)
# [1] 14

library(microbenchmark)
microbenchmark(findFirst(a, 10, 2), find.index(a, 10), op(a, 10), times = 5)
# Unit: milliseconds
#                 expr       min        lq    median        uq       max neval
#  findFirst(a, 10, 2)  281.6979  284.3281  301.6595  380.9089  414.9640     5
#    find.index(a, 10) 3268.0227 3312.0002 3372.3713 3444.7334 3769.0176     5
#            op(a, 10)  272.7325  278.3369  280.3172  286.0758  293.6699     5

+3

A5C1D2H2I1M1N2O1R2T1 Feb 26 '14 at 17:45

Here is a pure-R loop. In the test below it is faster than the Rcpp version, although that depends on the data.

find.index.3 <- function(vec, val) {
  seq_val <- 0
  last_val <- NA
  for(i in seq_along(vec)) {
    if(identical(vec[[i]], val) && !identical(last_val, val))
      if(identical(seq_val <- seq_val + 1, 2)) break
    last_val <- vec[[i]]
  }
  i
}
library(microbenchmark)
microbenchmark(find.index.3(a, 10L), find.second(a, 10))
# Unit: milliseconds
#                  expr       min        lq    median        uq      max neval
#  find.index.3(a, 10L)  5.650716  5.877447  6.095766  8.003047 106.4033   100
#    find.second(a, 10) 15.758154 18.143398 18.934030 20.247239 118.1735   100

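How this compares depends on the data. The loop stops at the first hit, so an early second run favours it, while a late one does not. The comparison uses identical() (EDIT: == is an alternative), which is type-strict, hence 10L rather than 10.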
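
The benchmark call above passes 10L rather than 10 to find.index.3. A short sketch of why that matters, assuming the vector comes from sample() as in the tests; the variable names are invented for this illustration.

```r
set.seed(1)
a_small <- sample(25, 10, replace = TRUE)  # sample() on 1:25 returns integers

is_int        <- is.integer(a_small)   # TRUE
strict_double <- identical(10L, 10)    # FALSE: integer versus double
strict_int    <- identical(10L, 10L)   # TRUE
loose_equal   <- (10L == 10)           # TRUE: == coerces before comparing
```

With identical(), passing a double for val against an integer vector would never match, and the loop would run to the end of the vector.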

EDIT:

Rcpp does better when the match comes later. After modifying a (10000 values set to 25), the timings are:

# Unit: milliseconds
#                  expr      min       lq   median       uq      max neval
#  find.index.3(a, 10L) 80.50039 83.23213 84.27801 85.43654 186.4049   100
#    find.second(a, 10) 17.06515 19.38969 20.52041 23.52533 125.8619   100

+3

BrodieG Feb 26 '14 at 20:34

Another approach:

op <- function(v, x){ # v=vector, x=value
    w <- which(v==x) # 1)
    s <- seq(w[1],length.out=length(w)) # 2)
    return(w[which(w!=s)[1]]) # 3)
}

> exampleA <- c(1, 1, 1, 2, 3, 4, 1, 1, 1, 2, 3, 4)
> exampleB <- c(1, 2, 3, 1, 1, 1, 3, 5)
> op(exampleA, 1)
[1] 7
> op(exampleB, 1)
[1] 4

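1) Find the positions at which v equals x.
2) s is the sequence those positions would form if x occurred only in one unbroken run starting at its first position.
3) Where w==s is TRUE the element still belongs to the first run, so the first position where w!=s is the start of the second run.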
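
The three steps can be traced by hand on the question's exampleB. This is just a walk-through of the function body above; mismatch and second_start are names introduced for the illustration.

```r
v <- c(1, 2, 3, 1, 1, 1, 3, 5)           # the question's exampleB

w <- which(v == 1)                       # 1) positions of the value: 1 4 5 6
s <- seq(w[1], length.out = length(w))   # 2) one unbroken run would be: 1 2 3 4
mismatch <- w != s                       # 3) FALSE TRUE TRUE TRUE
second_start <- w[which(mismatch)[1]]    # first mismatch is the second run: 4
```

If the value occurs in a single run only, there is no mismatch and the result is NA.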

+2

Julián Urbano Feb 26 '14 at 18:07

If speed is a big factor here (and from the original post it looks like it could be), then a custom solution using Rcpp is likely to be faster than any of the pure-R approaches posted so far:

library(Rcpp)
find.second = cppFunction(
"int findSecond(NumericVector x, const int value) {
    bool startFirst = false;
    bool inFirst = false;
    for (int i=0; i < x.size(); ++i) {
        if (x[i] == value) {
            if (!startFirst) {
                startFirst = true;
                inFirst = true;
            } else if (!inFirst) {
                return i+1;
            }
        } else {
            inFirst = false;
        }
    }
    return -1;
}")

Here are @AnandMahto's tests, extended to include find.second:

set.seed(1)
a <- sample(25, 1e7, replace = TRUE)
findFirst(a, 10, 2)
# [1] 14
find.index(a, 10)
# [1] 14
op(a, 10)
# [1] 14
find.second(a, 10)
# [1] 14

microbenchmark(findFirst(a, 10, 2), find.index(a, 10), op(a, 10), find.second(a, 10), times = 5)
# Unit: milliseconds
#                 expr        min         lq     median         uq        max neval
#  findFirst(a, 10, 2)   79.00000   93.85400   96.80120  118.32011  121.56636     5
#    find.index(a, 10) 1620.83892 1673.72124 1689.06826 1747.42781 2145.90346     5
#            op(a, 10)   78.54637   83.71081   94.20531   97.30813  195.78469     5
#   find.second(a, 10)   14.57835   24.36220   25.24104   36.57584   47.45959     5

+2

josliber Feb 26 '14 at 20:15


josliber · Accepted Answer · 2014-02-26T17:46:31+0000

Run-length encoding of the vector (rle) can be useful for this sort of task:

find.index <- function(x, value) {
  r <- rle(x)
  match.pos <- which(r$values == value)  # runs whose value matches
  if (length(match.pos) < 2) {
    return(NA)  # There weren't two sequential sets of observations
  }
  return(sum(r$lengths[1:(match.pos[2]-1)])+1)  # elements before the second matching run, plus one
}

# Test it out
a <- c(1, 1, 1, 2, 3, 4, 1, 1, 1, 2, 3, 4)
b <- c(1, 2, 3, 1, 1, 1, 3, 5)
find.index(a, 1)
# [1] 7
find.index(b, 1)
# [1] 4
find.index(b, 5)
# [1] NA
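
To make the run-length idea concrete, here is a small sketch of what rle() returns for the first example and how run start positions follow from it. The starts vector is an addition for illustration and is not part of the answer's function.

```r
a <- c(1, 1, 1, 2, 3, 4, 1, 1, 1, 2, 3, 4)

r <- rle(a)
run_values  <- r$values    # 1 2 3 4 1 2 3 4
run_lengths <- r$lengths   # 3 1 1 1 3 1 1 1

# Each run starts one past the total length of the runs before it
starts <- cumsum(c(1, head(run_lengths, -1)))   # 1 4 5 6 7 10 11 12

starts_of_ones <- starts[run_values == 1]       # 1 7
```

The second element of starts_of_ones is the index the question asks for, and the same vector gives the third and later runs without extra work.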
