How to count the repeating part of a sequence in R?

Is it possible to count the occurrences of a repeating part of a sequence in R? For instance:

 x <- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2, 3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)

Is it possible to count the number of times the subsequence 3.0, 3.1, 3.2 occurs? In this example, the answer should be 4.

4 answers

I would do something like this:

    pattern <- c(3, 3.1, 3.2)
    len1 <- seq_len(length(x) - length(pattern) + 1)
    len2 <- seq_len(length(pattern)) - 1
    sum(colSums(matrix(x[outer(len1, len2, '+')],
                       ncol = length(len1), byrow = TRUE) == pattern) == length(len2))

PS: if you change sum to which, you get the start index of each occurrence.
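To illustrate the PS, here is a minimal sketch of the same computation with which in place of sum. The names windows and starts are introduced only for this illustration and are not part of the original answer.

```r
x <- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2,
       3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)
pattern <- c(3, 3.1, 3.2)

len1 <- seq_len(length(x) - length(pattern) + 1)  # candidate start positions
len2 <- seq_len(length(pattern)) - 1              # offsets inside one window

# one column per candidate window, one row per pattern element
windows <- matrix(x[outer(len1, len2, '+')], ncol = length(len1), byrow = TRUE)

# which() returns the start positions of the full matches instead of their count
starts <- which(colSums(windows == pattern) == length(pattern))
starts
# [1]  2  8 16 26
```

Because the pattern has as many elements as the matrix has rows, the comparison recycles it down each column, so every column is checked against the whole pattern.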


Another approach (a general moving window):

    x <- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2,
           3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)
    s <- c(3, 3.1, 3.2)
    sum(apply(embed(x, length(s)), 1, function(y) {all(y == rev(s))}))
    # [1] 4

Look at the output of embed to see what is happening.
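As a small illustration of what embed returns, here is a sketch on a short made-up vector (v and m are hypothetical names used only here):

```r
v <- c(10, 20, 30, 40, 50)

# each row is one window of length 3, stored in reverse order
m <- embed(v, 3)
m
#      [,1] [,2] [,3]
# [1,]   30   20   10
# [2,]   40   30   20
# [3,]   50   40   30
```

Each row holds the window elements last to first, which is why the comparison above uses rev(s) rather than s.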

As Arun points out, apply is quite slow here. You can combine embed with Arun's matrix trick to get the same result much faster:

    sum(colSums(matrix(embed(x, length(s)), byrow = TRUE,
                       nrow = length(s)) == rev(s)) == length(s))

You can turn the vector into a string and use gregexpr.

    sum(gregexpr("3 3.1 3.2", paste(x, collapse=" "), fixed=TRUE)[[1]] != -1)
    # [1] 4
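One caveat with the string approach, offered as a side note rather than as part of the original answer: a fixed substring can also match inside longer numbers. The sketch below compares the fixed search with a Perl-style pattern that refuses a digit or a dot on either side of the match. The vector y and the helpers count_fixed and count_whole are hypothetical, and the pattern is hard-coded for 3, 3.1, 3.2 only.

```r
x <- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2,
       3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)
y <- c(13, 3.1, 3.25)  # does not contain 3, 3.1, 3.2 as elements

# plain substring search, as in the answer above
count_fixed <- function(v) {
  sum(gregexpr("3 3.1 3.2", paste(v, collapse = " "), fixed = TRUE)[[1]] != -1)
}

# same search, but the match must not touch another digit or dot
count_whole <- function(v) {
  rx <- "(?<![0-9.])3 3\\.1 3\\.2(?![0-9.])"
  sum(gregexpr(rx, paste(v, collapse = " "), perl = TRUE)[[1]] != -1)
}

count_fixed(x)  # 4
count_fixed(y)  # 1, a false positive inside "13 3.1 3.25"
count_whole(x)  # 4
count_whole(y)  # 0
```

Both versions rely on how paste formats the numbers, so they are best suited to short decimal values like the ones in the question.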

Carl Witthoft's seqle function may be useful here.

The function is as follows:

    seqle <- function(x, incr=1) {
      if(!is.numeric(x)) x <- as.numeric(x)
      n <- length(x)
      y <- x[-1L] != x[-n] + incr
      i <- c(which(y|is.na(y)), n)
      list(lengths = diff(c(0L,i)),
           values = x[head(c(0L,i)+1L,-1L)])
    }

Applied to your data, it gives:

    temp <- seqle(x, incr=.1)
    temp
    # $lengths
    # [1] 1 3 1 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 1
    #
    # $values
    # [1] 1.0 3.0 1.0 1.0 2.0 3.0 4.0 4.0 5.0 6.0 5.0 3.0 3.1 2.0 1.0 4.0
    # [17] 6.0 4.0 4.0 3.0 5.0 3.2 3.0 4.0

How do we read this? lengths tells us that the vector starts with a run of length 1, then a run of length 3, then three runs of length 1, then another run of length 3, and so on. values gives the first value of each run: the first run of length 3 starts at 3.0, the next run of length 3 also starts at 3.0, etc.

This is easier to see as a data.frame.

    data.frame(temp)[temp$lengths > 1, ]
    #    lengths values
    # 2        3      3
    # 6        3      3
    # 12       3      3
    # 20       3      3

In this example, all the runs have the same length and start with the same value, so we can get your answer simply by counting the rows of the data.frame obtained above.
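When the runs do not all share the same length and starting value, counting rows is not enough. A minimal sketch of a more explicit filter, assuming the target is a run that starts at 3 and has exactly three elements (n_runs is a name introduced only for this illustration):

```r
# seqle as given above
seqle <- function(x, incr = 1) {
  if (!is.numeric(x)) x <- as.numeric(x)
  n <- length(x)
  y <- x[-1L] != x[-n] + incr
  i <- c(which(y | is.na(y)), n)
  list(lengths = diff(c(0L, i)),
       values = x[head(c(0L, i) + 1L, -1L)])
}

x <- c(1,3.0,3.1,3.2,1,1,2,3.0,3.1,3.2,4,4,5,6,5,3.0,3.1,3.2,
       3.1,2,1,4,6,4.0,4,3.0,3.1,3.2,5,3.2,3.0,4)
temp <- seqle(x, incr = 0.1)

# count only the runs that start at 3 and contain exactly three elements
n_runs <- sum(temp$lengths == 3 & temp$values == 3)
n_runs
# [1] 4
```

Note that this treats a longer run such as 3.0, 3.1, 3.2, 3.3 as a different run, and that seqle compares floating-point values exactly, so other increments may need a tolerance.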


Source: https://habr.com/ru/post/1488692/

