[New solution]
Fast and very simple dplyr solution for k=1 . fC1 below considers the connections in the same way, i.e. Does not bind the gap. You will see that you can impose any tie-break rule on it. And it is really fast.
library(dplyr) fC1 <- function(dat){ dat1 <- tbl_df(dat) %>% group_by(V1) %>% mutate(count=row_number()-1) %>% ungroup() %>% slice(2:n()-1) %>% filter(count!=0) %>% mutate(z=cummax(count)) %>% filter(count==z) z <- dat1$z length(z) } set.seed(1234) dat<-data.table(sample(1:5000, 100000, replace=T)) system.time(a1 <- fC1(dat))[3]
You can freely impose a tie-break rule on the result of fC1 to come up with other solutions. For example, to arrive at solutions f3m or f3 , we restrict the selection of some lines as follows:
fC1_ <- function(dat){ b <- tbl_df(dat) %>% group_by(V1) %>% mutate(count=row_number()-1) %>% ungroup() %>% mutate(L=cummax(count+1))# %>% b1 <- b %>% slice(2:(n()-1)) %>% group_by(L) %>% slice(1) %>% filter(count+1>=L& count>0) b2 <- b %>% group_by(L) %>% slice(1) %>% ungroup() %>% select(-L) %>% mutate(L=count) semi_join(b1, b2, by=c("V1", "L")) %>% nrow } set.seed(1234) dat <- data.table(sample(1:50,10000,replace=T)) fC1_(dat) #[1] 218 f3m(dat, 1) #[1] 217 f3(dat, 1) #[1] 218
and for an earlier example
set.seed(1234) dat<-data.table(sample(1:5000, 100000, replace=T)) system.time(fC1_(dat))[3];fC1_(dat)
Somehow, I could not extend the solution to a generic k>1 , so I resorted to Rcpp.
#include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] std::vector<int> countrank(std::vector<int> y, int k) { std::vector<int> v(y.begin(), y.begin() + k); std::make_heap(v.begin(), v.end()); std::vector<int> count(y.size()); for(int i=0; i < y.size(); i++){ if(y[i]==0){count[i]=0;} else{ v.push_back(y[i]); std::push_heap(v.begin(), v.end()); std::pop_heap(v.begin(), v.end()); v.pop_back(); std::vector<int>::iterator it = std::find (v.begin(), v.end(), y[i]); if (it != v.end()) {count[i]=1;}; } } return count; }
For k=1 it is worth noting that fC1 no less fast than the next Rcpp version of fCpp .
fCpp <- function(dat, k) { dat1 <- tbl_df(dat) %>% group_by(V1) %>% mutate(count=row_number()) x <- dat1$V1 y <- dat1$count-1 z <- countrank(-y, k) sum(z[2:(nrow(dat)-1)]) }
Again, you can impose any tie-break rule with minimal effort.
[ f3, f3m functions]
f3 owned by @Marat Talipov, and f3m is some kind of amendment to it (it seems to seem unnecessary).
f3m <- function(dat, k){ n <- nrow(dat) dat1 <- tbl_df(dat) %>% group_by(V1) %>% mutate(count=row_number()) x <- dat1$V1 y <- dat1$count rank <- rep(NA, n) tablex <- numeric(max(x)) for(i in 2:(n-1)){ if(y[i]==1){rank[i]=NA}
Refer to change history for an earlier solution.