Incrementally find the most frequent item in a list in R

I would like to walk through a list and, for each element, check whether it is the most frequent element seen so far. The solution I have now is incredibly slow compared to Python. Is there an efficient way to speed it up?

    library(data.table)
    dat <- data.table(sample(1:50, 10000, replace = T))
    k <- 1
    correct <- 0  # total correct predictions
    for (i in 2:(nrow(dat)-1)) {
      if (dat[i, V1] %in% dat[1:(i-1), .N, by = V1][order(-N), head(.SD, k)][, V1]) {
        correct <- correct + 1
      }
    }

More generally, I would eventually like to check whether the item is one of the k most frequent items seen up to that point.
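To make the scoring concrete, here is a small walk-through on a made-up vector (my own example, not from the question, chosen so that no ties occur):

    # Toy example (mine, not the question's): dat <- c(1, 1, 2, 1, 2, 1), k <- 1
    # i = 2: counts over dat[1:1] are {1: 1}        -> top-1 = {1}, dat[2] = 1 -> correct
    # i = 3: counts over dat[1:2] are {1: 2}        -> top-1 = {1}, dat[3] = 2 -> wrong
    # i = 4: counts over dat[1:3] are {1: 2, 2: 1}  -> top-1 = {1}, dat[4] = 1 -> correct
    # i = 5: counts over dat[1:4] are {1: 3, 2: 1}  -> top-1 = {1}, dat[5] = 2 -> wrong
    # The loop runs i = 2 .. n-1, so correct ends up as 2.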

For comparison, here is a very fast implementation in Python:

    # Python 2 code (uses iteritems and has_key)
    import heapq
    import operator
    import random

    dat = [random.randint(1, 50) for i in range(10000)]
    correct = 0
    k = 1
    list = {}
    for i in dat:
        toplist = heapq.nlargest(k, list.iteritems(), key=operator.itemgetter(1))
        toplist = [j[0] for j in toplist]
        if i in toplist:
            correct += 1
        if list.has_key(i):
            list[i] = list[i] + 1
        else:
            list[i] = 1
+6
4 answers

Here is what I have so far (my solution is f3):

    set.seed(10)
    dat <- data.table(sample(1:3, 100, replace = T))
    k <- 1

    f3 <- function(dat) {
      correct <- 0  # total correct predictions
      vf <- factor(dat$V1)
      v <- as.integer(vf)
      tabs <- integer(max(v))
      for (i in 2:(nrow(dat)-1)) {
        tabs[v[i-1]] <- tabs[v[i-1]] + 1
        #print(tabs)
        #print(v[1:i])
        if (match(v[i], order(tabs, decreasing = T)) <= k) {
          correct <- correct + 1
        }
        #print(correct)
        #print('')
      }
      correct
    }

    f1 <- function(dat) {
      correct <- 0  # total correct predictions
      for (i in 2:(nrow(dat)-1)) {
        if (dat[i, V1] %in% dat[1:(i-1), .N, by = V1][order(-N), head(.SD, k)][, V1]) {
          correct <- correct + 1
        }
      }
      correct
    }

    library(rbenchmark)
    print(f1(dat) == f3(dat))
    benchmark(f1(dat), f3(dat), replications = 10)

Test results:

         test replications elapsed relative user.self sys.self user.child sys.child
    1 f1(dat)           10   2.939  163.278     2.931    0.008          0         0
    2 f3(dat)           10   0.018    1.000     0.018    0.000          0         0

Encouraging, but f3 has two problems:

  • It doesn't always give the same answer as the OP's algorithm, because ties are handled differently;

  • There is room for improvement, because tabs is re-sorted on every iteration.

+3

The condition holds automatically until k + 1 distinct values have been observed:

    startrow <- dat[, list(.I, .GRP), by = V1][.GRP == k+1]$.I[1]
    correct <- rep(0L, length(v))
    correct[1:(startrow-1)] <- 1L
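For intuition, a toy check of the idea (the vector and the base-R re-expression are mine, not the answer's):

    dat_small <- data.table(V1 = c(3, 3, 3, 1, 3))
    k <- 1
    # row where the (k+1)-th distinct value first appears; same idea as startrow above
    which(!duplicated(dat_small$V1))[k + 1]
    # returns 4, so predictions for rows 1..3 are trivially correct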

You can precompute, for each row, how many times its V1 value has appeared so far:

 ct <- dat[,ct:=1:.N,by=V1]$ct 
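For intuition (again a toy vector of my own), ct is just the running within-value count:

    dat_small <- data.table(V1 = c(2, 1, 2, 3, 2))
    dat_small[, ct := 1:.N, by = V1]$ct
    # c(1, 1, 2, 1, 3): row 5 is the third appearance of the value 2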

Inside the loop, we can check whether the current value displaces the k-th most frequent value:

  • Grab the first k values and their counts up to startrow : topk <- sort(tapply(ct[1:(startrow-1)], v[1:(startrow-1)], max))
  • Note that its first element is the threshold for joining the top-k club: thresh <- unname(topk[1])
  • Loop from startrow to length(v), updating correct (here a vector, not a running total) whenever the threshold is met, and updating the top-k club when the threshold is exceeded and the value is not yet in the club.


That's it; the rest is just details. Here is my function:

    ff <- function(dat){
      vf <- factor(dat$V1)
      v <- as.integer(vf)
      ct <- dat[, ct := 1:.N, by = V1]$ct
      n <- length(v)
      ct <- setNames(ct, v)
      startrow <- dat[, list(.I, .GRP), by = V1][.GRP == k+1]$.I[1]
      topk <- sort(tapply(ct[1:(startrow-1)], v[1:(startrow-1)], max))
      thresh <- unname(topk[1])
      correct <- rep(0L, n)
      correct[1:(startrow-1)] <- 1L
      for (i in startrow:n) {
        cti = ct[i]
        if ( cti >= thresh ){
          correct[i] <- 1L
          if ( cti > thresh & !( names(cti) %in% names(topk) ) ){
            topk <- sort(c(cti, topk))[-1]
            thresh <- unname(topk[1])
          }
        }
      }
      sum(correct)
    }

This is very fast, but its results differ from @MaratTalipov's and the OP's:

    set.seed(1)
    dat <- data.table(sample(1:50, 10000, replace = T))
    k <- 5
    f1(dat) # 1012
    f3(dat) # 1015
    ff(dat) # 1719

Here is my benchmark (excluding the OP's approach, encapsulated in f1(), since I'm impatient):

    > benchmark(f3(dat), ff(dat), replications=10)[,1:5]
         test replications elapsed relative user.self
    1 f3(dat)           10    2.68    2.602      2.67
    2 ff(dat)           10    1.03    1.000      1.03

My function gives more matches than @Marat's and the OP's because it treats a tie at the threshold as "correct", whereas their counts only ever include the at most k values that R's order function happens to rank first.
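To see the difference on a minimal case (my own toy numbers, not from the thread): with k = 1 and two values tied at 3 appearances each, the order-based check admits only one of them, while the threshold check admits both.

    # Hedged sketch of the tie-handling difference, using my own toy counts
    tabs <- c(3, 3)   # values 1 and 2 have each appeared 3 times so far
    k <- 1
    # order()-based test (as in f3/f1): only the value ranked first passes
    match(2, order(tabs, decreasing = TRUE)) <= k   # FALSE: value 2 loses the tie
    # threshold-based test (as in ff): any value matching the k-th largest count passes
    thresh <- sort(tabs, decreasing = TRUE)[k]
    tabs[2] >= thresh                               # TRUE: value 2 ties the threshold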

+3

[New solution]

Here is a fast and very simple dplyr solution for k=1. fC1 below handles ties the same way, i.e. it does not break them; you will see that you can impose any tie-break rule on it. And it is really fast.

    library(dplyr)
    fC1 <- function(dat){
      dat1 <- tbl_df(dat) %>%
        group_by(V1) %>%
        mutate(count=row_number()-1) %>%
        ungroup() %>%
        slice(2:n()-1) %>%
        filter(count!=0) %>%
        mutate(z=cummax(count)) %>%
        filter(count==z)
      z <- dat1$z
      length(z)
    }

    set.seed(1234)
    dat <- data.table(sample(1:5000, 100000, replace=T))
    system.time(a1 <- fC1(dat))[3]    # returns 120
    #elapsed
    #   0.04
    system.time(a3m <- f3m(dat, 1))[3]  # returns 29, same as the Python result, which takes about 60 s
    #elapsed
    #  89.72
    system.time(a3 <- f3(dat, 1))[3]    # returns 31
    #elapsed
    #  95.07

You can freely impose a tie-break rule on the result of fC1 to arrive at other solutions. For example, to reproduce f3m or f3, we restrict which rows are selected as follows:

    fC1_ <- function(dat){
      b <- tbl_df(dat) %>%
        group_by(V1) %>%
        mutate(count=row_number()-1) %>%
        ungroup() %>%
        mutate(L=cummax(count+1))# %>%
      b1 <- b %>%
        slice(2:(n()-1)) %>%
        group_by(L) %>%
        slice(1) %>%
        filter(count+1>=L & count>0)
      b2 <- b %>%
        group_by(L) %>%
        slice(1) %>%
        ungroup() %>%
        select(-L) %>%
        mutate(L=count)
      semi_join(b1, b2, by=c("V1", "L")) %>% nrow
    }

    set.seed(1234)
    dat <- data.table(sample(1:50,10000,replace=T))
    fC1_(dat)
    #[1] 218
    f3m(dat, 1)
    #[1] 217
    f3(dat, 1)
    #[1] 218

And for the earlier example:

    set.seed(1234)
    dat <- data.table(sample(1:5000, 100000, replace=T))
    system.time(fC1_(dat))[3]; fC1_(dat)
    #elapsed
    #   0.05
    #[1] 29

Somehow I could not extend the solution to general k > 1, so I resorted to Rcpp.

    #include <Rcpp.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    std::vector<int> countrank(std::vector<int> y, int k) {
      std::vector<int> v(y.begin(), y.begin() + k);
      std::make_heap(v.begin(), v.end());
      std::vector<int> count(y.size());
      for(int i=0; i < y.size(); i++){
        if(y[i]==0){count[i]=0;}
        else{
          v.push_back(y[i]);
          std::push_heap(v.begin(), v.end());
          std::pop_heap(v.begin(), v.end());
          v.pop_back();
          std::vector<int>::iterator it = std::find(v.begin(), v.end(), y[i]);
          if (it != v.end()) {count[i]=1;};
        }
      }
      return count;
    }

For k=1, it is worth noting that fC1 is no slower than the Rcpp-based fCpp below.

    fCpp <- function(dat, k) {
      dat1 <- tbl_df(dat) %>%
        group_by(V1) %>%
        mutate(count=row_number())
      x <- dat1$V1
      y <- dat1$count-1
      z <- countrank(-y, k)
      sum(z[2:(nrow(dat)-1)])
    }
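For completeness, here is a hedged sketch of how the two pieces might be wired together (the file name countrank.cpp and the k = 2 call are my own choices, not from the answer):

    library(Rcpp)
    library(dplyr)
    library(data.table)
    sourceCpp("countrank.cpp")   # hypothetical file holding the C++ snippet above
    set.seed(1234)
    dat <- data.table(sample(1:50, 10000, replace = TRUE))
    fCpp(dat, 2)   # top-2 predictions; k > 1 is what the Rcpp route enables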

Again, you can impose any tie-break rule with minimal effort.


[ f3, f3m functions]

f3 is @Marat Talipov's, and f3m is a slight amendment to it (which, it seems, may be unnecessary).

    f3m <- function(dat, k){
      n <- nrow(dat)
      dat1 <- tbl_df(dat) %>%
        group_by(V1) %>%
        mutate(count=row_number())
      x <- dat1$V1
      y <- dat1$count
      rank <- rep(NA, n)
      tablex <- numeric(max(x))
      for(i in 2:(n-1)){
        if (y[i]==1) {
          rank[i] = NA   # this condition was originally missing
        } else {
          tablex[x[i-1]] = y[i-1]
          rank[i] = match(x[i], order(tablex, decreasing = T))
        }
      }
      rank <- rank[2:(n-1)]
      sum(rank<=k, na.rm=T)
    }

Refer to the edit history for an earlier solution.

+3

How about this solution:

    # unique values
    unq_vals <- sort(dat[, unique(V1)])
    # cumulative count for each unique value by row
    cum_count <- as.data.table(lapply(unq_vals, function(x) cumsum(dat$V1==x)))
    # running ranking for each unique value by row
    cum_ranks <- t(apply(-cum_count, 1, rank, ties.method='max'))

Now the rank of, for example, the 2nd unique value at the 8th observation is stored in:

 cum_ranks[8, 2] 
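To make the two tables concrete, here is a tiny sketch on made-up data (the vector and the variable names are mine, not the answer's):

    dat_small <- data.table(V1 = c(2, 1, 2, 3, 2))
    unq_small <- sort(dat_small[, unique(V1)])   # 1 2 3
    cc <- as.data.table(lapply(unq_small, function(x) cumsum(dat_small$V1 == x)))
    # row 5 of cc is (1, 3, 1): by observation 5, value 1 has appeared once,
    # value 2 three times, and value 3 once
    cr <- t(apply(-cc, 1, rank, ties.method = 'max'))
    cr[5, 2]   # 1: value 2 is the most frequent value as of observation 5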

You can get the rank of each element row by row (and present it in a readable table) like this. If rank <= k for row i, then the i-th element of V1 is one of the k most frequent elements as of observation i.

 dat[, .(V1, rank=sapply(1:length(V1), function(x) cum_ranks[x, V1[x]]))] 
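Building on that, here is a sketch (mine, not the answer's) of turning the per-row ranks into the OP-style running total for k = 1. Note the OP predicts row i from counts over rows 1..(i-1), so the lookup is shifted by one row, and ties.method='max' means tied values may be scored differently than by f1:

    k <- 1
    n <- nrow(dat)
    val_col <- match(dat$V1, unq_vals)   # column of cum_ranks for each observed value
    correct <- sum(sapply(2:(n - 1), function(i) cum_ranks[i - 1, val_col[i]] <= k))
    correct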

The first code block takes only 0.6883929 seconds on my machine (as measured roughly with now <- Sys.time(); [code block in here]; Sys.time() - now), with dat <- data.table(sample(1:50, 10000, replace=T)).

+2
