How to add a group id without a loop?

I have a dataframe, for example:

    productid ordernum
    p1        10
    p2        20
    p3        30
    p4        5
    p5        20
    p6        8

I would like to add another column called groupid. It should group products in sequence: as soon as the running sum of ordernum within a group reaches 30, the next product starts a new group identifier. For example, the result should be:

    productid ordernum groupid
    p1        10       1
    p2        20       1
    p3        30       2
    p4        5        3
    p5        20       3
    p6        8        3

This is very easy to do with a loop. How can I achieve it without one?

1 answer

How about a short for loop written in C++ using Rcpp? This small function takes a numeric vector (your ordernum column) and a threshold argument (the cumulative sum at which you want to start a new identifier), and returns a vector of group identifiers the same length as the input vector. It should run relatively fast, since it is a for loop in C++. The following snippet installs Rcpp if you don't have it yet and compiles the function so it is ready to use. Just copy and paste it into R:

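    if( !require(Rcpp) ) install.packages("Rcpp"); require(Rcpp)

    Rcpp::cppFunction('
    NumericVector grpid( NumericVector x , double threshold ){
      int n = x.size();
      NumericVector out(n);
      double tot = 0;  // running sum in the current group (double, so fractional values are not truncated)
      int id = 1;      // current group identifier
      for( int i = 0; i < n; ++i ){
        tot += x[i];
        out[i] = id;
        // once the running sum reaches the threshold, start a new group
        if( tot >= threshold ){
          id += 1;
          tot = 0;
        }
      }
      return out;
    }')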

Then call it like any other R function, supplying the appropriate arguments:

    df$groupid <- grpid( df$ordernum , 30 )
    #   productid ordernum groupid
    # 1        p1       10       1
    # 2        p2       20       1
    # 3        p3       30       2
    # 4        p4        5       3
    # 5        p5       20       3
    # 6        p6        8       3
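If you want to check the grouping logic outside of R, the same loop can be sketched as plain C++. This is only an illustration, not part of the Rcpp function: it assumes std::vector in place of Rcpp's NumericVector, and the name group_ids is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical standalone version of the grouping loop (an illustration,
// not the Rcpp function itself): assigns consecutive group ids and starts
// a new group after the running sum reaches `threshold`.
std::vector<int> group_ids(const std::vector<double>& x, double threshold) {
    std::vector<int> out(x.size());
    double tot = 0.0;  // running sum within the current group
    int id = 1;        // current group identifier
    for (std::size_t i = 0; i < x.size(); ++i) {
        tot += x[i];
        out[i] = id;   // the item that reaches the threshold stays in this group
        if (tot >= threshold) {
            ++id;      // the next item opens a new group
            tot = 0.0;
        }
    }
    return out;
}
```

Calling group_ids({10, 20, 30, 5, 20, 8}, 30) on the question's data gives 1, 1, 2, 3, 3, 3, matching the groupid column above.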

BENCHMARK COMPARISON

The OP asked me to compare the Rcpp loop with a base R for loop. Here are the code and the results. The Rcpp version is roughly 400 times faster (about 440x at the median) on a vector of 100,000 order numbers:

    set.seed(1)
    x <- sample( 30 , 1e5 , replace = TRUE )

    # base R version of the same loop
    for.loop <- quote({
      tot <- 0
      id <- 1
      out <- numeric(length(x))
      for( i in seq_along(x) ){
        tot <- tot + x[i]
        out[i] <- id
        if( tot >= 30 ){
          tot <- 0
          id <- id + 1
        }
      }
    })

    rcpp.loop <- quote( out <- grpid(x,30) )

    require( microbenchmark )
    # 50 evaluations of each expression, matching neval in the output below
    bm <- microbenchmark( eval(rcpp.loop) , eval(for.loop) , times = 50 )
    print( bm , unit = "relative" , digits = 2 , order = "median" )

    Unit: relative
                expr min  lq median  uq max neval
     eval(rcpp.loop)   1   1      1   1   1    50
      eval(for.loop) 533 462    442 428 325    50

Source: https://habr.com/ru/post/1499769/

