How about a short for loop written in C++ using Rcpp? This small function takes a numeric vector (i.e. your ordernum column) and a threshold argument (the cumulative sum at which you want to start a new identifier), and returns a vector of group identifiers the same length as the input vector. It should run relatively fast since the loop runs in C++. The following code snippet will install Rcpp for you if you haven't installed it yet and compile a function ready for use. Just copy and paste into R ...
    if( !require(Rcpp) ) install.packages("Rcpp")
    require(Rcpp)

    Rcpp::cppFunction( '
      NumericVector grpid( NumericVector x , int threshold ){
        int n = x.size();
        NumericVector out(n);
        int tot = 0;   // running sum within the current group
        int id = 1;    // current group identifier
        for( int i = 0; i < n; ++i){
          tot += x[i];
          out[i] = id;
          if( tot >= threshold ){   // threshold reached: next element starts a new group
            id += 1;
            tot = 0;
          }
        }
        return out;
      }')
Then, to use the function, call it like any other R function, supplying the appropriate arguments:
df$groupid <- grpid( df$ordernum , 30 )
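To see what the loop does on a small input, here is a standalone sketch of the same logic in plain C++, using std::vector in place of the Rcpp types. The name group_ids and the sample values are illustrative assumptions, not part of the answer's code, and the running total is kept as a double here (the Rcpp version uses an int, which behaves the same for whole-number inputs).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical standalone version of the grouping loop (not the Rcpp function itself).
// Each element gets the current group id; once the running sum reaches the
// threshold, the NEXT element starts a new group.
std::vector<int> group_ids(const std::vector<double>& x, double threshold) {
    std::vector<int> out(x.size());
    double tot = 0.0;  // running sum within the current group
    int id = 1;        // current group identifier
    for (std::size_t i = 0; i < x.size(); ++i) {
        tot += x[i];
        out[i] = id;
        if (tot >= threshold) {
            ++id;
            tot = 0.0;
        }
    }
    return out;
}
```

For example, group_ids({10, 15, 10, 5, 30, 2}, 30) gives {1, 1, 1, 2, 2, 3}: the running sum reaches 35 at the third element, so that element still belongs to group 1 and the fourth element opens group 2.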
BENCHMARK COMPARISON
The OP asked me to compare the Rcpp loop with a base R for loop. Here are the code and the results. The Rcpp version is roughly 400 times faster (442-fold at the median) on a vector of 100,000 product identifiers:
    set.seed(1)
    x <- sample(30, 1e5, replace = TRUE)

    # Base R version of the same loop
    for.loop <- quote({
      tot <- 0
      id <- 1
      out <- numeric(length(x))
      for( i in 1:length(x) ){
        tot <- tot + x[i]
        out[i] <- id
        if( tot >= 30 ){
          tot <- 0
          id <- id + 1
        }
      }
    })

    rcpp.loop <- quote( out <- grpid(x,30) )

    require( microbenchmark )
    # This call is inferred from the output below (expressions and neval = 50)
    bm <- microbenchmark( eval(rcpp.loop) , eval(for.loop) , times = 50 )
    print( bm , unit = "relative" , digits = 2 , "median" )

    Unit: relative
                expr min  lq median  uq max neval
     eval(rcpp.loop)   1   1      1   1   1    50
      eval(for.loop) 533 462    442 428 325    50