Creating an efficient week-over-week calculation with subsetting

In my working dataset, I am trying to calculate week-over-week changes in wholesale and revenue. The code seems to work, but by my estimate it will take about 75 hours to run what looks like a simple calculation. Below is a generic, reproducible version that takes about 2 minutes to run on this smaller dataset:

    ################################################################################
    # MAKE A GENERIC REPRODUCIBLE QUESTION
    ################################################################################

    # Create empty data frame of 26,000 observations similar to my data, but populated with noise
    exampleData <- data.frame(product = rep(LETTERS,1000),
                              wholesale = rnorm(1000*26),
                              revenue = rnorm(1000*26))

    # create a week_ending column which increases by one week with every set of 26 "products"
    for(i in 1:nrow(exampleData)){
      exampleData$week_ending[i] <- as.Date("2016-09-04")+7*floor((i-1)/26)
    }
    exampleData$week_ending <- as.Date(exampleData$week_ending, origin = "1970-01-01")

    # create empty columns to fill
    exampleData$wholesale_wow <- NA
    exampleData$revenue_wow <- NA

    # loop through the wholesale and revenue numbers and append the week-over-week changes
    for(i in 1:nrow(exampleData)){
      # only append the week-over-week values if it is not the first week
      if(exampleData$week_ending[i]!="2016-09-04"){
        # set temporary values for the current and past week wholesale value
        currentWholesale <- exampleData$wholesale[i]
        lastWeekWholesale <- exampleData$wholesale[which(exampleData$product==exampleData$product[i] &
                                                           exampleData$week_ending==exampleData$week_ending[i]-7)]
        exampleData$wholesale_wow[i] <- currentWholesale/lastWeekWholesale -1

        # set temporary values for the current and past week revenue
        currentRevenue <- exampleData$revenue[i]
        lastWeekRevenue <- exampleData$revenue[which(exampleData$product==exampleData$product[i] &
                                                       exampleData$week_ending==exampleData$week_ending[i]-7)]
        exampleData$revenue_wow[i] <- currentRevenue/lastWeekRevenue -1
      }
    }

Any help understanding why this is taking so long or how to shorten the time would be greatly appreciated!

+5
2 answers

The first for loop can be simplified with the following:

    exampleData$week_ending2 <- as.Date("2016-09-04") + 7 * floor((seq_len(nrow(exampleData)) - 1) / 26)

    setequal(exampleData$week_ending, exampleData$week_ending2)
    # [1] TRUE

The second for loop can be replaced with data.table's shift(), which lags each column by one row within each product:

    library(data.table)
    dt1 <- as.data.table(exampleData)

    dt1[, wholesale_wow := wholesale / shift(wholesale) - 1 , by = product]
    dt1[, revenue_wow := revenue / shift(revenue) - 1 , by = product]

    setequal(exampleData, dt1)
    # [1] TRUE

This takes about 4 milliseconds to run on my laptop.
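The shift() call above lags by row position within each product, so it relies on the rows being sorted by week and on no week being missing. If that cannot be guaranteed, a join on the previous week's date reproduces the matching rule of the original loop more literally. The following is only a minimal sketch on a small made-up table; the names dt, prev and last_wholesale are illustrative and not part of the answer above.

```r
library(data.table)

# Small made-up table: 3 products, 4 consecutive weeks (illustrative data only)
set.seed(42)
dt <- data.table(
  product     = rep(LETTERS[1:3], times = 4),
  week_ending = rep(as.Date("2016-09-04") + 7 * 0:3, each = 3),
  wholesale   = runif(12, 1, 2)
)

# Lookup table: every value, relabelled with the week it is the "previous week" of
prev <- dt[, .(product,
               week_ending    = week_ending + 7,
               last_wholesale = wholesale)]

# Update join: attach last week's value to the matching product/week row in dt.
# Rows with no previous week (the first week, or a gap) keep NA.
dt[prev, on = .(product, week_ending), last_wholesale := i.last_wholesale]

dt[, wholesale_wow := wholesale / last_wholesale - 1]
```

On sorted data with no gaps this should give the same values as the shift() version, while a missing week yields NA rather than a comparison against the wrong week.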

+6

Here is a vectorized solution using the tidyr package.

    set.seed(123)
    # Create empty data frame of 26,000 observations similar to my data, but populated with noise
    exampleData <- data.frame(product = rep(LETTERS,1000),
                              wholesale = rnorm(1000*26),
                              revenue = rnorm(1000*26))

    # create a week_ending column which increases by one week with every set of 26 "products"
    # vectorize the creation of the dates
    i <- 1:nrow(exampleData)
    exampleData$week_ending <- as.Date("2016-09-04") + 7*floor((i-1)/26)
    exampleData$week_ending <- as.Date(exampleData$week_ending, origin = "1970-01-01")

    # create empty columns to fill
    exampleData$wholesale_wow <- NA
    exampleData$revenue_wow <- NA

    # find the index of rows of interest (i.e. removing the first week)
    i <- i[exampleData$week_ending != "2016-09-04"]

    library(tidyr)

    # create temp variables and convert into wide format
    # the rows are products and the columns are the ending weeks
    Wholesale <- exampleData[ , c(1,2,4)]
    Wholesale <- spread(Wholesale, week_ending, wholesale)
    Revenue <- exampleData[ , c(1,3,4)]
    Revenue <- spread(Revenue, week_ending, revenue)

    # number of columns
    numCol <- ncol(Wholesale)

    # remove the first two columns for current wholesale
    # remove the first and last column for last week wholesale
    # perform calculation on every element in the data frame (divide this week / last week)
    Wholesale_wow <- Wholesale[ , -c(1, 2)] / Wholesale[ , -c(1, numCol)] - 1
    # convert back to long format
    Wholesale_wow <- gather(Wholesale_wow)

    # repeat for revenue
    Revenue_wow <- Revenue[ , -c(1, 2)] / Revenue[ , -c(1, numCol)] - 1
    # convert back to long format
    Revenue_wow <- gather(Revenue_wow)

    # assemble calculated values back into the original data frame
    exampleData$wholesale_wow[i] <- Wholesale_wow$value
    exampleData$revenue_wow[i] <- Revenue_wow$value

The strategy is to convert the source data to a wide format, where the rows are the product identifiers and the columns are the weeks. Then divide one data frame (this week's values) by the other (last week's values), convert the result back to long format, and add the newly computed values to the exampleData data frame. This works; it is not very clean, but it is much faster than the loop. The dplyr package is another tool for this kind of work.
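For the dplyr route mentioned above, a grouped lag is one possible approach. The following is a sketch only, not the answer author's code: it uses a smaller version of the example data (10 weeks instead of 1000) and assumes that every product has one row per consecutive week. The name result is illustrative.

```r
library(dplyr)

# Smaller version of the example data: 26 products x 10 weeks
set.seed(123)
exampleData <- data.frame(
  product   = rep(LETTERS, 10),
  wholesale = rnorm(10 * 26),
  revenue   = rnorm(10 * 26)
)
exampleData$week_ending <- as.Date("2016-09-04") +
  7 * floor((seq_len(nrow(exampleData)) - 1) / 26)

# Within each product, divide each value by the previous week's value.
# order_by makes the lag follow week_ending rather than the current row order.
result <- exampleData %>%
  group_by(product) %>%
  mutate(
    wholesale_wow = wholesale / lag(wholesale, order_by = week_ending) - 1,
    revenue_wow   = revenue / lag(revenue, order_by = week_ending) - 1
  ) %>%
  ungroup()
```

The first week of each product has no predecessor, so its week-over-week values come out as NA, matching the behaviour of the original loop.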

To compare the results of this code with yours, use a check such as:

 print(identical(goldendata, exampleData)) 

Here goldendata holds your known-good results. Be sure both runs use the same random numbers by calling set.seed() with the same seed before generating the data.
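identical() demands an exact match, including the last bit of every floating-point value, so two correct methods that do the arithmetic in a different order can still compare as different. A tolerance-based check with all.equal() is a common alternative. The sketch below uses made-up data; candidate is a hypothetical stand-in for the result of a second method.

```r
# Two data frames holding the same quantity computed in two slightly different ways
set.seed(123)
x <- rnorm(5)
goldendata <- data.frame(id = 1:5, value = x / 3)
candidate  <- data.frame(id = 1:5, value = x * (1 / 3))

# Exact comparison: can be FALSE because of rounding in the last bit
exact_match <- identical(goldendata, candidate)

# Tolerance-based comparison: all.equal() returns TRUE or a description of the
# differences, so wrap it in isTRUE() to get a plain logical
close_match <- isTRUE(all.equal(goldendata, candidate))
print(close_match)
```

When comparing against a data.table result, converting it with as.data.frame() first keeps class differences from being reported as mismatches.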

+1

Source: https://habr.com/ru/post/1272173/

