R data.table: melt several columns into one column and sum

I have the following data.table:

    > dt = data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"),
    +                 sales_amt = c(500, 600, 700, 800),
    +                 cost_ccy  = c("GBP", "USD", "GBP", "USD"),
    +                 cost_amt  = c(-100, -200, -300, -400))
    > dt
       sales_ccy sales_amt cost_ccy cost_amt
    1:       USD       500      GBP     -100
    2:       EUR       600      USD     -200
    3:       GBP       700      GBP     -300
    4:       USD       800      USD     -400

My goal is to get the following data.table:

    > dt
       ccy total_amt
    1: EUR       600
    2: GBP       300
    3: USD       700

Basically, I want to sum all costs and sales together by currency. This data.table actually has > 500,000 rows, so I would like a quick and efficient way to compute the sums.

Any idea on a quick way to do this?

5 answers

Using data.table v1.9.6+, which has an improved version of melt that can melt multiple columns at the same time:

    require(data.table) # v1.9.6+
    melt(dt, measure = patterns("_ccy$", "_amt$")
         )[, .(tot_amt = sum(value2)), keyby = .(ccy = value1)]
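As a quick sanity check (my addition, not part of the original answer), running this on the four-row example reproduces the desired output; value1 and value2 are the default names melt assigns to the two melted column groups:

    melted <- melt(dt, measure = patterns("_ccy$", "_amt$"))
    melted[, .(tot_amt = sum(value2)), keyby = .(ccy = value1)]
    #    ccy tot_amt
    # 1: EUR     600
    # 2: GBP     300
    # 3: USD     700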

You can consider merged.stack from my splitstackshape package.

Here I also used "dplyr" for the pipe, but you can skip that if you want.

    library(dplyr)
    library(splitstackshape)

    dt %>%
      mutate(id = 1:nrow(dt)) %>%
      merged.stack(var.stubs = c("ccy", "amt"),
                   sep = "var.stubs", atStart = FALSE) %>%
      .[, .(total_amt = sum(amt)), by = ccy]
    #    ccy total_amt
    # 1: GBP       300
    # 2: USD       700
    # 3: EUR       600

The development version of "data.table" should be able to handle melting groups of columns directly. It is also faster than merged.stack.
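If you want to verify that on your own data, here is a minimal timing sketch along the lines of the benchmark further down (my addition; it assumes data.table v1.9.6+, splitstackshape, and microbenchmark are installed, and timings will vary by machine):

    library(microbenchmark)
    microbenchmark(
      melt_groups = melt(dt, measure = patterns("_ccy$", "_amt$")
                         )[, .(tot_amt = sum(value2)), keyby = .(ccy = value1)],
      merged_stack = dt %>%
        mutate(id = 1:nrow(dt)) %>%
        merged.stack(var.stubs = c("ccy", "amt"),
                     sep = "var.stubs", atStart = FALSE) %>%
        .[, .(total_amt = sum(amt)), by = ccy]
    )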


Even messier than @Pgibas's solution:

    dt[, list(c(sales_ccy, cost_ccy),
              c(sum(sales_amt), sum(cost_amt))),  # this creates two new columns with ccy and amt
       by = list(sales_ccy, cost_ccy)  # rows reduced to unique (sales_ccy, cost_ccy) combinations
       ][, sum(V2),  # this aggregates the new columns
         by = V1]
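To check that this gives the expected result (my addition, naming the default V1/V2 columns along the way):

    res <- dt[, list(c(sales_ccy, cost_ccy),
                     c(sum(sales_amt), sum(cost_amt))),
              by = list(sales_ccy, cost_ccy)
              ][, .(total_amt = sum(V2)), keyby = .(ccy = V1)]
    res
    #    ccy total_amt
    # 1: EUR       600
    # 2: GBP       300
    # 3: USD       700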

Benchmark

I ran a couple of tests to compare my code against the data.table 1.9.5 solution proposed by Arun.

Just one observation: I generated the 500K+ rows by duplicating the original data.table. This kept the number of distinct sales_ccy / cost_ccy pairs small, which also kept the number of rows passed to the second data.table [] small (only 8 rows were created in this case).

I don't think that in a real-world scenario the number of intermediate rows will be anywhere near 500K+ (it is bounded by N^2, where N is the number of currencies used), but it is something to keep in mind when looking at these results.
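To make that concrete (a small check of my own on the example data): the intermediate table holds one stacked pair of rows per distinct (sales_ccy, cost_ccy) combination, so its size is bounded by the number of such pairs, not by the total row count:

    # 4 distinct (sales_ccy, cost_ccy) pairs in the example data,
    # each contributing 2 stacked rows -> 8 intermediate rows,
    # no matter how many times dt is duplicated
    nrow(unique(dt[, .(sales_ccy, cost_ccy)]))
    # [1] 4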

    library(data.table)
    library(microbenchmark)

    rm(dt)
    dt <- data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"),
                     sales_amt = c(500, 600, 700, 800),
                     cost_ccy  = c("GBP", "USD", "GBP", "USD"),
                     cost_amt  = c(-100, -200, -300, -400))
    dt

    # duplicate the table 17 times: 4 * 2^17 = 524,288 rows
    for (i in 1:17) dt <- rbind(dt, dt)

    mycode <- function() {
      test1 <- dt[, list(c(sales_ccy, cost_ccy),
                         c(sum(sales_amt), sum(cost_amt))), # two new columns with ccy and amt
                  keyby = list(sales_ccy, cost_ccy)
                  ][, sum(V2),                              # aggregate the new columns
                    by = V1]
      rm(test1)
    }

    suggestedEdit <- function() {
      test2 <- dt[, .(c(sales_ccy, cost_ccy), c(sales_amt, cost_amt)) # combine cols
                  ][, .(tot_amt = sum(V2)), keyby = .(ccy = V1)]      # aggregate + reorder
      rm(test2)
    }

    meltWithDataTable195 <- function() {
      test3 <- melt(dt, measure = list(c(1, 3), c(2, 4))
                    )[, .(tot_amt = sum(value2)), keyby = .(ccy = value1)]
      rm(test3)
    }

    microbenchmark(
      mycode(),
      suggestedEdit(),
      meltWithDataTable195()
    )

Result

    Unit: milliseconds
                       expr      min       lq     mean   median       uq      max neval
                   mycode() 12.27895 12.47456 15.04098 12.80956 14.73432 45.26173   100
            suggestedEdit() 25.36581 29.56553 42.52952 33.39229 59.72346 69.74819   100
     meltWithDataTable195() 25.71558 30.97693 47.77700 58.68051 61.23996 66.49597   100

Edit: another way to do this is with aggregate():

    df  <- data.frame(ccy       = c(dt$sales_ccy, dt$cost_ccy),
                      total_amt = c(dt$sales_amt, dt$cost_amt))
    out <- aggregate(total_amt ~ ccy, data = df, sum)
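On the example data this yields the desired summary (aggregate sorts by the grouping column); note that the result is a plain data.frame, not a data.table:

    out
    #   ccy total_amt
    # 1 EUR       600
    # 2 GBP       300
    # 3 USD       700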

Dirty but working

    # Bind costs and sales
    df <- rbind(dt[, list(ccy = cost_ccy,  total_amt = cost_amt)],
                dt[, list(ccy = sales_ccy, total_amt = sales_amt)])
    # Sum for every currency
    df[, sum(total_amt), by = ccy]
    #    ccy  V1
    # 1: GBP 300
    # 2: USD 700
    # 3: EUR 600
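If you prefer the summed column to be called total_amt instead of the default V1, you can name it in j (a small addition to the answer above):

    df[, .(total_amt = sum(total_amt)), by = ccy]
    #    ccy total_amt
    # 1: GBP       300
    # 2: USD       700
    # 3: EUR       600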

Source: https://habr.com/ru/post/985869/

