R data.table: melt several columns into one column and sum

I have the following data.table:

    > dt = data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"),
    +                 sales_amt = c(500, 600, 700, 800),
    +                 cost_ccy  = c("GBP", "USD", "GBP", "USD"),
    +                 cost_amt  = c(-100, -200, -300, -400))
    > dt
       sales_ccy sales_amt cost_ccy cost_amt
    1:       USD       500      GBP     -100
    2:       EUR       600      USD     -200
    3:       GBP       700      GBP     -300
    4:       USD       800      USD     -400

My goal is to get the following data.table:

    > dt
       ccy total_amt
    1: EUR       600
    2: GBP       300
    3: USD       700

Basically, I want to sum all costs and sales together by currency. This data.table actually has > 500,000 rows, so I would like a quick and efficient way to compute the sums.

Any idea on a quick way to do this?

5 answers

Using data.table v1.9.6+, which has an improved version of melt that can melt multiple columns at the same time:

    require(data.table) # v1.9.6+
    melt(dt, measure = patterns("_ccy$", "_amt$")
         )[, .(tot_amt = sum(value2)), keyby = .(ccy = value1)]
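As a quick sanity check (my addition, not part of the original answer), running this on the four-row example reproduces the desired output; value1 and value2 are the default names melt assigns to the two melted column groups:

    melted <- melt(dt, measure = patterns("_ccy$", "_amt$"))
    melted[, .(tot_amt = sum(value2)), keyby = .(ccy = value1)]
    #    ccy tot_amt
    # 1: EUR     600
    # 2: GBP     300
    # 3: USD     700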

You can consider merged.stack from my splitstackshape package.

Here I also used "dplyr" for the pipe, but you can skip that if you want.

    library(dplyr)
    library(splitstackshape)

    dt %>%
      mutate(id = 1:nrow(dt)) %>%
      merged.stack(var.stubs = c("ccy", "amt"),
                   sep = "var.stubs", atStart = FALSE) %>%
      .[, .(total_amt = sum(amt)), by = ccy]
    #    ccy total_amt
    # 1: GBP       300
    # 2: USD       700
    # 3: EUR       600

The development version of "data.table" should be able to handle melting groups of columns directly. It is also faster than merged.stack.
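If you want to verify that on your own data, here is a minimal timing sketch along the lines of the benchmark further down (my addition; it assumes data.table v1.9.6+, splitstackshape, and microbenchmark are installed, and timings will vary by machine):

    library(microbenchmark)
    microbenchmark(
      melt_groups = melt(dt, measure = patterns("_ccy$", "_amt$")
                         )[, .(tot_amt = sum(value2)), keyby = .(ccy = value1)],
      merged_stack = dt %>%
        mutate(id = 1:nrow(dt)) %>%
        merged.stack(var.stubs = c("ccy", "amt"),
                     sep = "var.stubs", atStart = FALSE) %>%
        .[, .(total_amt = sum(amt)), by = ccy]
    )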


Even messier than @Pgibas's solution:

    dt[, list(c(sales_ccy, cost_ccy),
              c(sum(sales_amt), sum(cost_amt))),  # this creates two new columns with ccy and amt
       by = list(sales_ccy, cost_ccy)  # rows reduced to unique (sales_ccy, cost_ccy) combinations
       ][, sum(V2),  # this aggregates the new columns
         by = V1]
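To check that this gives the expected result (my addition, naming the default V1/V2 columns along the way):

    res <- dt[, list(c(sales_ccy, cost_ccy),
                     c(sum(sales_amt), sum(cost_amt))),
              by = list(sales_ccy, cost_ccy)
              ][, .(total_amt = sum(V2)), keyby = .(ccy = V1)]
    res
    #    ccy total_amt
    # 1: EUR       600
    # 2: GBP       300
    # 3: USD       700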

Benchmark

I ran a couple of tests to compare my code against the data.table 1.9.5 solution proposed by Arun.

Just one observation: I generated the 500K+ rows by duplicating the original data.table. This kept the number of distinct sales_ccy / cost_ccy pairs small, which also kept the number of rows passed to the second data.table [] small (only 8 rows were created in this case).

I don't think that in a real-world scenario the number of intermediate rows will be anywhere near 500K+ (it is bounded by N^2, where N is the number of currencies used), but it is something to keep in mind when looking at these results.
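To make that concrete (a small check of my own on the example data): the intermediate table holds one stacked pair of rows per distinct (sales_ccy, cost_ccy) combination, so its size is bounded by the number of such pairs, not by the total row count:

    # 4 distinct (sales_ccy, cost_ccy) pairs in the example data,
    # each contributing 2 stacked rows -> 8 intermediate rows,
    # no matter how many times dt is duplicated
    nrow(unique(dt[, .(sales_ccy, cost_ccy)]))
    # [1] 4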

    library(data.table)
    library(microbenchmark)

    rm(dt)
    dt <- data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"),
                     sales_amt = c(500, 600, 700, 800),
                     cost_ccy  = c("GBP", "USD", "GBP", "USD"),
                     cost_amt  = c(-100, -200, -300, -400))
    dt

    # duplicate the table 17 times: 4 * 2^17 = 524,288 rows
    for (i in 1:17) dt <- rbind(dt, dt)

    mycode <- function() {
      test1 <- dt[, list(c(sales_ccy, cost_ccy),
                         c(sum(sales_amt), sum(cost_amt))), # two new columns with ccy and amt
                  keyby = list(sales_ccy, cost_ccy)
                  ][, sum(V2),                              # aggregate the new columns
                    by = V1]
      rm(test1)
    }

    suggestedEdit <- function() {
      test2 <- dt[, .(c(sales_ccy, cost_ccy), c(sales_amt, cost_amt)) # combine cols
                  ][, .(tot_amt = sum(V2)), keyby = .(ccy = V1)]      # aggregate + reorder
      rm(test2)
    }

    meltWithDataTable195 <- function() {
      test3 <- melt(dt, measure = list(c(1, 3), c(2, 4))
                    )[, .(tot_amt = sum(value2)), keyby = .(ccy = value1)]
      rm(test3)
    }

    microbenchmark(
      mycode(),
      suggestedEdit(),
      meltWithDataTable195()
    )

Result

    Unit: milliseconds
                       expr      min       lq     mean   median       uq      max neval
                   mycode() 12.27895 12.47456 15.04098 12.80956 14.73432 45.26173   100
            suggestedEdit() 25.36581 29.56553 42.52952 33.39229 59.72346 69.74819   100
     meltWithDataTable195() 25.71558 30.97693 47.77700 58.68051 61.23996 66.49597   100

Edit: another way to do this is with aggregate():

    df  <- data.frame(ccy       = c(dt$sales_ccy, dt$cost_ccy),
                      total_amt = c(dt$sales_amt, dt$cost_amt))
    out <- aggregate(total_amt ~ ccy, data = df, sum)
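On the example data this yields the desired summary (aggregate sorts by the grouping column); note that the result is a plain data.frame, not a data.table:

    out
    #   ccy total_amt
    # 1 EUR       600
    # 2 GBP       300
    # 3 USD       700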

Dirty but working

    # Bind costs and sales
    df <- rbind(dt[, list(ccy = cost_ccy,  total_amt = cost_amt)],
                dt[, list(ccy = sales_ccy, total_amt = sales_amt)])
    # Sum for every currency
    df[, sum(total_amt), by = ccy]
    #    ccy  V1
    # 1: GBP 300
    # 2: USD 700
    # 3: EUR 600
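If you prefer the summed column to be called total_amt instead of the default V1, you can name it in j (a small addition to the answer above):

    df[, .(total_amt = sum(total_amt)), by = ccy]
    #    ccy total_amt
    # 1: GBP       300
    # 2: USD       700
    # 3: EUR       600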

Source: https://habr.com/ru/post/985869/

