Split-apply-recombine with plyr and data.table in R

I am doing the classic split-apply-recombine job in R. My dataset is a panel of firms over time. The apply step runs a regression for each firm and returns the residuals, so I am not aggregating by firm. plyr is great for this, but it takes a very long time when the number of firms is large. Is there a way to do this with data.table?

Sample data:

 dte, id, val1, val2
 2001-10-02, 1, 10, 25
 2001-10-03, 1, 11, 24
 2001-10-04, 1, 12, 23
 2001-10-02, 2, 13, 22
 2001-10-03, 2, 14, 21

I need to split on each id (namely 1 and 2), run the regression, return the residuals, and add them as a column to my data. Is there a way to do this with data.table?
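For context, the plyr approach being replaced probably looks something like this (a sketch, not the asker's actual code; `ddply` with `transform` attaches the per-group residuals as a new column):

```r
library(plyr)

# Sample data from the question
dat <- data.frame(dte  = as.Date(c("2001-10-02", "2001-10-03", "2001-10-04",
                                   "2001-10-02", "2001-10-03")),
                  id   = c(1, 1, 1, 2, 2),
                  val1 = c(10, 11, 12, 13, 14),
                  val2 = c(25, 24, 23, 22, 21))

# One lm() per id; residuals added as a column, rows recombined in order
res <- ddply(dat, "id", transform, resid = residuals(lm(val1 ~ val2)))
```

This is exactly the pattern that becomes slow as the number of groups grows, which is what the answers below address.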

2 answers

I assume this needs to be sorted by "id" for the fit to line up correctly. Fortunately, that happens automatically when you set the key:

 dat <- read.table(text="dte, id, val1, val2
 2001-10-02, 1, 10, 25
 2001-10-03, 1, 11, 24
 2001-10-04, 1, 12, 23
 2001-10-02, 2, 13, 22
 2001-10-03, 2, 14, 21
 ", header=TRUE, sep=",")
 dtb <- data.table(dat)
 setkey(dtb, "id")
 dtb[, residuals(lm(val1 ~ val2)), by="id"]
 #---------------
 cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"])
 #---------------
             dte id val1 val2 id.1            V1
 [1,] 2001-10-02  1   10   25    1  1.631688e-15
 [2,] 2001-10-03  1   11   24    1 -3.263376e-15
 [3,] 2001-10-04  1   12   23    1  1.631688e-15
 [4,] 2001-10-02  2   13   22    2  0.000000e+00
 [5,] 2001-10-03  2   14   21    2  0.000000e+00

 > dat <- data.frame(dte=Sys.Date()+1:1000000,
                     id=sample(1:2, 1000000, repl=TRUE),
                     val1=runif(1000000), val2=runif(1000000))
 > dtb <- data.table(dat)
 > setkey(dtb, "id")
 > system.time( cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"]) )
    user  system elapsed
   1.696   0.798   2.466
 > system.time( dtb[, transform(.SD, r = residuals(lm(val1 ~ val2))), by="id"] )
    user  system elapsed
   1.757   0.908   2.690

EDIT from Matthew: This is all correct for v1.8.0 on CRAN, with the slight addition that transform in j is the subject of data.table wiki point 2: "For speed, don't transform() by group, cbind() afterwards". But := now works by group in v1.8.1 and is both simple and fast. See my answer for an illustration (no need to vote for it, though).

Ok, I voted for it. Here is the console command to install v1.8.1 on a Mac (assuming you have the necessary Xcode tools available, since it is source-only):

 install.packages("data.table", repos="http://R-Forge.R-project.org", type="source",
                  lib="/Library/Frameworks/R.framework/Versions/2.14/Resources/lib")

(For some reason, I was unable to get the Mac GUI package installer to read r-forge as a repository.)


DWin's answer is correct for v1.8.0 (as it is currently on CRAN). But in v1.8.1 (in the R-Forge repository), := now works by group. It also works on non-contiguous groups, so you do not need to setkey first to line the groups up.

 dtb <- as.data.table(dat)
 dtb
           dte id val1 val2
 1: 2001-10-02  1   10   25
 2: 2001-10-03  1   11   24
 3: 2001-10-04  1   12   23
 4: 2001-10-02  2   13   22
 5: 2001-10-03  2   14   21
 dtb[, resid := residuals(lm(val1 ~ val2)), by=id]
           dte id val1 val2         resid
 1: 2001-10-02  1   10   25  1.631688e-15
 2: 2001-10-03  1   11   24 -3.263376e-15
 3: 2001-10-04  1   12   23  1.631688e-15
 4: 2001-10-02  2   13   22  0.000000e+00
 5: 2001-10-03  2   14   21  0.000000e+00

To upgrade to v1.8.1, just install it from the R-Forge repo (R 2.15.0+ is required when installing any binary package from R-Forge):

 install.packages("data.table", repos="http://R-Forge.R-project.org") 

or install from source if you cannot upgrade to the latest R. data.table itself only needs R 2.12.0+.

Extending to the 1MM-row case:

 DT = data.table(dte=Sys.Date()+1:1000000,
                 id=sample(1:2, 1000000, repl=TRUE),
                 val1=runif(1000000), val2=runif(1000000))
 setkey(DT, id)
 system.time( ans1 <- cbind(DT, DT[, residuals(lm(val1 ~ val2)), by="id"]) )
    user  system elapsed
  12.272   0.872  13.182
 ans1
                 dte id      val1       val2 id           V1
       1: 2012-07-02  1 0.8369147 0.57553383  1  0.336647598
       2: 2012-07-05  1 0.0109102 0.02532214  1 -0.488633325
       3: 2012-07-06  1 0.4977762 0.16607786  1 -0.001952414
      ---
  999998: 4750-05-27  2 0.1296722 0.62645838  2 -0.370627034
  999999: 4750-05-28  2 0.2686352 0.04890710  2 -0.231952238
 1000000: 4750-05-29  2 0.9981029 0.91626787  2  0.497948275
 system.time( DT[, resid := residuals(lm(val1 ~ val2)), by=id] )
    user  system elapsed
   7.436   0.648   8.107
 DT
                 dte id      val1       val2        resid
       1: 2012-07-02  1 0.8369147 0.57553383  0.336647598
       2: 2012-07-05  1 0.0109102 0.02532214 -0.488633325
       3: 2012-07-06  1 0.4977762 0.16607786 -0.001952414
      ---
  999998: 4750-05-27  2 0.1296722 0.62645838 -0.370627034
  999999: 4750-05-28  2 0.2686352 0.04890710 -0.231952238
 1000000: 4750-05-29  2 0.9981029 0.91626787  0.497948275

In the above example there are only two groups, the data is fairly small at under 40 MB, and Rprof shows 96% of the time is spent in lm. So in cases like this, := by group isn't really for a speed advantage, but more for convenience: less code to write, and no duplicate id columns added to the output. As size grows, avoiding copies comes into it and the speed advantages start to appear. In particular, transform in j will become terribly slow as the number of groups increases.
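Since the profiling above shows lm() itself dominating the runtime, one further speedup (an illustration, not something from the answers) is to call the lower-level lm.fit() with an explicit design matrix, skipping the per-group formula parsing and model-frame construction that lm() repeats for every group:

```r
library(data.table)

set.seed(1)
DT <- data.table(id   = sample(1:2, 1e5, replace = TRUE),
                 val1 = runif(1e5),
                 val2 = runif(1e5))

# lm.fit() takes a numeric design matrix directly (here: intercept + val2),
# avoiding the formula machinery inside each group
DT[, resid := lm.fit(cbind(1, val2), val1)$residuals, by = id]
```

The residuals are numerically identical to those from lm(val1 ~ val2) per group; the saving comes purely from dropping lm()'s setup overhead, which matters most when there are many small groups.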


Source: https://habr.com/ru/post/919411/
