data.table 1.8.11 and problems with aggregation

UPDATE: Fixed quickly. See the answer below.


Here is some interesting behavior I found with data.table 1.8.11 (r1101, 2014-01-28). The order of the variables included in the by clause changes the aggregation results:

>   foo = data.table(a=rep(c(0,1,0,1),2), b=rep(c(T,T,F,F),2), c=c(1,1,1,1,1,1,1,1))
>   foo
   a     b c
1: 0  TRUE 1
2: 1  TRUE 1
3: 0 FALSE 1
4: 1 FALSE 1
5: 0  TRUE 1
6: 1  TRUE 1
7: 0 FALSE 1
8: 1 FALSE 1
>   foo[, .N, by=list(b, a)]
       b a N
1:  TRUE 0 1
2:  TRUE 1 1
3: FALSE 0 1
4: FALSE 1 1
5:  TRUE 0 1
6:  TRUE 1 1
7: FALSE 0 1
8: FALSE 1 1
>   foo[, .N, by=list(a, b)]
   a     b N
1: 0  TRUE 2
2: 1  TRUE 2
3: 0 FALSE 2
4: 1 FALSE 2
> 

This does not happen in the stable release of data.table (1.8.10).
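For anyone who wants to check their own build, here is a minimal sketch of an order-independence check. It rebuilds the same `foo` table, counts with both `by` orders, and compares them against a base R `table()` count as an independent reference. The variable names `n_ab`, `n_ba`, and `ref` are just illustrative, and the expected counts (four groups of two) come from the output shown above for 1.8.10.

```r
library(data.table)

# Same data as in the question
foo <- data.table(a = rep(c(0, 1, 0, 1), 2),
                  b = rep(c(TRUE, TRUE, FALSE, FALSE), 2),
                  c = rep(1, 8))

# Group counts with the by columns in both orders
n_ab <- foo[, .N, by = list(a, b)]
n_ba <- foo[, .N, by = list(b, a)]

# Base R reference count, independent of data.table's grouping code
ref <- as.data.frame(table(a = foo$a, b = foo$b))

# Each of the four (a, b) combinations should appear once, with a count of 2
stopifnot(nrow(n_ab) == 4L, nrow(n_ba) == 4L)
stopifnot(all(n_ab$N == 2L), all(n_ba$N == 2L))
stopifnot(all(ref$Freq == 2L))
```

On a build with the problem described here, the `nrow(n_ba) == 4L` check should be the one that fails, since `by=list(b, a)` returned eight groups of one.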

1 answer

Thanks for reporting. This is now fixed in v1.8.11, commit 1103. From NEWS:

o [...] fastorder [...]. #5307. Clayton Stanley, SO: data.table 1.8.11 [...]


require(data.table) # commit 1103 v1.8.11
foo[, .N, by=list(b,a)]
       b a N
1:  TRUE 0 2
2:  TRUE 1 2
3: FALSE 0 2
4: FALSE 1 2

foo[, .N, by=list(a,b)]
   a     b N
1: 0  TRUE 2
2: 1  TRUE 2
3: 0 FALSE 2
4: 1 FALSE 2

Source: https://habr.com/ru/post/1524327/

