I am having problems using the data.table package. I use this package because it seems very fast and memory efficient, and I will be working with a very large dataset (~6m rows x 300 columns).
Here is a small example of the problem I came across:
    library(data.table)

    # Numeric table with an integer key column
    AA <- matrix(runif(50,0,100), 10,5)
    AA <- data.table(AA)
    colnames(AA) <- c("one","two","three","four","five")
    AA[,"key"] <- c(1:10)
    setkey(AA,key)

    # Character table (columns V1 and V2) with the same key
    BB <- matrix(c("A1","A1","B1","A1","C1","F1","T1","Y1","S1","S1","B2","C2","V2","G2","R2","U2","P2","Q2","A2","R2"),10,2)
    BB <- data.table(BB)
    BB[,"key"] <- c(1:10)
    setkey(BB,key)

    # Join the two tables on key
    CC <- AA[BB]
This gives the following:

    > CC
          key       one       two     three     four     five V1 V2
     [1,]   1 70.528360  7.901987 66.827238 44.51487 26.22273 A1 B2
     [2,]   2 38.560889 31.808611  7.877950 34.51093 51.27989 A1 C2
     [3,]   3 70.164154 16.636281 59.127573 79.95673 19.07643 B1 V2
     [4,]   4 82.019267 86.958215  3.335632 44.19048 46.29047 A1 G2
     [5,]   5 24.980403 25.352212 78.240760 93.69818 46.64401 C1 R2
     [6,]   6  1.062644 30.214449 15.920193 35.15496 97.86995 F1 U2
     [7,]   7  5.242374 47.591899 56.879902 70.05319 82.48689 T1 P2
     [8,]   8 69.646271 69.576102 38.766948 38.62866 74.69404 Y1 Q2
     [9,]   9 25.335255 54.638416  5.777238 80.87692 34.11951 S1 A2
    [10,]  10 54.844424 18.645826 59.370042 48.24352 84.02630 S1 R2
What I'm trying to do is aggregate the data by V1 and by V2:

    > CC[,length(one), by=V1]
         V1 V1.1
    [1,] A1    3
    [2,] B1    1
    [3,] C1    1
    [4,] F1    1
    [5,] T1    1
    [6,] Y1    1
    [7,] S1    2
    > CC[,length(one), by=V2]
         V2 V1
    [1,] B2  1
    [2,] C2  1
    [3,] V2  1
    [4,] G2  1
    [5,] R2  2
    [6,] U2  1
    [7,] P2  1
    [8,] Q2  1
    [9,] A2  1
The problem I am facing is this: if I do not know the names of the grouping columns in advance, or if I want to run a loop (say, 100 columns to get 100 different aggregates), how can I do this?
The data.table reference manual states that this is by design, since variables are evaluated within the scope of the table, so CC[, V1] will give one column, while CC[, "V1"] won't. It says you can use something like:

    x <- quote(V1)
    CC[,length(one), by=eval(x)]
But this does not work. I have tried several things, such as putting the variable names in a vector and various combinations of quote(), noquote() and enquote(), but I cannot figure out whether this is possible.
How can I set this up to loop through a list of column names and aggregate by each of them?
If not, are there better ways to quickly aggregate a large dataset?
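To make the goal concrete, here is a minimal sketch of the kind of loop I am after. It assumes a current data.table version in which `by` accepts a character vector of column names (an assumption; the version that produced the output above may behave differently). The small `CC` below is a stand-in with rounded values, and `group_cols` and `counts` are names made up for illustration.

```r
library(data.table)

# Stand-in for CC: same column names as above, values rounded (illustrative only)
CC <- data.table(
  one = c(70.5, 38.6, 70.2, 82.0, 25.0, 1.1, 5.2, 69.6, 25.3, 54.8),
  V1  = c("A1", "A1", "B1", "A1", "C1", "F1", "T1", "Y1", "S1", "S1"),
  V2  = c("B2", "C2", "V2", "G2", "R2", "U2", "P2", "Q2", "A2", "R2")
)

# Hypothetical list of grouping columns, known only as strings
group_cols <- c("V1", "V2")

# One aggregate per grouping column; `col` holds the column name as a string
# (assumes `by` accepts a character vector, as in recent data.table releases)
counts <- lapply(group_cols, function(col) CC[, .N, by = col])
names(counts) <- group_cols

print(counts$V1)
print(counts$V2)
```

With this, `counts` would be a named list holding one small table of group sizes per grouping column, which is the shape of result I would like to get for around 100 columns.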
Thanks.