Using data.table: trying to aggregate data by column index

I am having problems using the data.table package. I use it because it seems very fast and memory-efficient, and I need it to work with a very large dataset (~6 million rows x 300 columns).

A small example of the problem I ran into:

    AA <- matrix(runif(50,0,100), 10,5)
    AA <- data.table(AA)
    colnames(AA) <- c("one","two","three","four","five")
    AA[,"key"] <- c(1:10)
    setkey(AA,key)

    BB <- matrix(c("A1","A1","B1","A1","C1","F1","T1","Y1","S1","S1",
                   "B2","C2","V2","G2","R2","U2","P2","Q2","A2","R2"),10,2)
    BB <- data.table(BB)
    BB[,"key"] <- c(1:10)
    setkey(BB,key)

    CC <- AA[BB]

This gives the following:

    > CC
          key       one       two     three     four     five V1 V2
     [1,]   1 70.528360  7.901987 66.827238 44.51487 26.22273 A1 B2
     [2,]   2 38.560889 31.808611  7.877950 34.51093 51.27989 A1 C2
     [3,]   3 70.164154 16.636281 59.127573 79.95673 19.07643 B1 V2
     [4,]   4 82.019267 86.958215  3.335632 44.19048 46.29047 A1 G2
     [5,]   5 24.980403 25.352212 78.240760 93.69818 46.64401 C1 R2
     [6,]   6  1.062644 30.214449 15.920193 35.15496 97.86995 F1 U2
     [7,]   7  5.242374 47.591899 56.879902 70.05319 82.48689 T1 P2
     [8,]   8 69.646271 69.576102 38.766948 38.62866 74.69404 Y1 Q2
     [9,]   9 25.335255 54.638416  5.777238 80.87692 34.11951 S1 A2
    [10,]  10 54.844424 18.645826 59.370042 48.24352 84.02630 S1 R2

What I'm trying to do is aggregate the data by V1 and by V2:

    > CC[,length(one), by=V1]
         V1 V1.1
    [1,] A1    3
    [2,] B1    1
    [3,] C1    1
    [4,] F1    1
    [5,] T1    1
    [6,] Y1    1
    [7,] S1    2

    > CC[,length(one), by=V2]
         V2 V1
    [1,] B2  1
    [2,] C2  1
    [3,] V2  1
    [4,] G2  1
    [5,] R2  2
    [6,] U2  1
    [7,] P2  1
    [8,] Q2  1
    [9,] A2  1

The problem I am facing is this: what if I do not know the names of the grouping columns in advance, or I want to run a loop, say over 100 columns to get 100 different aggregates? How can I do this?

The data.table reference manual says this is by design: variables are looked up within the scope of the data.table, so CC[, V1] returns the column, while CC[, "V1"] won't. It says you can use something like:

    x <- quote(V1)
    CC[,length(one), by=eval(x)]

But this does not work. I have tried several things, such as putting the variable names in a vector and various combinations of quote(), noquote(), and enquote(), but I cannot figure out whether this is possible.

How can I set this up to loop over a list of column names and aggregate by each of them?

If not, are there better ways to quickly aggregate a large dataset?

Thanks.

1 answer

I'm not sure what you are after. I think you may need to come up with a better example of what you are trying to do.

You can, for example, pass a character vector to by, so this will work:

    agg.by <- "V1"
    CC[, length(one), by=agg.by]
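Since by accepts a character vector, looping over many grouping columns comes down to looping over their names. Here is a rough sketch of that idea. It assumes a small stand-in for your CC table with fixed values, and the names group.cols and counts are made up for illustration. I use .N (the number of rows in each group) instead of length(one).

```r
library(data.table)

# Small stand-in for the CC table from the question (values are illustrative)
CC <- data.table(
  id  = 1:10,
  one = c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50),
  V1  = c("A1", "A1", "B1", "A1", "C1", "F1", "T1", "Y1", "S1", "S1"),
  V2  = c("B2", "C2", "V2", "G2", "R2", "U2", "P2", "Q2", "A2", "R2")
)

# Grouping columns given as strings; they could also come from
# column positions, e.g. names(CC)[3:4]
group.cols <- c("V1", "V2")

# One aggregate per grouping column; .N is the row count of each group
counts <- lapply(group.cols, function(col) CC[, .N, by = col])
names(counts) <- group.cols

counts[["V1"]]
```

Each element of counts is its own data.table, so with 100 grouping columns you would get a list of 100 aggregates.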

If you want to summarize "unknown" columns within your subsets, you can lapply over .SD, the data.table holding the subset of data for each group, which is in scope inside each of your aggregations, for example:

 CC[, lapply(.SD, mean), by=agg.by] 

If you are only summarizing a few columns of the original data.table, use the .SDcols argument, for example:

 CC[, lapply(.SD, mean), by=agg.by, .SDcols=c('one', 'two')] 
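If the columns are only known by position, one option is to turn the indices into names first and pass those to .SDcols. A minimal sketch, assuming a toy table with made-up values; value.cols is a name I picked for illustration:

```r
library(data.table)

# Toy table with made-up values
CC <- data.table(
  one   = c(10, 20, 30, 40),
  two   = c(1, 2, 3, 4),
  three = c(100, 200, 300, 400),
  V1    = c("A1", "A1", "B1", "B1")
)

# Columns to summarize, chosen by position rather than typed out
value.cols <- names(CC)[1:2]
agg.by <- "V1"

# Mean of each selected column within each group of V1
res <- CC[, lapply(.SD, mean), by = agg.by, .SDcols = value.cols]
res
```

Restricting .SD this way also keeps non-numeric columns out of mean().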

I think some combination of the above will address the question you are asking, but it is hard for me to tell exactly what you are after.

If you can provide a better piece of sample data and the expected results, I will be happy to help.


Source: https://habr.com/ru/post/1393519/

