Group and convert two columns in df to matrix R

I would like to convert the following data.frame into a matrix that counts the number of each ID of the bike station that happens per hour.

> dim (test)
[1] 80,623 5

head (test, n = 10)
   bikeid end.station.id start.station.id diff.time hour
1 16052 244 322 6544 14
2 16052 284 432 3406 21
3 16052 461 519 33416 3
4 16052 228 519 26876 13
5 16052 72 435 388 17
6 16052 319 127 27702 11
7 16052 282 2002 33882 4
8 16052 524 2021 2525 10
9 16052 387 351 2397 12
10 16052 388 526 32507 13


The result should look like this.

> sample2
   start.station.id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1 72 44 1 42 22 9 33 39 47 12 30 39 52 43 45 40 62 9 35 24 43 65 59 58 34
2 79 21 11 2 42 5 18 57 64 32 47 61 43 65 38 46 61 48 29 58 22 35 4 50 31
3 82 19 44 7 52 14 19 3 30 25 60 33 60 48 54 25 24 42 62 13 51 23 43 54 7
4 83 45 60 64 5 0 3 54 16 48 67 49 20 59 21 24 38 42 62 38 24 1 35 16 4
5 116 27 62 64 44 55 65 23 13 36 0 62 54 61 6 16 7 58 41 29 1 34 58 35 67
6 119 45 30 41 26 7 39 16 55 28 53 42 9 5 31 18 16 14 37 17 14 16 17 23 50
7               120  3  2  7 53 21 33 31 48 19 50 35 47  8 17 30  9 49  4 48 28 52  9 57 55
8               127 33 44 47 42  6 46 39 30 39 28 19 57 53 41 45 55  9 27 42 19 43 24 37 55
9               137 53 11 60  1 66 37 16  5  2 58  0 46 33  0 60 54 25 66 65 40 36 47 58 40
10              143 61  1 50 62 57 33 12 15 27 19 65 48 12 55 64 14 22 13 12 57 45 13 66 56 66 56

, :

matrix <- test %>% 
  group_by(start.station.id, hour)%>%
  summarise(sum = nrow) %>%
  spread(hour, nrow) 

,

0
1

data.table:

library(data.table) #1.9.6+
setDT(test)
dcast(test[ , .N, by = .(start.station.id, hour)],
      start.station.id ~ hour, value.var = "N")

(, ):

dcast(test, start.station.id ~ hour, fun.aggregate = length, value.var = "hour")

:

set.seed(10932)
NN <- 1e6
test <- data.table(start.station.id = sample(1000, NN, T),
                   hour = sample(24, NN, T))

library(microbenchmark)

microbenchmark(times = 100L,
               preagg = dcast(test[ , .N, by = .(start.station.id, hour)],
                              start.station.id ~ hour, value.var = "N"),
               postagg = dcast(test, start.station.id ~ hour, 
                               fun.aggregate = length, value.var = "hour"))

Unit: milliseconds
    expr      min       lq      mean   median        uq      max neval
  preagg 55.83240 59.88939  66.56289 61.37408  64.37049 166.8902   100
 postagg 91.16012 93.68588 101.17297 96.04823 101.20717 203.4270   100

, , , test[ , .N, by = vars] data.table.

+1

Source: https://habr.com/ru/post/1626833/


All Articles