The fastest way to sort each row of a large matrix in R

I have a big matrix:

set.seed(1) a <- matrix(runif(9e+07),ncol=300) 

I want to sort each row in a matrix:

 > system.time(sorted <- t(apply(a,1,sort))) user system elapsed 42.48 3.40 45.88 

I have a lot of RAM to work with, but I would like to complete this operation faster.

+6
source share
2 answers

Well, I donโ€™t know how many sorting methods are faster in R, and the problem is that you only sort 300 values, but many times. However, you can get extra performance outside of sorting by directly calling sort.int and using method='quick' :

 set.seed(1) a <- matrix(runif(9e+07),ncol=300) # Your original code system.time(sorted <- t(apply(a,1,sort))) # 31 secs # sort.int with method='quick' system.time(sorted2 <- t(apply(a,1,sort.int, method='quick'))) # 27 secs # using a for-loop is slightly faster than apply (and avoids transpose): system.time({sorted3 <- a; for(i in seq_len(nrow(a))) sorted3[i,] <- sort.int(a[i,], method='quick') }) # 26 secs 

But it is best to use a parallel package to sort matrix parts in parallel. However, the overhead of data transfer is apparently too high, and on my machine it starts to replace, since I "only" have 8 GB of memory:

 library(parallel) cl <- makeCluster(4) system.time(sorted4 <- t(parApply(cl,a,1,sort.int, method='quick'))) # Forever... stopCluster(cl) 
+5
source

The grr package contains an alternative sorting method that can be used to speed up this specific operation (I slightly reduced the size of the matrix so that this test does not run forever):

 > set.seed(1) > a <- matrix(runif(9e+06),ncol=300) > microbenchmark::microbenchmark(sorted <- t(apply(a,1,sort)) + ,sorted2 <- t(apply(a,1,sort.int, method='quick')) + ,sorted3 <- t(apply(a,1,grr::sort2)),times=3,unit='s') Unit: seconds expr min lq mean median uq max neval sorted <- t(apply(a, 1, sort)) 1.7699799 1.865829 1.961853 1.961678 2.057790 2.153902 3 sorted2 <- t(apply(a, 1, sort.int, method = "quick")) 1.6162934 1.619922 1.694914 1.623551 1.734224 1.844898 3 sorted3 <- t(apply(a, 1, grr::sort2)) 0.9316073 1.003978 1.050569 1.076348 1.110049 1.143750 3 

The difference becomes dramatic when the matrix contains characters:

 > set.seed(1) > a <- matrix(sample(letters,size = 9e6,replace = TRUE),ncol=300) > microbenchmark::microbenchmark(sorted <- t(apply(a,1,sort)) + ,sorted2 <- t(apply(a,1,sort.int, method='quick')) + ,sorted3 <- t(apply(a,1,grr::sort2)),times=3) Unit: seconds expr min lq mean median uq max neval sorted <- t(apply(a, 1, sort)) 15.436045 15.479742 15.552009 15.523440 15.609991 15.69654 3 sorted2 <- t(apply(a, 1, sort.int, method = "quick")) 15.099618 15.340577 15.447823 15.581536 15.621925 15.66231 3 sorted3 <- t(apply(a, 1, grr::sort2)) 1.728663 1.733756 1.780737 1.738848 1.806774 1.87470 3 

The results are identical for all three.

 > identical(sorted,sorted2,sorted3) [1] TRUE 
+3
source

Source: https://habr.com/ru/post/909681/


All Articles