SparkR 2.0 dapply is very slow

I just started testing SparkR 2.0 and found that dapply execution is very slow.

For example, the following code

set.seed(2)
random_DF <- data.frame(matrix(rnorm(1000000), 100000, 10))
system.time(dummy_res <- random_DF[random_DF[, 1] > 1, ])

   user  system elapsed 
  0.005   0.000   0.006 

runs in 6 ms

Now, if I create a Spark DataFrame with 4 partitions and run it on 4 local cores, I get:

sparkR.session(master = "local[4]")

random_DF_Spark <- repartition(createDataFrame(random_DF), 4)

subset_DF_Spark <- dapply(
    random_DF_Spark,
    function(x) {
        # x is an ordinary R data.frame holding one partition
        y <- x[x[, 1] > 1, ]
        y
    },
    schema(random_DF_Spark))

system.time(dummy_res_Spark <- collect(subset_DF_Spark))

user  system elapsed 
2.003   0.119  62.919 

That is about a minute, which is abnormally slow. Am I missing something?
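(For reference, the same per-partition subset can also be written with dapplyCollect, which applies the function and collects the result in one step without needing a schema. This is just a sketch against the same random_DF_Spark as above; I have not benchmarked it, and I would not expect it to change the timing.)

subset_local <- dapplyCollect(
    random_DF_Spark,
    function(x) {
        # same filter; the combined result comes back as a local data.frame
        x[x[, 1] > 1, ]
    })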

I also get a warning: TaskSetManager: Stage 64 contains a task with a very large size (16411 KB). The maximum recommended size of a task is 100 KB. Why is this 100 KB limit so low?
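For scale, the local data frame alone is of roughly that order of magnitude once serialized, which may be where the large tasks come from (a rough check only; I am assuming, without having verified it, that data created locally and passed to createDataFrame ends up serialized into the tasks):

format(object.size(random_DF), units = "Mb")    # in-memory size of the local data frame
length(serialize(random_DF, NULL)) / 1024       # serialized size in KB, for comparison with the 16411 KB in the warning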

I am using R 3.3.0 on Mac OS 10.10.5

Any insight is appreciated!
