SparkR bottleneck in createDataFrame?

I am new to Spark, SparkR, and HDFS-related technologies in general. I recently installed Spark 1.5.0 and ran this simple SparkR code:

Sys.setenv(SPARK_HOME="/private/tmp/spark-1.5.0-bin-hadoop2.6")
.libPaths("/private/tmp/spark-1.5.0-bin-hadoop2.6/R/lib")
require('SparkR')
require('data.table')

sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
hiveContext <- sparkRHive.init(sc)

n = 1000
x = data.table(id = 1:n, val = rnorm(n))

Sys.time()
xs <- createDataFrame(sqlContext, x)
Sys.time()

The code executes almost immediately. However, when I change it to n = 1000000, it takes about 4 minutes (the time between the two Sys.time() calls). When I check these tasks in the web UI on port 4040, the task for n = 1000 has a duration of 0.2 s and the task for n = 1000000 of 0.3 s. Am I doing something wrong?

1 answer

You are not doing anything wrong. This is just the combined effect of several factors:

  • createDataFrame, as currently implemented (Spark 1.5.1), is slow. This is a known issue described in SPARK-8277 (see the timing sketch after this list).
  • The current implementation does not work well with data.table.
  • Base R is relatively slow. Smart people say this is a feature, not a bug, but it is still something to consider.

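A minimal timing sketch, reusing n, x, and sqlContext from the question's setup (the microbenchmark call is my illustration, not part of the original answer): the sub-second task visible on port 4040 covers only the final distributed step, while most of the elapsed time is spent converting and serializing the local data on the R side.

     library(data.table)

     n <- 1000000
     x <- data.table(id = 1:n, val = rnorm(n))

     # Time the whole createDataFrame() call; the sub-second task shown in the
     # web UI is only the distributed part, the rest is local R-side work
     # (the conversion that SPARK-8277 tracks).
     microbenchmark::microbenchmark(createDataFrame(sqlContext, x), times = 1)
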
Until SPARK-8277 is resolved, you cannot do much, but there are two options you can try:

  • Use a plain old data.frame instead of data.table (a conversion sketch follows this list). Using a flights dataset (227,496 rows, 14 columns):

     df <- read.csv("flights.csv")

     microbenchmark::microbenchmark(createDataFrame(sqlContext, df), times=3)
     ## Unit: seconds
     ##                              expr      min       lq     mean   median
     ##  createDataFrame(sqlContext, df) 96.41565 97.19515 99.08441 97.97465
     ##        uq      max neval
     ##  100.4188 102.8629     3

    compared to data.table

     dt <- data.table::fread("flights.csv")

     microbenchmark::microbenchmark(createDataFrame(sqlContext, dt), times=3)
     ## Unit: seconds
     ##                              expr      min       lq     mean  median
     ##  createDataFrame(sqlContext, dt) 378.8534 379.4482 381.2061 380.043
     ##        uq     max neval
     ##  382.3825 384.722     3
  • Write the data to disk and use spark-csv to load it directly into a Spark DataFrame, without interacting with R directly. As crazy as it sounds:

     dt <- data.table::fread("flights.csv")

     write_and_read <- function() {
         # Write the local data to a temporary CSV and let spark-csv read that
         # same file back, bypassing createDataFrame entirely
         path <- tempfile(fileext = ".csv")
         write.csv(dt, path, row.names = FALSE)
         read.df(sqlContext, path,
             source = "com.databricks.spark.csv",
             header = "true",
             inferSchema = "true"
         )
     }

     microbenchmark::microbenchmark(write_and_read(), times=3)
     ## Unit: seconds
     ##              expr      min       lq     mean   median
     ##  write_and_read() 2.924142 2.959085 2.983008 2.994027
     ##        uq      max neval
     ##   3.01244 3.030854     3

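If the data already lives in a data.table, a minimal sketch of the first option, reusing the million-row data from the question (the setDF() conversion step is my suggestion, not part of the original answer):

     library(data.table)

     dt <- data.table(id = 1:1000000, val = rnorm(1000000))

     # setDF() turns the data.table into a plain data.frame in place (no copy),
     # so createDataFrame takes the faster data.frame path benchmarked above
     setDF(dt)
     xs <- createDataFrame(sqlContext, dt)

as.data.frame(dt) would work as well, at the cost of copying the data.
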
To be honest, I'm not sure it really makes sense to push data that can be processed in R into Spark, but let's not dwell on that.

Edit

This issue should be resolved with SPARK-11086 in Spark 1.6.0.
