I am new to Spark, SparkR, and HDFS-related technologies in general. I recently installed Spark 1.5.0 and ran the following simple SparkR code:
# point SparkR at the local Spark installation
Sys.setenv(SPARK_HOME = "/private/tmp/spark-1.5.0-bin-hadoop2.6")
.libPaths("/private/tmp/spark-1.5.0-bin-hadoop2.6/R/lib")
require('SparkR')
require('data.table')

sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
hiveContext <- sparkRHive.init(sc)

n = 1000
x = data.table(id = 1:n, val = rnorm(n))

# time the conversion of the local data.table to a Spark DataFrame
Sys.time()
xs <- createDataFrame(sqlContext, x)
Sys.time()
The code executes immediately. However, when I change it to n = 1000000, it takes about 4 minutes (the time between the two Sys.time() calls). When I inspect these jobs in the Spark UI on port 4040, the task for n = 1000 shows a duration of 0.2 s and the task for n = 1000000 a duration of 0.3 s. Am I doing something wrong?
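For reference, the elapsed time can also be isolated to the createDataFrame() call alone with base R's system.time(), which should show whether the 4 minutes are spent in that call or elsewhere (a minimal sketch reusing the variables defined above):

# measure wall-clock time of just the conversion step;
# "elapsed" is the total time the call took
timing <- system.time(
  xs <- createDataFrame(sqlContext, x)
)
print(timing)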