How to upload a CSV file to SparkR in RStudio?

How do you upload a CSV file to SparkR in RStudio? Below are the steps I follow to run SparkR in RStudio. I used read.df to read the .csv, but I'm not sure I wrote it correctly, and I'm not sure whether this step counts as creating an RDD.

    # Set sys environment variables
    Sys.setenv(SPARK_HOME = "C:/Users/Desktop/spark/spark-1.4.1-bin-hadoop2.6")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    # Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')

    # Load libraries
    library(SparkR)
    library(magrittr)

    sc <- sparkR.init(master = "local")
    sc <- sparkR.init()
    sc <- sparkR.init(sparkPackages = "com.databricks:spark-csv_2.11:1.0.3")
    sqlContext <- sparkRSQL.init(sc)

    data <- read.df(sqlContext, "C:/Users/Desktop/DataSets/hello_world.csv",
                    "com.databricks.spark.csv", header = "true")

I get an error:

 Error in writeJobj(con, object) : invalid jobj 1 
3 answers

Spark 2.0.0+:

You can use the built-in csv data source:

 loadDF(sqlContext, path="some_path", source="csv", header="true") 

without loading spark-csv.
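For reference, in Spark 2.x the SparkR read.df function also accepts the path and source directly, without a sqlContext argument (a minimal sketch; "some_path" is a placeholder for your file path):

    df <- read.df("some_path", source = "csv", header = "true", inferSchema = "true")
    head(df)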

Original answer:

As far as I can tell, you are using the wrong version of spark-csv. Pre-built versions of Spark use Scala 2.10, but you are loading spark-csv built for Scala 2.11. Try this instead:

 sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.10:1.2.0") 

I successfully solved this problem by providing commons-csv-1.2.jar along with the spark-csv package.

spark-csv seems to depend on commons-csv, but it is not bundled with it.

Using the following SPARKR_SUBMIT_ARGS solved the problem (note that I use --jars rather than --packages):

 Sys.setenv('SPARKR_SUBMIT_ARGS'='"--jars" "/usr/lib/spark-1.5.1-bin-hadoop2.6/lib/spark-csv_2.11-1.2.0.jar,/usr/lib/spark-1.5.1-bin-hadoop2.6/lib/commons-csv-1.2.jar" "sparkr-shell"') 

In fact, the rather obscure error

 Error in writeJobj(con, object) : invalid jobj 1 

becomes much clearer when you run SparkR from the R shell directly rather than from RStudio, which explicitly reports

 java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat 

The required commons-csv jar can be downloaded here: https://commons.apache.org/proper/commons-csv/download_csv.cgi


I appreciate all the input and solutions! I figured out another way to load a CSV file into SparkR in RStudio. Here it is:

    # Set up contexts
    sc <- sparkR.init(master = "local")
    sqlContext <- sparkRSQL.init(sc)

    # Load the .csv with base R, then convert to a Spark DataFrame
    patients <- read.csv("C:/...")  # insert your .csv file path
    df <- createDataFrame(sqlContext, patients)
    df
    head(df)
    str(df)

Source: https://habr.com/ru/post/1232648/
