Apache Spark MLlib with the DataFrame API throws java.net.URISyntaxException when calling createDataFrame() or read().csv(...)

Question

In a standalone application (Java 8 on Windows 10, with the spark-xxx_2.11:2.0.0 artifacts as jar dependencies), either of the following calls throws an exception:

/* this: */
Dataset<Row> logData = spark_session.createDataFrame(Arrays.asList(
    new LabeledPoint(1.0, Vectors.dense(4.9,3,1.4,0.2)),
    new LabeledPoint(1.0, Vectors.dense(4.7,3.2,1.3,0.2))
  ), LabeledPoint.class);

/* or this: */
/* logFile: "C:\files\project\file.csv", "C:\\files\\project\\file.csv",
            "C:/files/project/file.csv", "file:/C:/files/project/file.csv",
            "file:///C:/files/project/file.csv", "/file.csv" */
Dataset<Row> logData = spark_session.read().csv(logFile);

The exception:

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/files/project/spark-warehouse
               at org.apache.hadoop.fs.Path.initialize(Path.java:206)
               at org.apache.hadoop.fs.Path.<init>(Path.java:172)
               at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
               at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
               at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
               at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
               at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
               at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
               at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
               at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
               at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
               at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
               at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
               at <call in my line of code>

How can I load a CSV file into a Dataset<Row> from Java code?

java apache-spark apache-spark-sql apache-spark-mllib apache-spark-ml

Roman roshchin Aug 3 '16 at 14:23

1 answer

abaghel · Accepted Answer · 2016-08-04T05:13:56+0000

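The problem is the default warehouse location: on Windows, Spark 2.0.0 resolves it to file:C:/files/project/spark-warehouse (see the stack trace in the question), which is not a valid URI. See the Jira issue https://issues.apache.org/jira/browse/SPARK-15899 . As a workaround, set "spark.sql.warehouse.dir" explicitly when building the SparkSession, as shown below.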

SparkSession spark = SparkSession
  .builder()
  .appName("JavaALSExample")
  .config("spark.sql.warehouse.dir", "/file:C:/temp")
  .getOrCreate();
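
The message in the stack trace can be reproduced without Spark, which may help when checking a candidate value for "spark.sql.warehouse.dir". The sketch below assumes that Hadoop's Path builds a java.net.URI from separate scheme and path components, which is consistent with the trace but not confirmed here. The class name WarehouseUriCheck and the helper toFileUri are hypothetical and exist only for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class WarehouseUriCheck {

    /**
     * Builds a URI from a scheme and a path using the multi-argument
     * java.net.URI constructor. Returns the URI as a string, or the
     * exception message prefixed with "ERROR: " if construction fails.
     */
    static String tryBuild(String scheme, String path) {
        try {
            return new URI(scheme, null, path, null, null).toString();
        } catch (URISyntaxException e) {
            return "ERROR: " + e.getMessage();
        }
    }

    /**
     * Hypothetical helper (not part of Spark or Hadoop): turns a Windows
     * path such as C:\temp into a file URI string with an absolute path.
     */
    static String toFileUri(String windowsPath) {
        String forward = windowsPath.replace('\\', '/');
        return forward.startsWith("/") ? "file://" + forward : "file:///" + forward;
    }

    public static void main(String[] args) {
        // A scheme combined with a path that does not start with '/' is rejected.
        System.out.println(tryBuild("file", "C:/files/project/spark-warehouse"));

        // The same path with a leading '/' is accepted.
        System.out.println(tryBuild("file", "/C:/files/project/spark-warehouse"));

        // String form that could be tried as the warehouse directory.
        System.out.println(toFileUri("C:\\temp"));
    }
}
```

The first call fails because java.net.URI refuses a path without a leading slash once a scheme is present. That is the "Relative path in absolute URI" condition from the question. Whether a value produced by toFileUri resolves the issue on a given Spark version is untested here.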
