I am using Spark 2.2.0.
I use Spark to process datasets from S3. Everything worked fine until I tried to use wildcards to read data from the subfolders of the folder test.
val path = "s3://data/test"
val spark = SparkSession
  .builder()
  .appName("Test")
  .config("spark.sql.warehouse.dir", path)
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._
val myData = spark.read.parquet(path + "/*/")
I get the following error:
17/11/20 18:54:21 ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://ip-111-112-11-65.eu-west-1.compute.internal:8020/user/hdfs/s3/data/test/20171120/*;
I am executing the above code with the following command:
spark-submit
I don't understand why Spark is trying to read from HDFS instead of from the S3 path I provided. The same piece of code works fine with a different path, for example s3://data/test2/mytest.parquet.
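One detail that may be relevant to the HDFS host showing up in the error: Hadoop resolves any path string that carries no filesystem scheme against fs.defaultFS, which on an EMR-style cluster is HDFS. A minimal sketch of that distinction using only java.net.URI (the hypothetical paths below are for illustration):

```scala
import java.net.URI

object SchemeCheck {
  def main(args: Array[String]): Unit = {
    // A fully qualified path keeps its filesystem scheme.
    val qualified = new URI("s3://data/test/20171120")
    // A bare path has no scheme, so Hadoop would fall back to fs.defaultFS (HDFS here).
    val bare = new URI("data/test/20171120")
    println(s"qualified scheme: ${qualified.getScheme}") // prints "s3"
    println(s"bare scheme: ${bare.getScheme}")           // prints "null"
  }
}
```

This does not explain where the scheme is being dropped in my job, but it matches what the error shows: the path Spark ends up with is resolved under hdfs://...:8020/user/hdfs/ as if it had no s3:// prefix.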