I am using Spark 2.2.0.
I use Spark to process datasets from S3. Everything worked fine until I tried to use wildcards to read data from the subfolders of the folder test.
val path = "s3://data/test"
val spark = SparkSession
  .builder()
  .appName("Test")
  .config("spark.sql.warehouse.dir", path)
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._
val myData = spark.read.parquet(path + "/*/")
I get the following error:
17/11/20 18:54:21 ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://ip-111-112-11-65.eu-west-1.compute.internal:8020/user/hdfs/s3/data/test/20171120/*;
I am executing the above code with the following command:
spark-submit
I don't understand why Spark is trying to read from HDFS instead of from the S3 path I provided. The same piece of code works fine with a different path, for example s3://data/test2/mytest.parquet.
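One detail that may be relevant to the HDFS host showing up in the error: Hadoop resolves any path string that carries no filesystem scheme against fs.defaultFS, which on an EMR-style cluster is HDFS. A minimal sketch of that distinction using only java.net.URI (the hypothetical paths below are for illustration):

```scala
import java.net.URI

object SchemeCheck {
  def main(args: Array[String]): Unit = {
    // A fully qualified path keeps its filesystem scheme.
    val qualified = new URI("s3://data/test/20171120")
    // A bare path has no scheme, so Hadoop would fall back to fs.defaultFS (HDFS here).
    val bare = new URI("data/test/20171120")
    println(s"qualified scheme: ${qualified.getScheme}") // prints "s3"
    println(s"bare scheme: ${bare.getScheme}")           // prints "null"
  }
}
```

This does not explain where the scheme is being dropped in my job, but it matches what the error shows: the path Spark ends up with is resolved under hdfs://...:8020/user/hdfs/ as if it had no s3:// prefix.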