Why can't I read from AWS S3 in my Spark app?

I updated to Apache Spark 1.5.1, but I'm not sure whether that caused this. I have a spark-submit job that has always worked before.

     Exception in thread "main" java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V

The code that triggers it:

     SQLContext sqlContext = new SQLContext(sc);

     DataFrame df = sqlContext.read()
         .format("com.databricks.spark.csv")
         .option("inferSchema", "true")
         .load("s3n://ossem-replication/gdelt_data/event_data/" + args[0]);

     df.write()
         .format("com.databricks.spark.csv")
         .save("/user/spark/ossem_data/gdelt/" + args[0]);

The full stack trace is below. A NoSuchMethodError means that a class on the classpath does not contain a method the code was compiled against, so the dependencies must be incompatible: the jets3t version being loaded apparently has no RestS3Service constructor with the signature (Lorg/jets3t/service/security/AWSCredentials;)V. Can someone explain this to me?

     Exception in thread "main" java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V
         at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:60)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.lang.reflect.Method.invoke(Method.java:497)
         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
         at org.apache.hadoop.fs.s3native.$Proxy24.initialize(Unknown Source)
         at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:272)
         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
         at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
         at scala.Option.getOrElse(Option.scala:120)
         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
         at scala.Option.getOrElse(Option.scala:120)
         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
         at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1277)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
         at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
         at org.apache.spark.rdd.RDD.take(RDD.scala:1272)
         at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1312)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
         at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
         at org.apache.spark.rdd.RDD.first(RDD.scala:1311)
         at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:101)
         at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:99)
         at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:82)
         at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:42)
         at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:74)
         at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:39)
         at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:27)
         at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
         at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
         at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
         at com.bah.ossem.spark.GdeltSpark.main(GdeltSpark.java:20)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.lang.reflect.Method.invoke(Method.java:497)
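One quick way to confirm which jets3t jar actually wins on the classpath, and whether it exposes that constructor, is to load the class reflectively and print its origin and its constructors. This is only a minimal sketch (the object name WhichJets3t is arbitrary), and it assumes it is run with the same classpath/assembly as the failing spark-submit job:

     // Minimal classpath check: report where RestS3Service was loaded from
     // and which constructors it actually provides.
     object WhichJets3t {
       def main(args: Array[String]): Unit = {
         val cls = Class.forName("org.jets3t.service.impl.rest.httpclient.RestS3Service")
         // The code source points at the jar that was actually loaded.
         println(cls.getProtectionDomain.getCodeSource.getLocation)
         // If no constructor taking an AWSCredentials argument shows up here,
         // the jets3t version on the classpath does not match what this Hadoop build expects.
         cls.getConstructors.foreach(println)
       }
     }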
1 answer

I had the same problem, but with Spark 1.6 and Scala instead of Java. The error is caused by a version mismatch: Spark Core depends on hadoop-client 2.2, while the Spark 1.6 cluster installation I was using ran on Hadoop 2.6. I had to make the following changes to get it working.

  • Change the hadoop-client dependency to version 2.6 (the Hadoop version I was using):

     "org.apache.hadoop" % "hadoop-client" % "2.6.0", 
  • Include the hadoop-aws library in my fat Spark jar, since this dependency is no longer included in the Hadoop libraries bundled with Spark 1.6 (a consolidated build.sbt sketch follows this list):

     "org.apache.hadoop" % "hadoop-aws" % "2.6.0", 
  • Export the AWS access key and secret key as environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as read by the configuration below).

  • Set the following Hadoop configuration on the SparkContext:

     val sparkContext = new SparkContext(sparkConf)
     val hadoopConf = sparkContext.hadoopConfiguration
     hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
     hadoopConf.set("fs.s3.awsAccessKeyId", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
     hadoopConf.set("fs.s3.awsSecretAccessKey", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
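Putting the first two bullets together, the dependency section of the sbt build ends up looking roughly like the sketch below. The exact Spark, spark-csv and Scala versions here are placeholders for whatever your cluster runs, and a plugin such as sbt-assembly is assumed for actually producing the fat jar:

     // build.sbt (sketch) -- the version numbers are examples; match them to your cluster.
     scalaVersion := "2.10.6"

     libraryDependencies ++= Seq(
       // Spark itself is provided by the cluster, so it stays out of the fat jar.
       "org.apache.spark" %% "spark-core" % "1.6.0" % "provided",
       "org.apache.spark" %% "spark-sql"  % "1.6.0" % "provided",
       // Align the Hadoop client with the cluster's Hadoop version (2.6 in this case) ...
       "org.apache.hadoop" % "hadoop-client" % "2.6.0",
       // ... and pull in hadoop-aws explicitly, since it is not bundled any more.
       "org.apache.hadoop" % "hadoop-aws" % "2.6.0",
       // The CSV data source used by the job.
       "com.databricks" %% "spark-csv" % "1.3.0"
     )

With the keys exported as in the third bullet, the sys.env lookups in the Hadoop configuration above pick them up at runtime, so no credentials need to be hard-coded in the job.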


