Parsing files from Amazon S3 with Apache Spark

Question

I use Apache Spark and I have to parse files from Amazon S3. How do I know the file extension when I extract files from an Amazon S3 path?

java amazon-s3 amazon-web-services apache-spark

Vpn_talent Apr 27 '17 at 9:49

1 answer

freedev · Accepted Answer · 2017-04-27T10:20:17+0000

I suggest following the Cloudera tutorial "Accessing data stored in Amazon S3 through Spark":

To access data stored in Amazon S3 from Spark applications, you can use the Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) to read and write RDDs, providing URLs of the form s3a://bucket_name/path/to/file.txt.
You can read and write Spark SQL DataFrames using the Data Source API.

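As for the file extension, you can take it from the file name in the path (e.g. file.txt).

If the object names stored in S3 have no extension, you can still find out the content type from the metadata that S3 keeps for each object, which a HEAD request on the object returns: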

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectHEAD.html
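
For the simple case where the extension is part of the object name, here is a minimal sketch in plain Java that pulls it out of a URL such as s3a://bucket_name/path/to/file.txt. The class name S3Extension and the method extensionOf are hypothetical names chosen for illustration; they are not part of Spark, Hadoop, or the AWS SDK, and the sketch assumes the URL has the scheme://bucket/key form shown above.

```java
import java.util.Locale;
import java.util.Optional;

public class S3Extension {

    // Hypothetical helper (not a Spark or AWS API): returns the lowercased
    // extension of the last path segment, without the dot, if there is one.
    static Optional<String> extensionOf(String s3Url) {
        int schemeEnd = s3Url.indexOf("://");
        // First '/' after the bucket name marks the start of the object key.
        int keyStart = s3Url.indexOf('/', schemeEnd < 0 ? 0 : schemeEnd + 3);
        if (keyStart < 0) {
            return Optional.empty(); // bucket only, no object key
        }
        String name = s3Url.substring(s3Url.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        // No dot, a leading dot (".hidden"), or a trailing dot: no extension.
        if (dot <= 0 || dot == name.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // prints "txt"
        System.out.println(
            extensionOf("s3a://bucket_name/path/to/file.txt").orElse("(none)"));
    }
}
```

An empty result is the signal to fall back to the object metadata: in that case the Content-Type returned by the HEAD request described above is the remaining source of information about the file type.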
