Parsing files from Amazon S3 with Apache Spark

I use Apache Spark and need to parse files stored in Amazon S3. How can I determine a file's extension when reading files from an Amazon S3 path?

1 answer

I suggest following the Cloudera tutorial "Accessing Data Stored in Amazon S3 through Spark":

To access data stored in Amazon S3 from Spark applications, you can use the Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) to read and write RDDs, providing URLs of the form s3a://bucket_name/path/to/file.txt.
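The URL form described above can be sketched in PySpark as follows. This is only an illustration: `s3a_url` and `read_lines` are hypothetical helper names, `sc` is assumed to be a SparkContext already configured with the S3A connector and AWS credentials, and `textFile` is used as a simpler stand-in for the Hadoop file APIs named in the quote.

```python
def s3a_url(bucket: str, key: str) -> str:
    """Build an s3a:// URL from a bucket name and an object key."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def read_lines(sc, bucket: str, key: str):
    """Return an RDD of text lines for a single S3 object.

    `sc` is assumed to be a pyspark.SparkContext with S3A access set up;
    it is passed in so this sketch does not create a Spark session itself.
    """
    return sc.textFile(s3a_url(bucket, key))


# Usage (requires a running Spark application with S3A configured):
#   lines = read_lines(sc, "bucket_name", "path/to/file.txt")
#   print(lines.count())
```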

You can read and write Spark SQL DataFrames using the Data Source API.
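For DataFrames, a rough sketch of the same idea through the Data Source API might look like this. The helper names are hypothetical, `spark` is assumed to be an existing SparkSession with S3A access, and the format name (for example "csv", "json" or "parquet") has to be chosen by the caller.

```python
def read_dataframe(spark, url: str, fmt: str, **options):
    """Load a DataFrame from `url` using the data source named `fmt`.

    `spark` is assumed to be a pyspark.sql.SparkSession; `options` are
    passed through to the reader (for example header="true" for CSV).
    """
    return spark.read.format(fmt).options(**options).load(url)


def write_dataframe(df, url: str, fmt: str, mode: str = "errorifexists"):
    """Save a DataFrame to `url` using the data source named `fmt`."""
    df.write.format(fmt).mode(mode).save(url)


# Usage (requires a running Spark application with S3A configured):
#   df = read_dataframe(spark, "s3a://bucket_name/path/to/file.csv", "csv", header="true")
#   write_dataframe(df, "s3a://bucket_name/path/to/output", "parquet", mode="overwrite")
```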

As for the file extension, you can simply take it from the file name (e.g. file.txt).
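Taking the extension from the object name needs only the standard library. A minimal sketch, where `extension_of` is a hypothetical helper name and the URL is the example path from the quote:

```python
import posixpath
from urllib.parse import urlparse


def extension_of(url: str) -> str:
    """Return the lower-cased extension of the object named by `url`.

    The leading dot is dropped; an empty string means the name has no
    extension. Keys containing '?' or '#' are not handled by this sketch,
    because urlparse treats them as query or fragment separators.
    """
    key = urlparse(url).path
    return posixpath.splitext(key)[1].lstrip(".").lower()


print(extension_of("s3a://bucket_name/path/to/file.txt"))
```

If the files are read with `sc.wholeTextFiles`, which yields (path, content) pairs, the same helper could be applied to each path to decide how to parse the content.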

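If the files stored in S3 have no extension in their names, you can still determine the content type from the metadata that S3 keeps for each object: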

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectHEAD.html
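The HEAD Object request documented at the link above returns an object's metadata without downloading its body. A hedged sketch using boto3's `head_object` call: `content_type_of` is a hypothetical helper name, and the client is passed in so the function itself does not depend on how credentials are configured.

```python
def content_type_of(s3_client, bucket: str, key: str) -> str:
    """Return the Content-Type recorded for an S3 object, or "" if absent.

    `s3_client` is assumed to be a boto3 S3 client (boto3.client("s3")).
    """
    response = s3_client.head_object(Bucket=bucket, Key=key)
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    return response.get("ContentType", "").split(";")[0].strip().lower()


# Usage (requires boto3 and valid AWS credentials):
#   import boto3
#   print(content_type_of(boto3.client("s3"), "bucket_name", "path/to/file"))
```

The Content-Type is whatever was set when the object was uploaded, so it is only as reliable as the uploader.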

Source: https://habr.com/ru/post/1683961/

