Saving RDD on Amazon S3

I have a large text file containing JSON objects on Amazon S3. I plan to process this data using Spark on Amazon EMR.

Here are my questions:

  • How do I load a text file containing JSON objects into Spark?
  • Is it possible to save the internal RDD representation of this data to S3 after the EMR cluster is terminated?
  • If I can save the RDD, can I load the data back in RDD form the next time I need to analyze the same dataset?
1 answer

This should cover #1, if you are using pyspark:

#Configure spark with your S3 access keys
#(sc._jsc reaches into the JVM-side Hadoop configuration)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "MY-ACCESS-KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "MY-SECRET-ACCESS-KEY")

#Retrieve the data
my_data = sc.textFile("s3n://my-bucket-name/my-key")
my_data.count() #Count all rows
my_data.take(20) #Take the first 20 rows

#Parse it, keeping the parsed RDD so we can save it later
import json
my_json = my_data.map(lambda x: json.loads(x))
my_json.take(20) #Take the first 20 rows of json-parsed content
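
Note that textFile yields one record per line, so json.loads above assumes one JSON object per line. If the input may contain malformed lines, a sketch like the following (parse_line is a hypothetical helper, not part of the original answer) drops them instead of failing the whole job:

#Tolerant variant: skip lines that are not valid JSON
def parse_line(line):
    try:
        return [json.loads(line)]
    except ValueError: #JSONDecodeError is a subclass of ValueError
        return []

my_json_safe = my_data.flatMap(parse_line)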

Note that the URI scheme used here is s3n://, not s3://; which prefix you need depends on your Hadoop setup.
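
If your Hadoop build ships the newer s3a:// connector (Hadoop 2.6 and later), the same idea applies; a minimal sketch, assuming the standard fs.s3a.* configuration keys are available:

#Equivalent setup for the newer s3a:// connector
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "MY-ACCESS-KEY")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "MY-SECRET-ACCESS-KEY")
my_data = sc.textFile("s3a://my-bucket-name/my-key") #Same read, different scheme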

Also, my-key can actually point to a whole "directory" on S3*, and the path can contain wildcards, so you can load several files at once.
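
For example, a sketch (my-prefix is a hypothetical key prefix):

#Load everything under a key prefix (an S3 "directory")
all_data = sc.textFile("s3n://my-bucket-name/my-prefix/")
#Or match specific files with a wildcard
part_data = sc.textFile("s3n://my-bucket-name/my-prefix/part-*")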

For #2 and #3, you can save the RDD straight back to S3 with saveAsTextFile:

#Serialize each record back to a JSON string and write the RDD to S3
my_json.map(lambda x: json.dumps(x)).saveAsTextFile('s3://my-bucket-name/my-new-key')

Note, however, that Spark will write the RDD to S3 as a set of part files under that key, not as a single object.
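
To make #3 concrete, the saved part files can be reloaded and re-parsed the same way the original file was; a minimal sketch of the round trip, reusing my-new-key from above:

#textFile picks up all the part files written under the key
reloaded = sc.textFile("s3://my-bucket-name/my-new-key")
reloaded.map(lambda x: json.loads(x)).take(20) #Back to the parsed view

If you want to avoid the JSON round trip entirely, PySpark also offers RDD.saveAsPickleFile and sc.pickleFile, which write and reload the RDD's Python objects directly; the same part-file behavior applies there as well.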

* S3 does not really have directories, but key names can contain slashes, which makes them behave like directory paths.

