Saving RDD on Amazon S3

I have a large text file containing JSON objects on Amazon S3. I plan to process this data using Spark on Amazon EMR.

Here are my questions:

  • How do I load a text file containing JSON objects into Spark?
  • Is it possible to save the internal RDD representation of this data to S3 after the EMR cluster is terminated?
  • If I can save the RDD, can I load the data back in RDD form the next time I need to analyze the same dataset?
1 answer

This should cover #1, if you are using pyspark:

#Configure spark with your S3 access keys
#(sc._jsc reaches into the JVM-side Hadoop configuration)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "MY-ACCESS-KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "MY-SECRET-ACCESS-KEY")

#Retrieve the data
my_data = sc.textFile("s3n://my-bucket-name/my-key")
my_data.count() #Count all rows
my_data.take(20) #Take the first 20 rows

#Parse it, keeping the parsed RDD so we can save it later
import json
my_json = my_data.map(lambda x: json.loads(x))
my_json.take(20) #Take the first 20 rows of json-parsed content
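
Note that textFile yields one record per line, so json.loads above assumes one JSON object per line. If the input may contain malformed lines, a sketch like the following (parse_line is a hypothetical helper, not part of the original answer) drops them instead of failing the whole job:

#Tolerant variant: skip lines that are not valid JSON
def parse_line(line):
    try:
        return [json.loads(line)]
    except ValueError: #JSONDecodeError is a subclass of ValueError
        return []

my_json_safe = my_data.flatMap(parse_line)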

Note that the URI scheme used here is s3n://, not s3://; which prefix you need depends on your Hadoop setup.
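
If your Hadoop build ships the newer s3a:// connector (Hadoop 2.6 and later), the same idea applies; a minimal sketch, assuming the standard fs.s3a.* configuration keys are available:

#Equivalent setup for the newer s3a:// connector
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "MY-ACCESS-KEY")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "MY-SECRET-ACCESS-KEY")
my_data = sc.textFile("s3a://my-bucket-name/my-key") #Same read, different scheme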

Also, my-key can actually point to a whole "directory" on S3*, and the path can contain wildcards, so you can load several files at once.
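
For example, a sketch (my-prefix is a hypothetical key prefix):

#Load everything under a key prefix (an S3 "directory")
all_data = sc.textFile("s3n://my-bucket-name/my-prefix/")
#Or match specific files with a wildcard
part_data = sc.textFile("s3n://my-bucket-name/my-prefix/part-*")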

For #2 and #3, you can save the RDD straight back to S3 with saveAsTextFile:

#Serialize each record back to a JSON string and write the RDD to S3
my_json.map(lambda x: json.dumps(x)).saveAsTextFile('s3://my-bucket-name/my-new-key')

Note, however, that Spark will write the RDD to S3 as a set of part files under that key, not as a single object.
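
To make #3 concrete, the saved part files can be reloaded and re-parsed the same way the original file was; a minimal sketch of the round trip, reusing my-new-key from above:

#textFile picks up all the part files written under the key
reloaded = sc.textFile("s3://my-bucket-name/my-new-key")
reloaded.map(lambda x: json.loads(x)).take(20) #Back to the parsed view

If you want to avoid the JSON round trip entirely, PySpark also offers RDD.saveAsPickleFile and sc.pickleFile, which write and reload the RDD's Python objects directly; the same part-file behavior applies there as well.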

* S3 does not really have directories, but key names can contain slashes, which makes them behave like directory paths.

