I am running an AWS EMR cluster to run Spark jobs. To work with S3 buckets, the Hadoop configuration is set with the access key, secret key, enableServerSideEncryption and the algorithm to be used for encryption. See the code below:
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "xxx")
hadoopConf.set("fs.s3.awsSecretAccessKey", "xxx")
hadoopConf.set("fs.s3.enableServerSideEncryption", "true")
hadoopConf.set("fs.s3.serverSideEncryptionAlgorithm", "AES256")
With the above configuration, the Spark program can read from the S3 bucket and do its processing, but it fails when it tries to save the results to an S3 bucket that enforces encryption of the data. If the bucket allows unencrypted data, the results are saved successfully, but in unencrypted form.
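For illustration, a minimal sketch of the kind of job that shows this behaviour; the bucket paths and the word-count step are placeholders, not the actual job:

// Reading from the encrypted bucket works with the configuration above.
val input = sc.textFile("s3://encrypted-bucket/input/")

val counts = input
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Writing back fails when the bucket policy requires server-side encryption;
// against a bucket that allows unencrypted puts, the objects land unencrypted.
counts.saveAsTextFile("s3://encrypted-bucket/output/")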
This happens even if the cluster is created with the option that enables server-side encryption: --emrfs Encryption=ServerSide,Args=[fs.s3.serverSideEncryptionAlgorithm=AES256].
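For reference, a sketch of the kind of create-cluster call this refers to; the cluster name, release label, instance type and count are illustrative placeholders, only the --emrfs argument is taken from the actual setup:

aws emr create-cluster \
  --name "spark-sse-test" \
  --release-label emr-4.7.2 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --emrfs Encryption=ServerSide,Args=[fs.s3.serverSideEncryptionAlgorithm=AES256]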
hadoop distcp from HDFS on EMR to S3 also fails. But s3-dist-cp (the AWS version of hadoop distcp) works successfully when run with the --s3ServerSideEncryption option.
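Roughly what those two copies look like; the HDFS and S3 paths are placeholders:

# Plain distcp to the encrypted bucket fails:
hadoop distcp hdfs:///user/hadoop/output s3://encrypted-bucket/output

# s3-dist-cp with server-side encryption enabled succeeds:
s3-dist-cp --src hdfs:///user/hadoop/output --dest s3://encrypted-bucket/output --s3ServerSideEncryption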
However, the EC2 instance role does have the permissions needed to upload data to the same bucket with server-side encryption, without using any user access keys. See the sample command below. If --sse is omitted from that command, it returns an "Access Denied" error.
aws s3 cp test.txt s3://encrypted-bucket/ --sse
It would be helpful if someone could point out the Spark/Hadoop configuration needed to save data to AWS S3 using server-side encryption.