I am running an AWS EMR cluster to run Spark jobs. To work with S3 buckets, the Hadoop configuration is set with the access key, secret key, enableServerSideEncryption and the algorithm to be used for encryption. See the code below:
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "xxx")
hadoopConf.set("fs.s3.awsSecretAccessKey", "xxx")
hadoopConf.set("fs.s3.enableServerSideEncryption", "true")
hadoopConf.set("fs.s3.serverSideEncryptionAlgorithm", "AES256")
With the above configuration, the Spark program can read from the S3 bucket and do its processing, but it fails when it tries to save the results to an S3 bucket that enforces encryption of the data. If the bucket allows unencrypted data, the results are saved successfully, but in unencrypted form.
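For illustration, a minimal sketch of the kind of job that shows this behaviour; the bucket paths and the word-count step are placeholders, not the actual job:

// Reading from the encrypted bucket works with the configuration above.
val input = sc.textFile("s3://encrypted-bucket/input/")

val counts = input
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Writing back fails when the bucket policy requires server-side encryption;
// against a bucket that allows unencrypted puts, the objects land unencrypted.
counts.saveAsTextFile("s3://encrypted-bucket/output/")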
This happens even if the cluster is created with the option that enables server-side encryption: --emrfs Encryption=ServerSide,Args=[fs.s3.serverSideEncryptionAlgorithm=AES256].
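For reference, a sketch of the kind of create-cluster call this refers to; the cluster name, release label, instance type and count are illustrative placeholders, only the --emrfs argument is taken from the actual setup:

aws emr create-cluster \
  --name "spark-sse-test" \
  --release-label emr-4.7.2 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --emrfs Encryption=ServerSide,Args=[fs.s3.serverSideEncryptionAlgorithm=AES256]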
hadoop distcp from HDFS on EMR to S3 also fails. But s3-dist-cp (the AWS version of hadoop distcp) works successfully when run with the --s3ServerSideEncryption option.
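Roughly what those two copies look like; the HDFS and S3 paths are placeholders:

# Plain distcp to the encrypted bucket fails:
hadoop distcp hdfs:///user/hadoop/output s3://encrypted-bucket/output

# s3-dist-cp with server-side encryption enabled succeeds:
s3-dist-cp --src hdfs:///user/hadoop/output --dest s3://encrypted-bucket/output --s3ServerSideEncryption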
However, the EC2 instance role does have the permissions needed to upload data to the same bucket with server-side encryption, without using any user access keys. See the sample command below. If --sse is omitted from that command, it returns an "Access Denied" error.
aws s3 cp test.txt s3://encrypted-bucket/ --sse
It would be helpful if someone could point out the Spark/Hadoop configuration needed to save data to AWS S3 using server-side encryption.