Spark Write to S3 V4 SignatureDoesNotMatch Error

I am getting an S3 SignatureDoesNotMatch error while trying to write a DataFrame to S3 using Spark.

Symptoms / things tried:

  • The code fails sometimes, but works sometimes;
  • The code can read from S3 without any problem, and is able to write to S3 from time to time, which rules out misconfiguration such as S3A / enableV4 / wrong key / region endpoint, etc.;
  • The S3A endpoint was set according to the S3 endpoint documentation;
  • Made sure the AWS secret key does not contain any non-alphanumeric characters, as suggested here;
  • Made sure server time is in sync using NTP;
  • Tested on EC2 m3.xlarge with spark-2.0.2-bin-hadoop2.7 running in local mode;
  • The problem goes away when the files are written to the local filesystem;
  • Right now the workaround is to mount the bucket with s3fs and write there; however, this is not ideal, since s3fs dies quite often under the stress Spark puts on it;

The code can be boiled down to:

 spark-submit \
     --verbose \
     --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
     --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem \
     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
     --packages org.apache.hadoop:hadoop-aws:2.7.3 \
     --driver-java-options '-Dcom.amazonaws.services.s3.enableV4' \
     foobar.py

 # foobar.py
 sc = SparkContext.getOrCreate()
 sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", 'xxx')
 sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", 'xxx')
 sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", 's3.dualstack.ap-southeast-2.amazonaws.com')
 hc = SparkSession.builder.enableHiveSupport().getOrCreate()
 dataframe = hc.read.parquet(in_file_path)
 dataframe.write.csv(
     path=out_file_path,
     mode='overwrite',
     compression='gzip',
     sep=',',
     quote='"',
     escape='\\',
     escapeQuotes='true',
 )

Spark throws a 403 SignatureDoesNotMatch error.


With log4j set to verbose, it turned out that the following is happening:

  • Each partition is written to a staging location on S3 at /_temporary/foorbar.part-xxx ;
  • A PUT call then moves each partition to its final location;
  • After several successful PUT calls, all subsequent PUT calls fail with 403;
  • Since the requests are made by aws-java-sdk, I am not sure what can be done at the application level; the following log is from another run with the same error:

 >> PUT XXX/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet HTTP/1.1
 >> Host: XXX.s3-ap-southeast-2.amazonaws.com
 >> x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
 >> X-Amz-Date: 20161104T005749Z
 >> x-amz-metadata-directive: REPLACE
 >> Connection: close
 >> User-Agent: aws-sdk-java/1.10.11 Linux/3.13.0-100-generic OpenJDK_64-Bit_Server_VM/25.91-b14/1.8.0_91 com.amazonaws.services.s3.transfer.TransferManager/1.10.11
 >> x-amz-server-side-encryption-aws-kms-key-id: 5f88a222-715c-4a46-a64c-9323d2d9418c
 >> x-amz-server-side-encryption: aws:kms
 >> x-amz-copy-source: /XXX/_temporary/0/task_201611040057_0001_m_000025/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet
 >> Accept-Ranges: bytes
 >> Authorization: AWS4-HMAC-SHA256 Credential=AKIAJZCSOJPB5VX2B6NA/20161104/ap-southeast-2/s3/aws4_request, SignedHeaders=accept-ranges;connection;content-length;content-type;etag;host;last-modified;user-agent;x-amz-content-sha256;x-amz-copy-source;x-amz-date;x-amz-metadata-directive;x-amz-server-side-encryption;x-amz-server-side-encryption-aws-kms-key-id, Signature=48e5fe2f9e771dc07a9c98c7fd98972a99b53bfad3b653151f2fcba67cff2f8d
 >> ETag: 31436915380783143f00299ca6c09253
 >> Content-Type: application/octet-stream
 >> Content-Length: 0
 DEBUG wire: << "HTTP/1.1 403 Forbidden[\r][\n]"
 DEBUG wire: << "x-amz-request-id: 849F990DDC1F3684[\r][\n]"
 DEBUG wire: << "x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=[\r][\n]"
 DEBUG wire: << "Content-Type: application/xml[\r][\n]"
 DEBUG wire: << "Transfer-Encoding: chunked[\r][\n]"
 DEBUG wire: << "Date: Fri, 04 Nov 2016 00:57:48 GMT[\r][\n]"
 DEBUG wire: << "Server: AmazonS3[\r][\n]"
 DEBUG wire: << "Connection: close[\r][\n]"
 DEBUG wire: << "[\r][\n]"
 DEBUG DefaultClientConnection: Receiving response: HTTP/1.1 403 Forbidden
 << HTTP/1.1 403 Forbidden
 << x-amz-request-id: 849F990DDC1F3684
 << x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=
 << Content-Type: application/xml
 << Transfer-Encoding: chunked
 << Date: Fri, 04 Nov 2016 00:57:48 GMT
 << Server: AmazonS3
 << Connection: close
 DEBUG requestId: x-amzn-RequestId: not available
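To reproduce the wire-level dump above, turning on debug logging for the AWS SDK and the HTTP client in Spark's log4j.properties should be enough. This is only a sketch; the exact logger names can differ between AWS SDK and httpclient versions:

 log4j.logger.com.amazonaws=DEBUG
 log4j.logger.org.apache.http.wire=DEBUG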
3 answers

I experienced exactly the same problem and found a solution with the help of this article (other resources point in the same direction). After setting the following configuration parameters, writing to S3 was successful:

 spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
 spark.speculation false

I am using Spark 2.1.1 with Hadoop 2.7. My final spark-submit command looks like this:

 spark-submit \
     --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
     --conf spark.hadoop.fs.s3a.endpoint=s3.eu-central-1.amazonaws.com \
     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
     --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
     --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
     --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
     --conf spark.speculation=false \
     ...

In addition, I defined these environment variables:

 AWS_ACCESS_KEY_ID=****
 AWS_SECRET_ACCESS_KEY=****
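For reference, the same Hadoop and committer settings can also be applied programmatically from PySpark instead of on the spark-submit command line. A minimal sketch, with the endpoint as a placeholder (the enableV4 JVM flag still has to be passed via extraJavaOptions before the JVM starts):

 from pyspark.sql import SparkSession

 # Sketch: same settings as the spark-submit flags above, set via the builder.
 # The endpoint value is a placeholder; credentials are expected to come from
 # the environment variables listed above.
 spark = (
     SparkSession.builder
     .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
     .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
     .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
     .config("spark.speculation", "false")
     .getOrCreate()
 )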
  • What does "s3a" die with? I'm curious. If you have stack traces, file them on the Apache JIRA server, HADOOP project, fs/s3 component.
  • s3n does not support the v4 API. It is not an endpoint issue, it is the new signing mechanism. Its jets3t library is not going to be updated except for security reasons, so stop working with it.

One thing Spark hits with S3, irrespective of the driver, is that it is an eventually consistent object store, where renames take O(bytes) to complete, and the delayed consistency between PUT and LIST can break the commit. Put more plainly: Spark assumes that after you write something to a filesystem, an ls of the parent directory will show the thing you just wrote. S3 does not offer that, hence the term "eventual consistency". Now, in HADOOP-13786 we are trying to improve things, and HADOOP-13345 looks at whether we can use Amazon DynamoDB for a faster, consistent view of the world. But you will have to pay the DynamoDB premium for that feature.
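As a toy illustration of the list-after-write assumption described above (not Spark's actual commit code; the bucket and key names are placeholders, and at the time of this thread S3 listings were only eventually consistent):

 import boto3

 # Write an object, then immediately list its prefix: a rename-based commit
 # assumes the new key is always visible here, which eventually consistent
 # S3 listings did not guarantee.
 s3 = boto3.client("s3")
 bucket = "my-example-bucket"  # placeholder

 s3.put_object(Bucket=bucket, Key="_temporary/0/task_0/part-00000", Body=b"data")
 listing = s3.list_objects_v2(Bucket=bucket, Prefix="_temporary/0/task_0/")
 keys = [obj["Key"] for obj in listing.get("Contents", [])]
 print("part-00000 visible:", any(k.endswith("part-00000") for k in keys))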

Finally, everything that is currently known about troubleshooting s3a, including the possible causes of 403 errors, is available online. Hope this helps, and if you identify another cause, patches are welcome.


I had the same problem and solved it by upgrading from aws-java-sdk:1.7.4 to aws-java-sdk:1.11.199 and from hadoop-aws:2.7.7 to hadoop-aws:3.0.0 .

However, to avoid dependency mismatches when interacting with AWS, I had to rebuild Spark and supply it with my own build of Hadoop 3.0.0.

My guess at the root cause is that the v4 signature algorithm takes the current timestamp as input, and all Spark executors then use the same signature to authenticate their PUT requests. But if one of them slips outside the time "window" allowed by the algorithm, that request and all subsequent requests fail, causing Spark to roll back the changes and error out. This explains why calling .coalesce(1) or .repartition(1) always works, while the failure rate climbs in proportion to the number of partitions being written.
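For completeness, a minimal sketch of the .coalesce(1) observation above (input and output paths are placeholders; a single output partition means only one part file has to be moved during the commit, at the cost of all write parallelism):

 from pyspark.sql import SparkSession

 # Sketch of the single-partition workaround mentioned above; paths are placeholders.
 spark = SparkSession.builder.getOrCreate()
 df = spark.read.parquet("s3a://my-bucket/input/")
 df.coalesce(1).write.csv("s3a://my-bucket/output/", mode="overwrite", compression="gzip")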



