EMRFS file synchronization with S3 does not work

After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and then tried to run the job again. I got the following error when writing Parquet to S3 using sqlContext.write:

'bucket/folder' present in the metadata but not s3
at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:455)
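
For context, the output objects really were gone from S3 at this point; a recursive listing of the prefix (bucket/folder stands in for my actual path) came back empty:

aws s3 ls s3://bucket/folder/ --recursive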

I tried to run

emrfs sync s3://bucket/folder

which did not appear to resolve the error, even though it deleted some records from the DynamoDB table that tracks the metadata. I'm not sure what else to try. How can I fix this error?

2 answers

It turned out that I needed to run

emrfs delete s3://bucket/folder

before running the sync. Deleting the stale metadata first and then syncing resolved the problem for me.
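
A minimal sketch of that fix, assuming s3://bucket/folder is the affected output path:

emrfs delete s3://bucket/folder   # remove the stale entries from the EMRFS metadata table
emrfs sync s3://bucket/folder     # rebuild the metadata to match what is actually in S3

The order matters: as the question notes, running sync on its own did not clear the error.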


Some background: EMRFS consistent view stores metadata about the objects in S3 in a DynamoDB table. Hadoop operations consult that table, so when it disagrees with what is actually in S3 you get consistency errors like the one above.

If you delete objects directly from S3, their records stay behind in DynamoDB, and that is exactly the inconsistency being reported. In that situation, emrfs delete removes the stale records from the metadata table:

emrfs delete s3://path

Then, to populate the metadata in DynamoDB from the objects that currently exist in S3, import them:

emrfs import s3://path

To synchronize the S3 data with the metadata:

emrfs sync s3://path

And to check what differences remain between the metadata and S3:

emrfs diff s3://path
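
Put together, a recovery sequence along these lines (run on the cluster's master node; s3://bucket/folder is a placeholder for the affected path) is a sketch rather than a prescription:

emrfs diff s3://bucket/folder     # inspect where metadata and S3 disagree
emrfs delete s3://bucket/folder   # drop the stale metadata records
emrfs import s3://bucket/folder   # re-import the objects that do exist in S3
emrfs sync s3://bucket/folder     # bring metadata and S3 back in line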

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-cli-reference.html
