EMRFS file synchronization with S3 does not work

After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and then tried to run the job again. I got the following error when writing Parquet to S3 using sqlContext.write:

'bucket/folder' present in the metadata but not s3
at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:455)
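
For context, the output objects really were gone from S3 at this point; a recursive listing of the prefix (bucket/folder stands in for my actual path) came back empty:

aws s3 ls s3://bucket/folder/ --recursive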

I tried to run

emrfs sync s3://bucket/folder

which did not appear to resolve the error, even though it deleted some records from the DynamoDB table that tracks the metadata. I'm not sure what else to try. How can I fix this error?

2 answers

It turned out that I needed to run

emrfs delete s3://bucket/folder

before running the sync. Deleting the stale metadata first and then syncing resolved the problem for me.
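
A minimal sketch of that fix, assuming s3://bucket/folder is the affected output path:

emrfs delete s3://bucket/folder   # remove the stale entries from the EMRFS metadata table
emrfs sync s3://bucket/folder     # rebuild the metadata to match what is actually in S3

The order matters: as the question notes, running sync on its own did not clear the error.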


Some background: EMRFS consistent view stores metadata about the objects in S3 in a DynamoDB table. Hadoop operations consult that table, so when it disagrees with what is actually in S3 you get consistency errors like the one above.

If you delete objects directly from S3, their records stay behind in DynamoDB, and that is exactly the inconsistency being reported. In that situation, emrfs delete removes the stale records from the metadata table:

emrfs delete s3://path

Then, to populate the metadata in DynamoDB from the objects that currently exist in S3, import them:

emrfs import s3://path

To synchronize the S3 data with the metadata:

emrfs sync s3://path

And to check what differences remain between the metadata and S3:

emrfs diff s3://path
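
Put together, a recovery sequence along these lines (run on the cluster's master node; s3://bucket/folder is a placeholder for the affected path) is a sketch rather than a prescription:

emrfs diff s3://bucket/folder     # inspect where metadata and S3 disagree
emrfs delete s3://bucket/folder   # drop the stale metadata records
emrfs import s3://bucket/folder   # re-import the objects that do exist in S3
emrfs sync s3://bucket/folder     # bring metadata and S3 back in line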

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-cli-reference.html
