I work for a company that processes very large CSV files. Customers upload a file to Amazon S3 through a file picker. Several server processes then read the file in parallel (i.e. starting from different byte offsets) to process it and save it to the database. Clients can optionally zip the file before uploading.
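For context, the parallel reads of the plain CSV look roughly like this (a minimal boto3 sketch; the bucket name, key, and byte range are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def read_chunk(bucket, key, start, end):
    """Fetch bytes [start, end] of the object via an HTTP Range request."""
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

# Each worker process handles a different slice of the uncompressed CSV.
chunk = read_chunk("my-bucket", "uploads/data.csv", 0, 64 * 1024 * 1024 - 1)
```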
- Am I right that the ZIP format does not allow a single file to be decompressed in parallel? That is, several processes cannot read the ZIP file from different offsets (possibly with some overlap between blocks) and stream uncompressed data from there?
If I'm right, then I need a way to take a ZIP file on S3 and create an unzipped CSV, also on S3.
- Does Amazon provide any service that can accomplish this simply? I was hoping Data Pipeline would do the job, but it appears to have limitations. For example, “CopyActivity does not support copying multipart Amazon S3 files” (source), which suggests I cannot use it to unzip anything larger than 5 GB. My understanding of Data Pipeline is very limited, so I don’t know how suitable it is for this task or where to look.
- Is there a SaaS that does this job?
I could write code to download, decompress, and multipart-upload the file back to S3, but I was hoping for an efficient, easily scalable solution. AWS Lambda would be ideal for running the code (to avoid provisioning unnecessary resources), but its execution time is limited to 60 seconds. Plus, the use case seems so simple and common that I expect an existing solution already exists.
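What I have in mind for that fallback is roughly the following (a boto3 sketch; bucket and key names are placeholders, and it assumes the archive contains a single CSV and fits on local disk, which is exactly the resource overhead I'd like to avoid):

```python
import zipfile
import boto3

s3 = boto3.client("s3")

def unzip_on_s3(bucket, zip_key, csv_key):
    # Stage the archive locally, then stream the extracted entry back to S3.
    s3.download_file(bucket, zip_key, "/tmp/archive.zip")
    with zipfile.ZipFile("/tmp/archive.zip") as zf:
        member = zf.namelist()[0]  # assume a single CSV inside the archive
        with zf.open(member) as extracted:
            # upload_fileobj performs a multipart upload under the hood,
            # so the uncompressed CSV never has to fit in memory at once.
            s3.upload_fileobj(extracted, bucket, csv_key)
```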