I work for a company that processes very large CSV files. Customers upload a file to Amazon S3 through a file picker. Several server processes then read the file in parallel (i.e. starting from different byte offsets) to process it and save it to the database. Clients can optionally zip the file before uploading.
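For context, the parallel reads of the plain CSV look roughly like this (a minimal boto3 sketch; the bucket name, key, and byte range are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def read_chunk(bucket, key, start, end):
    """Fetch bytes [start, end] of the object via an HTTP Range request."""
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

# Each worker process handles a different slice of the uncompressed CSV.
chunk = read_chunk("my-bucket", "uploads/data.csv", 0, 64 * 1024 * 1024 - 1)
```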
- Am I right that the ZIP format does not allow a single file to be decompressed in parallel? That is, several processes cannot read the ZIP file from different offsets (possibly with some overlap between blocks) and stream uncompressed data from there?
If I'm right, then I need a way to take a ZIP file on S3 and create an unzipped CSV, also on S3.
- Does Amazon provide any service that can accomplish this simply? I was hoping Data Pipeline would do the job, but it appears to have limitations. For example, “CopyActivity does not support copying multipart Amazon S3 files” (source), which suggests I cannot use it to unzip anything larger than 5 GB. My understanding of Data Pipeline is very limited, so I don’t know how suitable it is for this task or where to look.
- Is there a SaaS that does this job?
I could write code to download, decompress, and multipart-upload the file back to S3, but I was hoping for an efficient, easily scalable solution. AWS Lambda would be ideal for running the code (to avoid provisioning unnecessary resources), but its execution time is limited to 60 seconds. Plus, the use case seems so simple and common that I expect an existing solution already exists.
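What I have in mind for that fallback is roughly the following (a boto3 sketch; bucket and key names are placeholders, and it assumes the archive contains a single CSV and fits on local disk, which is exactly the resource overhead I'd like to avoid):

```python
import zipfile
import boto3

s3 = boto3.client("s3")

def unzip_on_s3(bucket, zip_key, csv_key):
    # Stage the archive locally, then stream the extracted entry back to S3.
    s3.download_file(bucket, zip_key, "/tmp/archive.zip")
    with zipfile.ZipFile("/tmp/archive.zip") as zf:
        member = zf.namelist()[0]  # assume a single CSV inside the archive
        with zf.open(member) as extracted:
            # upload_fileobj performs a multipart upload under the hood,
            # so the uncompressed CSV never has to fit in memory at once.
            s3.upload_fileobj(extracted, bucket, csv_key)
```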