Extract RAR files from Google Cloud Storage

I have a large CSV file compressed into a multi-part archive with the RAR utility (uncompressed 100 GB, compressed 20 GB), so I have 100 RAR part files that were uploaded to Google Cloud Storage. I need to extract the archive within Google Cloud Storage. It would be better if I could use Python on GAE. Any ideas? I do not want to download, extract, and upload. I want to do it all in the cloud.

2 answers

It is not possible to unpack a RAR archive directly in the cloud. Are you familiar with the gsutil -m option (multi-threading / multi-processing)? It speeds up transfers by running them in parallel. I would suggest the following sequence:

  • download the compressed archive files
  • extract them locally
  • upload the extracted files in parallel with gsutil -m cp file-pattern dest-bucket
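
The sequence above could be scripted roughly as follows. This is only a sketch: the bucket name, the archive prefix, the first-volume file name, and the working directory are all hypothetical placeholders, and it assumes that gsutil and the unrar command-line tool are installed on the machine doing the work.

```python
import os
import subprocess


def build_commands(bucket, archive_prefix, first_volume, work_dir="work"):
    """Return the download, extract, and upload command lines (all names are placeholders)."""
    parts_dir = f"{work_dir}/parts"
    out_dir = f"{work_dir}/extracted"
    return [
        # 1. Download every archive part in parallel.
        ["gsutil", "-m", "cp", f"gs://{bucket}/{archive_prefix}*", f"{parts_dir}/"],
        # 2. Extract; for a multi-volume set, unrar is pointed at the first volume.
        ["unrar", "x", f"{parts_dir}/{first_volume}", f"{out_dir}/"],
        # 3. Upload the extracted directory recursively, in parallel.
        ["gsutil", "-m", "cp", "-r", out_dir, f"gs://{bucket}/"],
    ]


def run_all(bucket, archive_prefix, first_volume, work_dir="work"):
    """Create the working directories, then run each step, stopping on the first failure."""
    os.makedirs(f"{work_dir}/parts", exist_ok=True)
    os.makedirs(f"{work_dir}/extracted", exist_ok=True)
    for cmd in build_commands(bucket, archive_prefix, first_volume, work_dir):
        subprocess.run(cmd, check=True)
```

A call such as run_all("my-bucket", "data.part", "data.part001.rar") would run the three steps in order; the actual name of the first volume depends on how the archive was created.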

Unless you have a very slow Internet connection, downloading 20 GB should not take very long (less than an hour, I would expect), and the same goes for the parallel upload (although that depends on how much parallelism you get, which in turn depends on the size of the archive files).

By the way, you can configure the parallelism used by gsutil -m with the parallel_thread_count and parallel_process_count variables in your $HOME/.boto file.
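
As an illustration, the snippet below generates the fragment that would go into $HOME/.boto. It is a sketch that assumes both settings live in the [GSUtil] section of that file; the counts 4 and 8 are arbitrary example values, not recommendations.

```python
import configparser
import io


def gsutil_parallelism_fragment(process_count, thread_count):
    """Render a [GSUtil] section holding the two parallelism settings."""
    config = configparser.ConfigParser()
    config["GSUtil"] = {
        "parallel_process_count": str(process_count),
        "parallel_thread_count": str(thread_count),
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Example values only; tune them to the machine and the network.
print(gsutil_parallelism_fragment(4, 8))
```

The printed text can be merged by hand into an existing .boto file rather than overwriting it, since that file usually also holds credentials.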


This question has already been answered (and an answer accepted), but for future similar use cases, I would recommend doing it entirely in the cloud by starting a tiny Linux instance on GCE, such as f1-micro, and then following the steps Mark Cohen suggested in his answer. The instances come with gsutil preinstalled, so it is easy to use. When you are done, just shut down and delete the micro instance, since the resulting files have already been saved in Google Cloud Storage.

Step by step instructions:

The advantage is that instead of downloading to your own computer, you transfer all the data within the Google cloud, so the transfer should be very fast, does not depend on your own Internet connection speed, and does not use any of your bandwidth.
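
The lifecycle of such a temporary instance might look like the sketch below, which only builds the gcloud command lines. The instance name, zone, and disk size are hypothetical; the 200 GB boot disk is an assumption chosen so that both the 20 GB of archive parts and the 100 GB of extracted data fit on the instance.

```python
def vm_lifecycle_commands(name, zone, machine_type="f1-micro", disk_gb=200):
    """Return create, ssh, and delete command lines for a throwaway GCE instance."""
    create = [
        "gcloud", "compute", "instances", "create", name,
        "--zone", zone,
        "--machine-type", machine_type,
        # Assumed size: room for the archive parts plus the extracted files.
        "--boot-disk-size", f"{disk_gb}GB",
    ]
    # Log in to run the download, extract, and upload steps on the instance.
    ssh = ["gcloud", "compute", "ssh", name, "--zone", zone]
    # Remove the instance once the results are back in Cloud Storage.
    delete = ["gcloud", "compute", "instances", "delete", name, "--zone", zone]
    return {"create": create, "ssh": ssh, "delete": delete}


for step, cmd in vm_lifecycle_commands("rar-extractor", "us-central1-a").items():
    print(step, ":", " ".join(cmd))
```

Each command can be pasted into a terminal or passed to subprocess.run; the extraction tool itself (for example unrar) would still need to be installed on the instance.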


Note: network bandwidth is proportional to the size of the virtual machine (in vCPUs), so consider using a larger virtual machine to improve performance. Google Compute Engine bills VM instances as follows:

  • a minimum of 10 minutes
  • after that, rounded to the nearest minute

So, for example, given that n1-standard-1 costs USD 0.05 per hour (as of October 8, 2016), 15 minutes of use will cost only USD 0.0125.
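
The arithmetic can be checked with a few lines of Python. This sketch takes usage in whole minutes and applies only the 10-minute minimum; the function names are made up for illustration, and the rate is the 2016 figure quoted above, not a current price.

```python
def billed_minutes(minutes_used, minimum=10):
    """Apply the minimum charge; usage is given in whole minutes."""
    return max(minimum, minutes_used)


def usage_cost(hourly_rate_usd, minutes_used):
    """Cost in USD for the billed minutes at a given hourly rate."""
    return hourly_rate_usd * billed_minutes(minutes_used) / 60


# 15 minutes on an instance priced at USD 0.05 per hour.
print(f"{usage_cost(0.05, 15):.4f}")  # 0.0125
```

Under the same assumptions, a 5-minute run is billed as 10 minutes because of the minimum charge.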


Source: https://habr.com/ru/post/1440915/

