How to distribute fetching a key list from s3

I am trying to distribute the process of getting a list of 60 million keys (file names) from s3.

Background: I'm trying to process all the files in a folder, about 60 million of them, with pyspark. As detailed HERE, a typical sc.textFile('s3a://bucket/*') loads everything onto the driver and only then distributes it to the cluster. The proposed method is to first get a list of files, parallelize that list, and have each node fetch its own subset of files.
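For reference, a minimal pyspark sketch of that approach (the bucket name and the key list are placeholders; producing that key list is exactly the bottleneck described below):

    import boto3
    from pyspark import SparkContext

    sc = SparkContext()

    # Placeholder: in reality this is the list of ~60 million keys,
    # and producing it is the hard part.
    keys = ["00000001/part-0000", "00000002/part-0000"]

    def fetch(key):
        # Each executor creates its own client and reads its share of the objects.
        s3 = boto3.client("s3")
        return key, s3.get_object(Bucket="bucket", Key=key)["Body"].read()

    # Many slices would be used for the real 60M-key list.
    contents = sc.parallelize(keys, numSlices=10000).map(fetch)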

Problem: this method still bottlenecks on the "get a list of files" step when the list is this large. Getting the list of keys (file names) in the s3 bucket also has to be distributed for the approach to be effective.

What I've tried: I have tried two different methods:

  • The python aws api (boto3), which paginates the results. Ideally we could estimate the number of pages and split the range so that node 1 requests pages 1-100, node 2 requests pages 101-200, and so on. Unfortunately you cannot request an arbitrary page: each request needs the "next token" returned by the previous page, so the results behave like a linked list (see the sketch below).

  • The aws cli, which lets you apply exclude and include filters. Since the file names all start with an 8-digit number, I could in theory have node one request the full list of files matching 10*, a second node request the full list matching 11*, and so on. This is done with something like:

    aws s3 cp s3://bucket/ . --recursive --exclude "*" --include "10*" --dryrun

Unfortunately, it seems to do a full scan of the bucket for each request instead of using any kind of index, since each request hangs for more than 15 minutes.
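To make the pagination limitation from the first approach concrete, this is roughly what the boto3 listing loop looks like (bucket name is a placeholder): each page only hands back the token for the next page, so page ranges cannot be handed out to different nodes.

    import boto3

    s3 = boto3.client("s3")
    kwargs = {"Bucket": "bucket", "MaxKeys": 1000}
    keys = []
    while True:
        resp = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            break
        # The only way to reach page N+1 is the token returned by page N.
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]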

Is there a way to make either approach viable? Is there a third option? I'm sure I'm not the only one who has millions of s3 files to digest.

2 answers

If you need a listing of your Amazon S3 content and it does not have to be completely up to date, you can use Amazon S3 Storage Inventory, which produces a daily CSV listing of all files in an S3 bucket. You can then use that list to drive your pyspark tasks.

Similarly, you could maintain a database of all files and update it whenever objects are added to or removed from the bucket, using Amazon S3 Event Notifications. That way the list is always up to date and available for your pyspark tasks.
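A rough sketch of consuming such an inventory report from pyspark; the inventory location and column layout below are assumptions and depend on how the inventory is configured:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumed path and schema of a CSV-format inventory report; adjust to your setup.
    inv = spark.read.csv(
        "s3a://inventory-bucket/source-bucket/daily/data/*.csv.gz",
        schema="bucket STRING, key STRING, size LONG, last_modified STRING",
    )

    # Keep the key list distributed instead of collecting it on the driver.
    key_rdd = inv.select("key").rdd.map(lambda r: r.key)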

You can use the Prefix parameter of list_objects_v2 if your file names lend themselves to being split up that way.
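For example, a sketch of splitting the listing by prefix across the cluster, assuming the question's two-digit leading prefixes (10 through 99; adjust the range to your key space) and a placeholder bucket name:

    import boto3
    from pyspark import SparkContext

    sc = SparkContext()

    def list_prefix(prefix):
        # Each executor lists one slice of the key space independently.
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        keys = []
        for page in paginator.paginate(Bucket="bucket", Prefix=prefix):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        return keys

    prefixes = [str(i) for i in range(10, 100)]
    all_keys = sc.parallelize(prefixes, len(prefixes)).flatMap(list_prefix)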
