I am trying to parallelize the process of getting a list of 60 million keys (file names) from S3.
Background: I'm trying to process all the files in a folder, about 60 million of them, through PySpark. As detailed HERE, a typical sc.textFile('s3a://bucket/*') will load all the data onto the driver and then distribute it to the cluster. The proposed workaround is to first get a list of the files, parallelize that list, and then have each node read its own subset of files.
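For concreteness, here is a minimal sketch of that pattern, assuming the key list has already been built on the driver (the bucket name and the source of the key list are placeholders):

import boto3
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Placeholder: the full list of object keys, built on the driver
# (e.g. from a manifest) -- this is the step that becomes the bottleneck.
keys = []

def fetch_objects(key_iter):
    # Each partition creates its own S3 client and downloads its share of keys.
    s3 = boto3.client('s3')
    for key in key_iter:
        body = s3.get_object(Bucket='bucket', Key=key)['Body'].read()
        yield key, body

# Spread the keys over many partitions so each executor pulls a subset.
rdd = sc.parallelize(keys, numSlices=1000).mapPartitions(fetch_objects)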
Problem: This method still has a bottleneck at the "get a list of files" step if that list is large enough. Getting the list of keys (file names) in the S3 bucket must itself be distributed for the whole approach to be effective.
What I've tried: I tried two different methods:
Using the Python AWS API (boto3), which paginates the results. Ideally we could estimate the number of pages and distribute ranges so that node 1 requests pages 1-100, node 2 requests pages 101-200, and so on. Unfortunately, you cannot request an arbitrary page: you have to take the "next token" from the previous page, so the results behave like a linked list.
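For reference, the sequential walk with boto3 looks roughly like this (the bucket name is a placeholder); each page only hands back the token needed for the following page, so there is no way to jump straight to page N:

import boto3

s3 = boto3.client('s3')

keys = []
token = None
while True:
    kwargs = {'Bucket': 'bucket', 'MaxKeys': 1000}
    if token:
        # The only way to get page N is to have already fetched page N-1.
        kwargs['ContinuationToken'] = token
    resp = s3.list_objects_v2(**kwargs)
    keys.extend(obj['Key'] for obj in resp.get('Contents', []))
    if not resp.get('IsTruncated'):
        break
    token = resp['NextContinuationToken']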
The AWS CLI, which lets you apply exclude and include filters. Since the file names start with an 8-digit number, I could in theory have node one request a full listing of the files that match 10*, a second node request a full listing of the files that match 11*, and so on. This is done with:
aws s3 --recursive --exclude="*" --include="10*" s3://bucket/
Unfortunately, it seems to perform a full scan for each request instead of using any index, since it hangs for more than 15 minutes per request.
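For what it's worth, the same prefix-splitting idea can also be expressed with boto3, where the Prefix parameter of ListObjectsV2 is applied server-side rather than as a client-side filter. A sketch, untested at this scale, with the bucket name and the two-digit prefix width as assumptions:

import boto3
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Assumed: split the 8-digit file names on their first two digits, giving 100 prefixes.
prefixes = ['{:02d}'.format(i) for i in range(100)]

def list_prefix(prefix):
    # Each executor lists only the keys under its own prefix.
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket='bucket', Prefix=prefix):
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys

key_rdd = sc.parallelize(prefixes, len(prefixes)).flatMap(list_prefix)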
Is there any way to make either of these approaches viable? Is there a third option? I'm sure I'm not the only one who has millions of S3 files to digest.