Is there a multithreaded algorithm for listing all the keys in a large S3 bucket?

For S3 buckets containing a very large number of keys, enumerating the keys via the REST API is painfully slow, because:

  • You can only retrieve 1,000 keys per request.
  • The only way to reach, say, the 5001st key (as far as I can tell) is to list the first 1,000 keys, list the next 1,000 starting from the marker returned in that response, and keep going until you pass 5000 (see the sketch after this list).
  • S3 REST API requests have high latency; a single request for 1,000 keys typically takes a couple of seconds.
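For reference, a minimal sketch of that serial pagination loop, assuming the boto3 client and a placeholder bucket name (the question predates `list_objects_v2`, but the pagination constraint is the same: no request can start until the previous response's token has arrived):

```python
import boto3

s3 = boto3.client("s3")

def list_all_keys(bucket):
    """Sequentially walk the bucket, 1,000 keys per round trip."""
    keys = []
    kwargs = {"Bucket": bucket, "MaxKeys": 1000}
    while True:
        resp = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            return keys
        # The next request cannot be issued until this token is known,
        # which is what makes plain enumeration inherently serial.
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]

keys = list_all_keys("my-bucket")   # "my-bucket" is a placeholder
print(len(keys))
```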

Given that 100 simultaneous listing requests shouldn't slow down any individual request, this process seems ripe for parallelization. But if my algorithm is "dumb" and just splits the possible key space at predefined markers (for example '', 'a', 'b', 'c', 'd', 'e', ...), it won't actually speed up key enumeration in a bucket where every key begins with 'images/'.
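Here is a sketch of that "dumb" split, again assuming boto3 and a placeholder bucket name, with the weakness noted in the final comment:

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"   # placeholder

def list_prefix(prefix):
    """List every key under one fixed prefix, serially."""
    keys = []
    kwargs = {"Bucket": BUCKET, "Prefix": prefix, "MaxKeys": 1000}
    while True:
        resp = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]

# One worker per predefined first character.
prefixes = list("abcdefghijklmnopqrstuvwxyz")
with ThreadPoolExecutor(max_workers=len(prefixes)) as pool:
    results = pool.map(list_prefix, prefixes)

# If every key starts with "images/", only the "i" worker does any real work,
# so the listing is still effectively sequential.
all_keys = [k for chunk in results for k in chunk]
```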

So I'm wondering whether anyone with real-world S3 experience knows a better way to traverse a bucket's key space, or has experimented with an adaptive (i.e. "not dumb") algorithm for speeding up key listing with concurrent requests.

+4
1 answer

Perhaps some kind of binary-search-style algorithm would help? E.g. start listings at the markers '' and 'm', then split each half again halfway, and so on. I think you would end up fetching each key at most twice or so - you stop issuing further requests for a range once you already hold its next marker from another listing.

How do you choose where to split? Perhaps a split on every pass: start with ''; when those results come back, if the response indicates there are more keys, kick off a continuation at that response's "NextMarker" PLUS a new search starting halfway between the "NextMarker" and 'z'. Repeat. Use a hash/set so each key is stored only once.

Since the requests complete on different threads, you'll need a lock around adding keys to that set. The problem then becomes holding the lock briefly enough that it doesn't slow everything down, so the details will depend on what language and so on you're using. A rough sketch of the whole idea is below.
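This is a sketch only, under several assumptions: Python with boto3, `list_objects_v2` with `StartAfter` as a modern stand-in for the marker mechanism the answer describes, a placeholder bucket name, ASCII-only key names (so '\x7f' works as an upper bound), and a simple work queue instead of a real job framework. Each worker lists a (start, end) slice of the key space; when a page comes back truncated, the remaining slice is cut in half and both halves are listed concurrently, and a lock-protected set stores each key only once.

```python
import queue
import threading

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"           # assumption: placeholder bucket name
UPPER = "\x7f"                 # assumption: keys are plain ASCII
WORKERS = 32

all_keys = set()
keys_lock = threading.Lock()   # the lock discussed above
work = queue.Queue()           # (start, end) slices of the key space

def midpoint(lo, hi):
    # Crude string midpoint: copy characters from lo until a strictly larger
    # "halfway" character can be placed, then cut there. It only decides where
    # a range gets split, so it does not need to be exact.
    length = max(len(lo), len(hi)) + 1
    lo_p, hi_p = lo.ljust(length, "\x00"), hi.ljust(length, UPPER)
    out = []
    for a, b in zip(lo_p, hi_p):
        m = (ord(a) + ord(b)) // 2
        if m > ord(a):
            out.append(chr(m))
            return "".join(out)
        out.append(a)
    return lo                  # degenerate case: no cut point found

def worker():
    while True:
        start, end = work.get()
        try:
            kwargs = {"Bucket": BUCKET, "MaxKeys": 1000}
            if start:
                kwargs["StartAfter"] = start
            resp = s3.list_objects_v2(**kwargs)
            contents = [o["Key"] for o in resp.get("Contents", [])]
            page = [k for k in contents if k < end]
            with keys_lock:    # keep the critical section as small as possible
                all_keys.update(page)
            # The slice is finished if S3 has no more keys, or if the last key
            # returned already lies past this slice's upper bound.
            if resp.get("IsTruncated") and contents and contents[-1] < end:
                marker = contents[-1]
                mid = midpoint(marker, end)
                if marker < mid < end:
                    # Continue this slice up to the midpoint AND start a new
                    # listing at the midpoint, so both halves run in parallel.
                    work.put((marker, mid))
                    work.put((mid, end))
                else:
                    work.put((marker, end))   # too narrow to split further
        finally:
            work.task_done()

for _ in range(WORKERS):
    threading.Thread(target=worker, daemon=True).start()

work.put(("", UPPER))
work.join()                    # returns once every queued slice is processed
print(len(all_keys), "keys found")
```

The queue's `join()` handles completion tracking (a worker queues a slice's children before marking the slice done), and the lock is only held for the set update, which keeps contention low - roughly the trade-off described above.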

You may be able to do this faster if your process runs on an EC2 instance in the same region as the S3 files. Say the files are in US Standard; then you're in luck, and you could use Ruby with something like IronWorker to go in and pull down all the keys. When it finishes, it could push the list to your server, or write a file to S3 that lists all the keys, or similar. For other regions or languages you may need to spin up your own EC2 instance.
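A tiny sketch of that last step, writing the collected key list back to S3 - in Python rather than the Ruby/IronWorker setup mentioned above, with placeholder bucket and key names, and reusing the `all_keys` set from the earlier sketch:

```python
import boto3

s3 = boto3.client("s3")
# One key per line; the object name "listings/all-keys.txt" is a placeholder.
listing = "\n".join(sorted(all_keys)).encode("utf-8")
s3.put_object(Bucket="my-bucket", Key="listings/all-keys.txt", Body=listing)
```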

I've found that listing S3 keys is much faster from an EC2 instance, since there is plenty of bandwidth for each request (and you don't pay for it from EC2). S3 does NOT gzip its responses, which are very verbose XML, so the bandwidth between you and S3 is crucial.

+1

Source: https://habr.com/ru/post/1389972/
