Perhaps some form of binary search would help? E.g. start with searches at the prefixes '' and 'm', then split halfway again, and so on. I think you would end up receiving each key at most twice or so, since a search can stop requesting more pages once it reaches a marker you already have.
How to choose where to start? Perhaps subdivide on each cycle: start at '', and when those results return, if they say there are more keys, continue from nextmarker in this search PLUS start a new search halfway between nextmarker and 'z'. Repeat. Use a hash-like structure (a set) to store every key only once.
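To make the splitting idea concrete, here is a minimal Python sketch. It does not talk to S3: `list_page` is a hypothetical stand-in for a single list request (keys strictly after a marker, plus a "truncated" flag), and `PAGE_SIZE`, `midpoint` and `list_all_keys` are names I made up for illustration. It also assumes the keys are plain ASCII. The ranges are taken from a queue one at a time here; in practice each range would be its own request on its own thread.

```python
from bisect import bisect_right
from collections import deque

PAGE_SIZE = 3  # tiny page size for demonstration; real S3 pages are much larger


def list_page(bucket, marker):
    """Hypothetical stand-in for one list request on a sorted list of keys."""
    start = bisect_right(bucket, marker)          # keys strictly after marker
    page = bucket[start:start + PAGE_SIZE]
    truncated = start + PAGE_SIZE < len(bucket)   # more keys remain after this page
    return page, truncated


def midpoint(lo, hi, width=8):
    """Return a string roughly halfway between lo and hi (first `width` bytes only)."""
    def to_int(s):
        raw = s.encode("latin-1")[:width].ljust(width, b"\x00")
        return int.from_bytes(raw, "big")
    mid = (to_int(lo) + to_int(hi)) // 2
    return mid.to_bytes(width, "big").rstrip(b"\x00").decode("latin-1")


def list_all_keys(bucket, lo="", hi="\x7f"):
    """Collect every key by splitting the remaining range whenever a page is truncated."""
    seen = set()                  # stores each key only once
    calls = 0
    ranges = deque([(lo, hi)])    # each entry covers keys k with start < k <= end
    while ranges:
        start, end = ranges.popleft()
        page, truncated = list_page(bucket, start)
        calls += 1
        seen.update(k for k in page if k <= end)
        # This range is finished if the listing ended or reached its upper bound.
        if not page or not truncated or page[-1] >= end:
            continue
        next_marker = page[-1]
        mid = midpoint(next_marker, end)
        if next_marker < mid < end:
            ranges.append((next_marker, mid))   # continue the current search
            ranges.append((mid, end))           # PLUS a new search from halfway
        else:
            ranges.append((next_marker, end))   # cannot split further; go on in order
    return seen, calls


bucket = sorted(f"{p}/{i:02d}" for p in ["alpha", "beta", "gamma", "zeta"] for i in range(5))
keys, calls = list_all_keys(bucket)
print(len(keys), "keys in", calls, "calls")
```

The upper half of each split can overshoot into keys another range also returns, which is where the "each key at most twice or so" cost comes from; the set absorbs those duplicates.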
Since the requests all return on different threads, you will need a lock around adding keys to the shared collection. Then you have the problem of holding that lock briefly enough that it doesn't slow things down, so the details will depend on what language etc. you're using.
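As a rough illustration of the locking point, here is a small Python sketch. `KeyCollector`, `fetch_range` and `worker` are hypothetical names, and `fetch_range` just fabricates keys instead of calling S3. The only point it tries to show is that the slow request happens outside the lock and the lock is held just for the in-memory update.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class KeyCollector:
    """Thread-safe set of keys; duplicates are dropped automatically."""

    def __init__(self):
        self._keys = set()
        self._lock = threading.Lock()

    def add_page(self, page):
        # Hold the lock only for the in-memory update.
        with self._lock:
            self._keys.update(page)

    def snapshot(self):
        with self._lock:
            return set(self._keys)


def fetch_range(prefix):
    """Hypothetical stand-in for a slow listing request; returns made-up keys."""
    return [f"{prefix}/{i:03d}" for i in range(100)]


def worker(collector, prefix):
    page = fetch_range(prefix)   # slow part, done without the lock
    collector.add_page(page)     # fast part, done under the lock


collector = KeyCollector()
with ThreadPoolExecutor(max_workers=4) as pool:
    for prefix in ["a", "b", "c", "d"]:
        pool.submit(worker, collector, prefix)
# Leaving the with-block waits for all submitted workers to finish.
keys = collector.snapshot()
print(len(keys))
```

Adding a whole page per lock acquisition, rather than one key at a time, keeps contention low in this sketch; other languages may offer a concurrent set that makes the explicit lock unnecessary.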
You may be able to do this faster if your process runs on an EC2 instance in the same region as the S3 files. Say the files are in US Standard. Then you're in luck: you can use Ruby and something like IronWorker to get in there and download all the keys. When it's done, it can post to your server or write a file to S3 that lists all the keys, or similar. For other regions or languages, you may need to launch your own EC2 instance.
I've found that listing S3 keys is much faster from an EC2 instance, since there is a lot of bandwidth available for each request (and you don't pay for it on EC2). S3 does NOT gzip its responses, which are super fluffy XML, so the bandwidth between you and S3 is critical.