Boto S3 API does not return a complete list of keys

I am using the boto S3 API in a Python script that slowly copies data from S3 to my local file system. The script worked well for several days, but now there is a problem.

I use the following API function to get a list of keys in a "directory":

keys = bucket.get_all_keys(prefix=dirname) 

This function (get_all_keys) does not always return the full list of keys: I can see more keys through the AWS web console or via aws s3 ls s3://path .

Reproduced with boto versions 2.15 and 2.30.

Could boto be caching some of my S3 requests (I repeat the same requests over and over)? How can I solve this problem, any suggestions?
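A minimal sketch of the call in question (the bucket and prefix names below are placeholders, not the real ones); printing the length is how the mismatch with the web console shows up:

    import boto

    # Placeholder names for illustration only.
    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-bucket')
    dirname = 'some/prefix/'

    keys = bucket.get_all_keys(prefix=dirname)
    print(len(keys))  # fewer keys than the AWS web console / `aws s3 ls` reports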

+6

4 answers

There is an easier way. The Bucket object itself can act as an iterator, and it knows how to handle paginated responses, so if more results are available it automatically fetches them behind the scenes. Something like this should let you iterate over all the objects in your bucket:

    for key in bucket:
        # do something with your key

If you want to specify a prefix and get a list of all keys starting with this prefix, you can do this as follows:

    for key in bucket.list(prefix='foobar'):
        # do something with your key

Or, if you really want to create a list of objects, simply do the following:

 keys = [k for k in bucket] 

Note, however, that buckets can contain an unlimited number of keys, so be careful with this, because it will load all of the keys into memory.
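Tying this back to the copy-to-local use case in the question, here is a rough sketch (bucket name, prefix, and destination directory are placeholders) that lists a prefix with bucket.list() and downloads each key with get_contents_to_filename():

    import os
    import boto

    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-bucket')   # placeholder bucket name
    dest_dir = '/tmp/backup'                # placeholder local destination

    # bucket.list() follows the paginated responses transparently,
    # so the loop sees every key under the prefix, not just the first 1000.
    for key in bucket.list(prefix='foobar/'):
        local_path = os.path.join(dest_dir, key.name)
        local_dir = os.path.dirname(local_path)
        if local_dir and not os.path.isdir(local_dir):
            os.makedirs(local_dir)
        key.get_contents_to_filename(local_path)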

+12

I just managed to get it to work! It turned out that I had 1013 keys in my S3 "directory", and get_all_keys can return at most 1000 keys because of an AWS API restriction.

The solution is simple: just use the higher-level list() function, without the delimiter parameter:

 keys = list(bucket.list(prefix=dirname)) 
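A quick way to confirm the limit on your own bucket (reusing bucket and dirname from the question) is to compare the two calls:

    # get_all_keys() returns a single page, capped at 1000 keys by the S3 API
    print(len(bucket.get_all_keys(prefix=dirname)))   # -> 1000

    # bucket.list() lazily follows the pagination, so it yields every key
    print(len(list(bucket.list(prefix=dirname))))     # -> 1013 in my case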
+5

You need to paginate through the results by making multiple requests. list() will do this for you automatically, but you can use the example below for more control or to resume failed requests.

This iterative approach is also more scalable if you are working with millions of objects.

    marker = None
    while True:
        keys = bucket.get_all_keys(marker=marker)
        last_key = None
        for k in keys:
            # TODO: do something with your keys!
            last_key = k.name
        if not keys.is_truncated:
            break
        marker = last_key

The ResultSet docs (linked from the get_all_keys() docs) say that this should be handled automatically by an iterator, but it isn't. :(
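For the "resume failed requests" case mentioned above, one possible approach (the checkpoint file below is just an illustration, not part of boto) is to persist the marker between runs:

    import os

    MARKER_FILE = '/tmp/s3_copy.marker'   # hypothetical checkpoint location

    # Pick up where a previous, interrupted run left off.
    marker = None
    if os.path.exists(MARKER_FILE):
        with open(MARKER_FILE) as f:
            marker = f.read().strip() or None

    while True:
        keys = bucket.get_all_keys(marker=marker)
        last_key = None
        for k in keys:
            # TODO: do something with your keys!
            last_key = k.name
        if not keys.is_truncated:
            break
        marker = last_key
        # Checkpoint progress so a crashed run can resume from this marker.
        with open(MARKER_FILE, 'w') as f:
            f.write(marker)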

+3

Use a paginator in boto3; this function should give you the answer:

    import boto3

    client = boto3.client("s3")

    def s3_list_files(bucket_name, prefix):
        paginator = client.get_paginator("list_objects")
        page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
        keys = []
        for page in page_iterator:
            # Pages with no matching objects have no "Contents" entry.
            if "Contents" in page:
                for key in page["Contents"]:
                    keys.append(key["Key"])
        return keys
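For example (bucket name and prefix are placeholders):

    keys = s3_list_files("my-bucket", "some/prefix/")
    print(len(keys))   # every matching key, not just the first 1000

The same pattern works with the "list_objects_v2" paginator, the newer listing API in boto3.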
+1
