How to get multiple objects from S3 using boto3 get_object (Python 2.7)

I have 100,000 objects stored in S3. I need to download a subset of these objects (anywhere from 5 to ~3000) and read the binary contents of each one. From reading the boto3 / AWS CLI docs, it appears there is no way to get several objects in one request, so I currently implement this as a loop that builds the key of each object, requests the object, and then reads its body:

    for column_key in outstanding_column_keys:
        try:
            s3_object_key = "%s%s-%s" % (path_prefix, key, column_key)
            data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)
            metadata_dict = data_object["Metadata"]
            metadata_dict["key"] = column_key
            metadata_dict["version"] = float(metadata_dict["version"])
            metadata_dict["data"] = data_object["Body"].read()
            records.append(Record(metadata_dict))
        except Exception as exc:
            logger.info(exc)
    if len(records) < len(column_keys):
        raise Exception("Some objects are missing!")
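One refinement I am aware of (a minimal sketch, assuming the same self.s3_client, bucket_key, s3_object_key, and logger as above): catching botocore's ClientError instead of a bare Exception makes it possible to tell a genuinely missing key apart from other failures:

    import botocore.exceptions

    try:
        data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)
    except botocore.exceptions.ClientError as exc:
        # "NoSuchKey" means the object is absent (or not yet visible);
        # anything else is a different failure and should propagate.
        if exc.response["Error"]["Code"] == "NoSuchKey":
            logger.info("Key not available yet: %s", s3_object_key)
        else:
            raise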

My problem is that when I try to get several objects (for example, 5 objects), I get back only 3, and some are still unprocessed by the time all the keys have been checked; I handle that case with a custom exception. As a workaround I wrapped the above snippet in a while loop, since I know which keys are still outstanding:

    while (len(outstanding_column_keys) > 0) and (load_attempts < 10):
        for column_key in list(outstanding_column_keys):
            try:
                s3_object_key = "%s%s-%s" % (path_prefix, key, column_key)
                data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)
                metadata_dict = data_object["Metadata"]
                metadata_dict["key"] = column_key
                metadata_dict["version"] = float(metadata_dict["version"])
                metadata_dict["data"] = data_object["Body"].read()
                records.append(Record(metadata_dict))
                outstanding_column_keys.remove(column_key)  # fetched; stop retrying it
            except Exception as exc:
                logger.info(exc)
        load_attempts += 1
    if len(records) < len(column_keys):
        raise Exception("Some objects are missing!")

But I hesitate to do this, because I suspect S3 is actually still processing the outstanding responses, and the while loop would unnecessarily make additional requests for objects that S3 is already in the process of returning.
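One alternative to blind retries that I have seen (a minimal sketch, assuming the same self.s3_client, bucket_key, and s3_object_key as above): boto3 exposes an "object_exists" waiter that polls the key until it is visible, so the subsequent get_object should succeed:

    # Poll until the key is visible, then fetch it.
    waiter = self.s3_client.get_waiter("object_exists")
    waiter.wait(
        Bucket=bucket_key,
        Key=s3_object_key,
        WaiterConfig={"Delay": 1, "MaxAttempts": 10},  # check every 1s, up to 10 times
    )
    data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)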

I did a separate test to verify that get_object requests are synchronous, and it seems they are:

    import boto3
    import time
    import os

    s3_client = boto3.client('s3',
                             aws_access_key_id=os.environ["S3_AWS_ACCESS_KEY_ID"],
                             aws_secret_access_key=os.environ["S3_AWS_SECRET_ACCESS_KEY"])

    print "Saving 3000 objects to S3..."
    start = time.time()
    for x in xrange(3000):
        key = "greeting_{}".format(x)
        s3_client.put_object(Body="HelloWorld!", Bucket='bucket_name', Key=key)
    end = time.time()
    print "Done saving 3000 objects to S3 in %s" % (end - start)

    print "Sleeping for 20 seconds before trying to load the saved objects..."
    time.sleep(20)

    print "Loading the saved objects..."
    arr = []
    start_load = time.time()
    for x in xrange(3000):
        key = "greeting_{}".format(x)
        try:
            obj = s3_client.get_object(Bucket='bucket_name', Key=key)
            arr.append(obj)
        except Exception as exc:
            print exc
    end_load = time.time()
    print "Done loading the saved objects. Found %s objects. Time taken - %s" % (len(arr), end_load - start_load)
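For completeness, since each get_object call is independent, the serial load loop above could also be issued from a thread pool. A minimal Python 2.7 sketch under the same assumptions as the test above (bucket name 'bucket_name', keys named "greeting_<n>"; boto3 clients are generally considered safe to share across threads):

    from multiprocessing.dummy import Pool  # thread-backed Pool in the 2.7 stdlib

    def fetch(key):
        try:
            return s3_client.get_object(Bucket='bucket_name', Key=key)["Body"].read()
        except Exception as exc:
            print exc
            return None

    keys = ["greeting_{}".format(x) for x in xrange(3000)]
    pool = Pool(16)  # 16 worker threads issuing requests concurrently
    bodies = pool.map(fetch, keys)
    pool.close()
    pool.join()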

My questions, and what I need to know, are:

  • Are get_object requests synchronous? If so, I would expect that when I check for the loaded objects in the first code snippet, all of them should be returned.
  • If get_object requests are asynchronous, how should I handle the responses so as to avoid making additional requests to S3 for objects that are still in the process of being returned?
  • Additional clarification of, or corrections to, any of my assumptions about S3 would also be appreciated.

Thanks!


