Poor file download performance from S3 with boto and multiprocessing.

I want to download thousands of files from S3. To speed things up, I tried Python's multiprocessing.Pool, but performance is very unreliable. Sometimes it works and is much faster than the single-core version, but often some files take several seconds each, so the multiprocessing run ends up slower than a single process. Several times I even got ssl.SSLError: The read operation timed out.

What could be the reason?

```python
from time import time
from boto.s3.connection import S3Connection
from boto.s3.key import Key
from multiprocessing import Pool
import pickle

access_key = 'xxx'
secret_key = 'xxx'
bucket_name = 'xxx'

path_list = pickle.load(open('filelist.pickle', 'r'))
conn = S3Connection(access_key, secret_key)
bucket = conn.get_bucket(bucket_name)
pool = Pool(32)

def read_file_from_s3(path):
    starttime = time()
    k = Key(bucket)
    k.key = path
    content = k.get_contents_as_string()
    print int((time() - starttime) * 1000)
    return content

results = pool.map(read_file_from_s3, path_list)
# or: results = map(read_file_from_s3, path_list) for a single-process comparison
pool.close()
pool.join()
```

[Update] I ended up adding timeouts with retries (imap + .next(timeout)) to my multiprocessing code, but only because I did not want to change too much right now. If you want to do it properly, use Jan Philip's gevent-based approach.
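For what it's worth, a minimal sketch of that timeout-plus-retry pattern (hypothetical names; `read_file` stands in for the real boto download from the question):

```python
from multiprocessing import Pool, TimeoutError

def read_file(path):
    # Stand-in for the real S3 fetch (k.get_contents_as_string())
    return 'contents of ' + path

def map_with_timeout(pool, func, items, timeout=5.0, retries=3):
    """Like pool.map(), but wait at most `timeout` seconds per result,
    re-waiting up to `retries` times before giving up."""
    it = pool.imap(func, items)
    results = []
    for _ in items:
        for attempt in range(retries):
            try:
                # IMapIterator.next() accepts an optional timeout in seconds
                results.append(it.next(timeout))
                break
            except TimeoutError:
                continue  # the task is still running; wait another round
        else:
            raise TimeoutError('no result after %d waits' % retries)
    return results

if __name__ == '__main__':
    pool = Pool(4)
    print(map_with_timeout(pool, read_file, ['a.txt', 'b.txt']))
    pool.close()
    pool.join()
```

Note that the timeout here only stops waiting; it does not cancel or resubmit the underlying task, which keeps running in the worker process.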

1 answer

"What could be causing this?"

Not enough details. One reason may be that your private Internet connection is being starved by too many concurrent connections. But since you did not say in which environment you run this piece of code, this is pure speculation.

However, there is no question that your approach to this problem is quite inefficient. multiprocessing is designed for solving CPU-bound problems. Retrieving data over multiple TCP connections is not CPU-bound. Spawning one process per TCP connection is a waste of resources.

The reason this seems slow is that in your case each process spends most of its time waiting for blocking system calls to return (while the operating system, in turn, waits for the network stack to do what it was told, and the network stack waits for packets to arrive over the wire).

You do not need several processes to make your machine spend less time waiting. You do not even need multiple threads. You can pull data from many TCP connections within a single OS-level thread using cooperative scheduling. In Python, this is often done with greenlets. A higher-level module built on greenlets is gevent.

The web is full of gevent-based examples for firing off many HTTP requests concurrently. Given a proper Internet connection, a single OS-level thread can handle hundreds, thousands, or tens of thousands of concurrent connections. At those orders of magnitude, the problem then becomes either I/O-bound or CPU-bound, depending on the exact purpose of your application. That is, either the network connection or the CPU is what limits your application.
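As a rough sketch of what that looks like (assuming gevent is installed; `read_file_from_s3` here is a stand-in for the boto download, and in real code each greenlet would create or reuse its own bucket connection):

```python
import gevent.monkey
gevent.monkey.patch_all()  # make socket/ssl cooperative, so blocking I/O yields

import gevent.pool

def read_file_from_s3(path):
    # Stand-in for the boto fetch from the question:
    #   k = Key(bucket); k.key = path; return k.get_contents_as_string()
    return 'contents of ' + path

# 100 greenlets = up to 100 concurrent fetches, all in one OS-level thread
pool = gevent.pool.Pool(100)
results = pool.map(read_file_from_s3, ['a.txt', 'b.txt', 'c.txt'])
print(len(results))
```

The monkey-patching step is what makes this work with a library like boto that uses blocking sockets: each greenlet yields to the others whenever it would otherwise block on the network.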

Regarding errors like ssl.SSLError: The read operation timed out: in the world of networking, you have to expect such things to happen from time to time and decide (depending on the details of your application) how you want to handle these situations. Often, a simple retry is a good solution.
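A minimal retry helper, as one way to do that (hypothetical names; the callable would wrap the boto fetch):

```python
import ssl
from time import sleep

def with_retries(func, attempts=3, backoff=1.0):
    """Call func(); on a transient SSL error, wait briefly and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except ssl.SSLError:
            if attempt == attempts - 1:
                raise  # out of attempts, propagate the error
            sleep(backoff * (attempt + 1))  # simple linear backoff
```

In the question's code this could wrap the download itself, e.g. `content = with_retries(lambda: k.get_contents_as_string())`.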


Source: https://habr.com/ru/post/1235229/

