Using subprocess.Popen for a process with a large output

I have Python code that runs an external application that works fine when the application has a small amount of output, but freezes when there is a lot. My code looks like this:

    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    errcode = p.wait()
    retval = p.stdout.read()
    errmess = p.stderr.read()
    if errcode:
        log.error('cmd failed <%s>: %s' % (errcode, errmess))

The docs have notes that seem to indicate potential problems. Under wait():

Warning: This will deadlock if the child process generates enough output to the stdout or stderr pipe that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.

and under communicate(), I see:

Note: The data read is buffered in memory, so do not use this method if the data size is large or unlimited.

So it is not clear to me which of these I should use when I have a large amount of data; the notes do not say which method to use in that case.

I need to get the return code from the exec, and I need to parse and use both stdout and stderr.

So what is the right way in Python to run an external application that will produce a lot of output?

+25
python subprocess
Jul 24 '09 at 23:08
6 answers

You are doing blocking reads on two pipes; the first has to finish before the second starts. If the application writes a lot to stderr but nothing to stdout, then your process sits waiting for data on stdout that never arrives, while the program you are running sits there waiting for what it wrote to stderr to be read (which it never will be, since you are waiting on stdout).

There are several ways to fix this.

The simplest is not to intercept stderr; leave stderr=None. Errors will be written directly to stderr. You cannot intercept them or display them as part of your own message. For command-line tools, this is often fine. For other applications, it can be a problem.

Another simple approach is to redirect stderr to stdout, so you only have one incoming pipe: set stderr=subprocess.STDOUT. This means you cannot distinguish regular output from error output. This may or may not be acceptable, depending on how the application writes its output.
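
A minimal sketch of that second approach, reading the single merged pipe incrementally so nothing backs up (process_line is a placeholder for whatever parsing you do; cmd and log are the ones from the question):

    import subprocess

    # Merge stderr into stdout so there is only one pipe to drain.
    p = subprocess.Popen(cmd, shell=True,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    for line in p.stdout:      # read incrementally instead of one big read()
        process_line(line)
    errcode = p.wait()
    if errcode:
        log.error('cmd failed <%s>' % errcode)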

The complete, more complex way of handling this is select ( http://docs.python.org/library/select.html ). This lets you read without blocking: you get data whenever data appears on either stdout or stderr. I would only recommend it if it is really necessary. It probably does not work on Windows.

+16
Jul 24 '09 at 23:23

"A lot" of output is subjective, so it is a little hard to make a recommendation. If the amount of output is really large, then you probably do not want to grab it all with a single read() call. You can try writing the output to a file and then pulling the data in incrementally, like this:

    f = open('data.out', 'w')
    p = subprocess.Popen(cmd, shell=True, stdout=f, stderr=subprocess.PIPE)
    errcode = p.wait()
    f.close()
    if errcode:
        errmess = p.stderr.read()
        log.error('cmd failed <%s>: %s' % (errcode, errmess))
    for line in open('data.out'):
        pass  # do something with each line
+6
Jul 24 '09 at 23:18
Glenn Maynard is right in his comment about deadlocks. However, the best way of solving this problem is to create two threads, one for stdout and one for stderr, which read those respective streams until exhausted and do whatever you need with the output.
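
A minimal sketch of that two-thread approach might look like this (drain is a helper name made up here for illustration; cmd and log are the ones from the question, and the string handling is Python 2-style to match it):

    import subprocess
    import threading

    def drain(pipe, chunks):
        # Read the pipe until EOF so the child never blocks on a full OS buffer.
        while True:
            data = pipe.read(4096)
            if not data:
                break
            chunks.append(data)
        pipe.close()

    p = subprocess.Popen(cmd, shell=True,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out_chunks, err_chunks = [], []
    t_out = threading.Thread(target=drain, args=(p.stdout, out_chunks))
    t_err = threading.Thread(target=drain, args=(p.stderr, err_chunks))
    t_out.start(); t_err.start()
    errcode = p.wait()
    t_out.join(); t_err.join()
    if errcode:
        log.error('cmd failed <%s>: %s' % (errcode, ''.join(err_chunks)))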

The suggestion of using temporary files may or may not work for you, depending on the size of the output and whether you need to process the subprocess's output as it is generated.

As Heikki Toivonen suggested, you should look at the communicate method. However, it buffers the subprocess's stdout/stderr in memory and you get those back from the communicate call - this is not ideal for some scenarios. But the source of the communicate method is worth looking at.

There is also an example in a package I maintain, python-gnupg, where the gpg executable is spawned via subprocess to do the heavy lifting, and the Python wrapper spawns threads to read gpg's stdout and stderr and consume them as the data is produced by gpg. You may be able to get some ideas by looking at the source there. The data produced by gpg on both stdout and stderr can be quite large in the general case.

+6
Jul 25 '09 at 19:14

Reading stdout and stderr independently, with very large output (i.e., many megabytes), using select:

    import select
    import subprocess

    proc = subprocess.Popen(cmd, bufsize=8192, shell=False,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    with open(outpath, "wb") as outf:
        dataend = False
        while (proc.returncode is None) or (not dataend):
            proc.poll()
            dataend = False
            ready = select.select([proc.stdout, proc.stderr], [], [], 1.0)
            if proc.stderr in ready[0]:
                data = proc.stderr.read(1024)
                if len(data) > 0:
                    handle_stderr_data(data)
            if proc.stdout in ready[0]:
                data = proc.stdout.read(1024)
                if len(data) == 0:  # a read of zero bytes means EOF
                    dataend = True
                else:
                    outf.write(data)
+5
Dec 02 '16 at 9:49

You could try communicate and see if it solves your problem. If not, I would redirect the output to a temporary file.
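
A minimal sketch of the communicate route, using the names from the question (this is fine as long as the output fits in memory):

    import subprocess

    p = subprocess.Popen(cmd, shell=True,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    retval, errmess = p.communicate()   # drains both pipes, then waits
    errcode = p.returncode
    if errcode:
        log.error('cmd failed <%s>: %s' % (errcode, errmess))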

+2
Jul 24 '09 at 23:24

I had the same problem. If you need to handle a large amount of output, another good option is to use files for stdout and stderr and pass those files as the corresponding parameters.

Check out the tempfile module in Python: https://docs.python.org/2/library/tempfile.html

Something like this might work

 out = tempfile.NamedTemporaryFile(delete=False) 

Then you would do:

 Popen(... stdout=out,...) 

Then you can read the file back, and delete it afterwards.
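
Putting that together, a rough sketch along those lines (reusing cmd and log from the question; the parsing step is left as a placeholder):

    import os
    import subprocess
    import tempfile

    out = tempfile.NamedTemporaryFile(delete=False)
    err = tempfile.NamedTemporaryFile(delete=False)
    try:
        p = subprocess.Popen(cmd, shell=True, stdout=out, stderr=err)
        errcode = p.wait()
        out.close()
        err.close()
        if errcode:
            log.error('cmd failed <%s>: %s' % (errcode, open(err.name).read()))
        for line in open(out.name):
            pass  # parse each line of stdout here
    finally:
        os.unlink(out.name)
        os.unlink(err.name)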

+1
Jul 24 '14 at 20:28


