I have a Python script that needs to process a large number of files. To get around Linux's relatively small limit on the length of the argument list that can be passed to a command, I use find -print0 with xargs -0.
I know another option would be the Python glob module, but that won't help when I need a more advanced find, e.g. one filtering on modification times.
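(For illustration: something like os.walk plus os.path.getmtime could emulate a modification-time filter. The sketch below, with a hypothetical find_recent helper, is roughly what I'd have to write, and it's already more code than the find one-liner.)

    import os
    import time

    def find_recent(root, max_age_seconds):
        """Yield paths under root modified within the last max_age_seconds."""
        cutoff = time.time() - max_age_seconds
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if os.path.getmtime(path) >= cutoff:
                        yield path
                except OSError:
                    pass  # file vanished or is unreadable; skip it

    # Roughly equivalent to: find ~/ -type f -mtime -1
    for path in find_recent(os.path.expanduser('~'), 24 * 60 * 60):
        print path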
When running my script on a large number of files, Python only receives a subset of the arguments. At first I thought the limitation was in argparse, but it seems to be in sys.argv. I can't find any documentation on this. Is this a bug?
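For reference, the OS-level limit on argument lists can be probed from Python (a quick check; os.sysconf with 'SC_ARG_MAX' should work on POSIX systems, though I'm not sure this limit is what's at play here):

    import os
    # Total bytes the kernel allows for argv plus the environment on exec().
    print os.sysconf('SC_ARG_MAX')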
Here's an example Python script illustrating the point:
    import argparse
    import sys
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument('input_files', nargs='+')
    args = parser.parse_args(sys.argv[1:])
    # Report this process's pid, how many file arguments reached sys.argv,
    # and how many of them argparse collected.
    print 'pid:', os.getpid(), 'argv files', len(sys.argv[1:]), 'argparse files:', len(args.input_files)
And I have many files to run it on:
    $ find ~/ -name "*" -print0 | xargs -0 ls > filelist
    $ wc -l filelist
    748709 filelist
But it looks like xargs, or Python, chops up my big list of files and processes it across several Python invocations:
    $ find ~/ -name "*" -print0 | xargs -0 python test.py
    pid: 4216 argv files 1819 argparse files: 1819
    pid: 4217 argv files 1845 argparse files: 1845
    pid: 4218 argv files 1845 argparse files: 1845
    pid: 4219 argv files 1845 argparse files: 1845
    pid: 4220 argv files 1845 argparse files: 1845
    pid: 4221 argv files 1845 argparse files: 1845
    ...
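To test whether the chunks correspond to a byte limit rather than a fixed file count, one could add a line like this to test.py (my own hypothetical check; each argument occupies its length plus a terminating NUL on the exec call):

    import sys
    # Approximate size of this invocation's argument list in bytes.
    print 'argv bytes:', sum(len(arg) + 1 for arg in sys.argv[1:])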
Why are multiple processes created to process the list? Why is it being chunked at all? I don't think there are newlines in the file names, and shouldn't -print0 and -0 take care of that anyway? If there were newlines, I'd expect sed -n '1810,1830p' filelist to show some weirdness for the example above. What gives?
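The only workaround I can think of is to drop xargs and read the NUL-delimited list from stdin in a single process, along these lines (a sketch with a hypothetical workaround.py; it avoids argv but doesn't explain the behaviour above):

    # Usage: find ~/ -name "*" -print0 | python workaround.py
    import sys

    data = sys.stdin.read()
    # -print0 terminates every name with a NUL, so drop the empty tail.
    input_files = data.split('\0')[:-1]
    print 'files received on stdin:', len(input_files)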
I almost forgot:
    $ python -V
    Python 2.7.2+