Python script to merge all the files in a directory into a single file

I wrote the following script to merge all the files in a directory into a single file.

Can this be optimized in terms of

  • idiomatic Python

  • time

Here is a snippet:

    import time, glob

    outfilename = 'all_' + str((int(time.time()))) + ".txt"

    filenames = glob.glob('*.txt')

    with open(outfilename, 'wb') as outfile:
        for fname in filenames:
            with open(fname, 'r') as readfile:
                infile = readfile.read()
                for line in infile:
                    outfile.write(line)
                outfile.write("\n\n")
+11
6 answers

Use shutil.copyfileobj to copy data:

    import shutil

    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

shutil.copyfileobj reads from the readfile object in chunks and writes them directly to the outfile file object. Do not use readline() or iterate over the file line by line, as you do not need the overhead of finding line endings.

Use the same mode for reading and writing; this is especially important when using Python 3. I used binary mode for both.
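To make the chunked copy concrete, here is a rough hand-written equivalent of the loop above. It is only a sketch: the helper name `merge_files` and the 64 KiB default chunk size are assumptions made for this example, not a description of what shutil does internally.

```python
import glob
import os


def merge_files(pattern, outfilename, chunk_size=64 * 1024):
    """Concatenate files matching pattern into outfilename, chunk by chunk.

    merge_files and chunk_size are illustrative names, not part of shutil.
    """
    with open(outfilename, 'wb') as outfile:
        # sorted() only makes the merge order deterministic
        for filename in sorted(glob.glob(pattern)):
            if os.path.abspath(filename) == os.path.abspath(outfilename):
                # don't copy the output into itself
                continue
            with open(filename, 'rb') as readfile:
                while True:
                    chunk = readfile.read(chunk_size)
                    if not chunk:  # empty bytes object means end of file
                        break
                    outfile.write(chunk)
```

Both files are opened in binary mode, so the bytes are copied unchanged and no line endings are ever searched for.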

+31

Using Python 2.7, I ran some timing tests comparing

 outfile.write(infile.read()) 

vs

 shutil.copyfileobj(readfile, outfile) 

I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a total size of about 2.6 GB. With both methods, normal read mode performed better than binary read mode, and shutil.copyfileobj was usually faster than outfile.write.

When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant:

    outfile.write, binary mode: 43 seconds, on average.
    shutil.copyfileobj, normal mode: 27 seconds, on average.

The output file had a final size of 2620 MB in normal read mode and 2578 MB in binary read mode.

+2

There is no need for so many variables.

    with open(outfilename, 'w') as outfile:
        for fname in filenames:
            with open(fname, 'r') as readfile:
                outfile.write(readfile.read() + "\n\n")
+1

The fileinput module provides a natural way to iterate over multiple files.

    for line in fileinput.input(glob.glob("*.txt")):
        outfile.write(line)
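The snippet above leaves out the imports and the opening of the output file. A fuller sketch might look like the following; the function name `merge_with_fileinput` is made up for this example, and the code assumes Python 3, where `fileinput.input` can be used as a context manager.

```python
import fileinput
import glob


def merge_with_fileinput(pattern, outfilename):
    """Write every line of every file matching pattern to outfilename.

    merge_with_fileinput is a hypothetical helper name for this sketch.
    """
    # Build the list before creating the output file, and drop the
    # output file itself in case it is left over from an earlier run.
    filenames = sorted(f for f in glob.glob(pattern) if f != outfilename)
    if not filenames:
        # an empty list would make fileinput fall back to reading stdin
        return
    with open(outfilename, 'w') as outfile, fileinput.input(filenames) as lines:
        for line in lines:
            outfile.write(line)
```

This works line by line, so it keeps memory use low but pays the cost of finding line endings.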
+1

I was curious about performance, so I tested the approaches from the answers by Martin Peters and Stephen Miller.

I tried binary and text modes, with and without shutil, merging 270 files.

Text Mode -

    def using_shutil_text(outfilename):
        with open(outfilename, 'w') as outfile:
            for filename in glob.glob('*.txt'):
                if filename == outfilename:
                    # don't want to copy the output into the output
                    continue
                with open(filename, 'r') as readfile:
                    shutil.copyfileobj(readfile, outfile)

    def without_shutil_text(outfilename):
        with open(outfilename, 'w') as outfile:
            for filename in glob.glob('*.txt'):
                if filename == outfilename:
                    # don't want to copy the output into the output
                    continue
                with open(filename, 'r') as readfile:
                    outfile.write(readfile.read())

Binary mode -

    # renamed from *_text so these do not shadow the text-mode versions
    def using_shutil_binary(outfilename):
        with open(outfilename, 'wb') as outfile:
            for filename in glob.glob('*.txt'):
                if filename == outfilename:
                    # don't want to copy the output into the output
                    continue
                with open(filename, 'rb') as readfile:
                    shutil.copyfileobj(readfile, outfile)

    def without_shutil_binary(outfilename):
        with open(outfilename, 'wb') as outfile:
            for filename in glob.glob('*.txt'):
                if filename == outfilename:
                    # don't want to copy the output into the output
                    continue
                with open(filename, 'rb') as readfile:
                    outfile.write(readfile.read())

Binary Time -

    Shutil - 20.161773920059204
    Normal - 17.327500820159912

Text Time -

    Shutil - 20.47757601737976
    Normal - 13.718038082122803

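It appears that shutil performs about the same in both modes, while the plain read()/write() approach is faster in text mode than in binary mode. In this test, the plain approach was also faster than shutil in both modes.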

OS: Mac OS 10.14 Mojave. Macbook Air 2017.
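The timing code itself is not shown in this answer. One possible way to take such measurements is sketched below; the helper name `time_call` and the choice to keep the best of several runs are assumptions of this example, not the method actually used for the numbers above.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over several runs of func(*args).

    time_call is a hypothetical helper; repeats=3 is an arbitrary default.
    """
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, `time_call(using_shutil_text, 'all.txt')` would time the text-mode shutil variant. Repeating the run helps somewhat, because the operating system's file cache can make the first run slower than later ones.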

+1

You can iterate directly over the lines of a file object without reading the whole file into memory:

    with open(fname, 'r') as readfile:
        for line in readfile:
            outfile.write(line)
0

Source: https://habr.com/ru/post/949835/
