The fastest way to concatenate multiple columns of files - Python

What is the fastest way to concatenate multiple columns of files (in Python)?

Suppose I have two files, each with 1,000,000,000 lines and ~200 UTF-8 characters per line.

Method 1: Cheating with paste

I could merge the two files with paste in the Linux shell, and I could cheat from Python using os.system, i.e.:

import os

def concat_files_cheat(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        os.system('paste ' + file1 + ' ' + file2 + ' > ' + output)
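A safer variant of the same cheat (a sketch, assuming a system with paste available; the function name concat_files_paste is mine) avoids building a shell string, so unusual filenames cannot break or hijack the command:

```python
import os
import subprocess

def concat_files_paste(file1, file2, output):
    # paste receives its arguments as a list, so no shell is involved
    # and spaces or quotes in the filenames are handled safely
    if not os.path.exists(output):
        with open(output, 'wb') as fout:
            subprocess.check_call(['paste', file1, file2], stdout=fout)
```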

Method 2: Use nested context managers with zip:

def concat_files_zip(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    with open(output, 'wb') as fout:
        with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
            for line1, line2 in zip(fin1, fin2):
                # line1 keeps its trailing newline, so strip it first
                fout.write(line1.rstrip('\n') + '\t' + line2)

Method 3: Use fileinput

Does fileinput iterate through the files in parallel, or does it loop through each file one after the other?

If it is the former, I would expect it to look something like this:

def concat_files_fileinput(file_path, file1, file2, output_path, output):
    with fileinput.input(files=(file1, file2)) as f:
        for line in f:
            line1, line2 = process(line)
            fout.write(line1 + '\t' + line2)
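To answer that directly: fileinput chains the files one after another, like cat, so it cannot pair lines up and Method 3 will not work as sketched. A small demonstration:

```python
import fileinput
import os
import tempfile

# two tiny files to show the iteration order
tmp = tempfile.mkdtemp()
paths = []
for name, text in [('a.txt', 'a1\na2\n'), ('b.txt', 'b1\nb2\n')]:
    p = os.path.join(tmp, name)
    with open(p, 'w') as f:
        f.write(text)
    paths.append(p)

with fileinput.input(files=paths) as f:
    lines = [line.strip() for line in f]

print(lines)  # all of file 1, then all of file 2: ['a1', 'a2', 'b1', 'b2']
```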

Method 4: Treat them like csv

with open(output, 'wb') as fout:
    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
        writer = csv.writer(fout)
        reader1, reader2 = csv.reader(fin1), csv.reader(fin2)
        for line1, line2 in zip(reader1, reader2):
            writer.writerow(line1 + '\t' + line2)

Given files of this size, which of these methods is the fastest, and is there a better approach I am missing? Does the choice of \t as the delimiter matter, or would another separator be safer?

You can replace the for loop with writelines, passing it a generator expression, and in Python 2 use izip from itertools instead of zip so the lines are paired lazily. Otherwise paste is still a very good option:

from itertools import izip

with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
    fout.writelines(b"{}\t{}".format(*line) for line in izip(fin1, fin2))

To supply the \t as part of the zipped data instead, use repeat from itertools:

    fout.writelines(b"{}{}{}".format(*line) for line in izip(fin1, repeat(b'\t'), fin2))

Or do the pairing manually, without izip:

with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
    fout.writelines(b"{}\t{}".format(line, next(fin2)) for line in fin1)
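For Python 3, where zip is already lazy and bytes objects have no .format method, a rough equivalent (also stripping the newline that every line of the first file carries) would be:

```python
def concat_writelines(file1, file2, output):
    # zip is an iterator in Python 3, so nothing is materialized in memory
    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, \
         open(output, 'wb') as fout:
        fout.writelines(l1.rstrip(b'\n') + b'\t' + l2
                        for l1, l2 in zip(fin1, fin2))
```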

It can be done noticeably faster: about 0.002 s in my test (files of 1M lines, read in blocks of roughly 1K lines), versus over 5 s for the original zip version.

The ideas:

  • Work lazily: in Python 2, replace zip with itertools.izip so the line pairs are not materialized as one huge list.
  • Build each output line in a single step with "{}\t{}".format() (or the equivalent "%s\t%s" interpolation).
  • Emit the output with one write of a join-ed string instead of millions of tiny write calls.
  • Read the input in blocks rather than line by line (e.g. f1.readlines(1024 * 1000)) to reduce the number of I/O operations.

The resulting code:

from itertools import izip

def concat_iter(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
        open(file1, 'r') as f1, \
        open(file2, 'r') as f2:
        fo.write("".join("{}\t{}".format(l1, l2) 
           for l1, l2 in izip(f1.readlines(1024), 
                              f2.readlines(1024))))
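One caveat with the code above (my reading, not stated in the answer): readlines(hint) returns only about hint bytes worth of lines per call, so a single call covers just the first block of each file. To process whole files with the same blocked-read idea while keeping the two files aligned line for line, one could swap in itertools.islice, e.g.:

```python
from itertools import islice

def concat_blocks(file1, file2, output, block_lines=1000):
    # take a fixed number of lines from each file per iteration so the
    # two blocks always stay aligned, and write each block in one call
    with open(output, 'w') as fo, open(file1) as f1, open(file2) as f2:
        while True:
            block1 = list(islice(f1, block_lines))
            block2 = list(islice(f2, block_lines))
            if not block1 or not block2:
                break
            fo.write("".join("{}\t{}".format(l1.rstrip('\n'), l2)
                             for l1, l2 in zip(block1, block2)))
```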

Profiling confirms this. In the original zip solution, the time is dominated by zip itself (building the full list of pairs in memory) and by the ten million individual write calls:

~/personal/python-algorithms/files$ python -m cProfile sol_original.py 
10000006 function calls in 5.208 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    5.208    5.208 sol_original.py:1(<module>)
    1    2.422    2.422    5.208    5.208 sol_original.py:1(concat_files_zip)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    9999999    1.713    0.000    1.713    0.000 {method 'write' of 'file' objects}
    3    0.000    0.000    0.000    0.000 {open}
    1    1.072    1.072    1.072    1.072 {zip}

Profiler output for the improved version:

~/personal/python-algorithms/files$ python -m cProfile sol1.py 
     3731 function calls in 0.002 seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    0.002    0.002 sol1.py:1(<module>)
    1    0.000    0.000    0.002    0.002 sol1.py:3(concat_iter6)
 1861    0.001    0.000    0.001    0.000 sol1.py:5(<genexpr>)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 1860    0.001    0.000    0.001    0.000 {method 'format' of 'str' objects}
    1    0.000    0.000    0.002    0.002 {method 'join' of 'str' objects}
    2    0.000    0.000    0.000    0.000 {method 'readlines' of 'file' objects}
    1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
    3    0.000    0.000    0.000    0.000 {open}

Ported to Python 3.5, where zip is already lazy, it is faster still:

~/personal/python-algorithms/files$ python3.5 -m cProfile sol2.py 
843 function calls (842 primitive calls) in 0.001 seconds
[...]

Memory usage and timing with /usr/bin/time show the same gap:

$ /usr/bin/time -v python sol1.py
Command being timed: "python sol1.py"
User time (seconds): 0.01
[...]
Maximum resident set size (kbytes): 7120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 914
[...]
File system outputs: 40
Socket messages sent: 0
Socket messages received: 0


$ /usr/bin/time -v python sol_original.py 
Command being timed: "python sol_original.py"
User time (seconds): 5.64
[...]
Maximum resident set size (kbytes): 1752852
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 427697
[...]
File system inputs: 0
File system outputs: 327696
You can also benchmark this with the timeit module; see its documentation for details.

In Jupyter, the %%timeit cell magic does the same job: put %%timeit at the top of a cell that calls func(data) and it reports the timing over repeated runs automatically.
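A minimal script-level sketch (concat_stub here is just a stand-in for whichever concat variant you are measuring):

```python
import timeit

def concat_stub():
    # stand-in for e.g. concat_files_zip(...); replace with the real call
    "\t".join(("left column", "right column"))

# for functions doing real file I/O, keep number small, e.g. number=1
elapsed = timeit.timeit(concat_stub, number=1000)
print("1000 runs took {:.4f} s".format(elapsed))
```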

Method 1 with paste will be the fastest, since the heavy lifting happens outside Python entirely.

If you want to cheat, you might also consider writing your own C extension for Python; depending on your coding skills, it can be even faster.

I am afraid that method 4 will not work, since you are concatenating lists with a string. I would go with writer.writerow(line1 + line2). You can use the delimiter parameter of both csv.reader and csv.writer to configure the separator (see https://docs.python.org/2/library/csv.html).
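Folding those fixes in, a working sketch of Method 4 (Python 3 text mode with newline='' as the csv module expects; delimiter='\t' on the writer, and writerow(row1 + row2) to concatenate the two field lists) could look like:

```python
import csv

def concat_files_csv(file1, file2, output):
    with open(output, 'w', newline='') as fout, \
         open(file1, newline='') as fin1, \
         open(file2, newline='') as fin2:
        writer = csv.writer(fout, delimiter='\t')
        for row1, row2 in zip(csv.reader(fin1), csv.reader(fin2)):
            # concatenating the two lists of fields; the writer adds the tabs
            writer.writerow(row1 + row2)
```

Note that csv.writer terminates rows with '\r\n' by default; pass lineterminator='\n' if you want plain newlines.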


Source: https://habr.com/ru/post/1656897/

