What is the fastest way to concatenate multiple columns of files (in Python)?
Suppose I have two files with 1,000,000,000 lines and ~200 UTF-8 characters per line.
Method 1: Cheating with paste
I could merge the two files on a Linux system using paste in the shell, and I could cheat using os.system, i.e.:
import os

def concat_files_cheat(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        # Hand the heavy lifting to the shell's paste command.
        os.system('paste ' + file1 + ' ' + file2 + ' > ' + output)
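As an aside, a variant of the same cheat that avoids splicing strings into a shell command (a sketch using subprocess, which I assume is acceptable here):

import os
import subprocess

def concat_files_cheat_subprocess(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        # No shell involved: paste writes directly to the output file,
        # and odd characters in the paths cannot break the command.
        with open(output, 'wb') as fout:
            subprocess.run(['paste', file1, file2], stdout=fout, check=True)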
Method 2: Use nested context managers with zip:
def concat_files_zip(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    with open(output, 'wb') as fout:
        with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
            for line1, line2 in zip(fin1, fin2):
                # Strip line1's trailing newline so the tab-joined pair
                # stays on one line; bytes literals match the binary mode.
                fout.write(line1.rstrip(b'\n') + b'\t' + line2)
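A minor variant of the inner loop (a sketch; whether it is actually faster would need measuring) avoids building an intermediate bytes object per line and relies on the file object's own buffering:

# Hypothetical variant: write the three pieces separately instead of
# concatenating them into a new bytes object for every line.
for line1, line2 in zip(fin1, fin2):
    fout.write(line1.rstrip(b'\n'))
    fout.write(b'\t')
    fout.write(line2)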
Method 3: Use fileinput
Does fileinput iterate through the files in parallel, or does it loop through each file one after another?
If it is the former, I would expect it to look like this:
import fileinput

def concat_files_fileinput(file_path, file1, file2, output_path, output):
    # process() is a hypothetical helper that would split a combined line
    # back into its two halves; fout is assumed to be an open output file.
    with fileinput.input(files=(file1, file2)) as f:
        for line in f:
            line1, line2 = process(line)
            fout.write(line1 + '\t' + line2)
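A quick toy check (throwaway files, made up for illustration) would settle whether the lines come interleaved or chained:

import fileinput

# Two tiny throwaway files to probe fileinput's iteration order.
with open('x.txt', 'w') as f:
    f.write('x1\nx2\n')
with open('y.txt', 'w') as f:
    f.write('y1\ny2\n')

with fileinput.input(files=('x.txt', 'y.txt')) as f:
    print([line.strip() for line in f])
# Prints ['x1', 'x2', 'y1', 'y2']: the files are chained, not zipped.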
Method 4: Treat them like csv
import csv

with open(output, 'w', newline='') as fout:
    with open(file1, 'r', newline='') as fin1, open(file2, 'r', newline='') as fin2:
        # csv in Python 3 wants text mode; delimiter='\t' makes the output tab-separated.
        writer = csv.writer(fout, delimiter='\t')
        reader1, reader2 = csv.reader(fin1), csv.reader(fin2)
        for line1, line2 in zip(reader1, reader2):
            # Each reader yields a list of fields per line; list
            # concatenation builds the combined row for the writer.
            writer.writerow(line1 + line2)
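For what it's worth, a small illustration (made-up sample data) of why writerow receives lists here: csv.reader yields one list of fields per line, and + on two such lists concatenates them into a single combined row:

import csv
import io

# Each iteration yields the fields of one line as a list.
sample = io.StringIO('a,b\nc,d\n')
for row in csv.reader(sample):
    print(row)  # ['a', 'b'], then ['c', 'd']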
Of these methods, which would be the fastest? Are the approaches above even equivalent, and are there better alternatives I am missing? In Methods 2, 3, and 4, do I need to strip the trailing newlines before joining the lines with \t? And which approach is the most memory-efficient for files this large?
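For comparing whichever approaches come up, I imagine a minimal timing harness along these lines (hypothetical paths and file names; assumes the functions above are defined):

import os
import time

# Hypothetical arguments for a benchmark run.
ARGS = ('/data', 'a.txt', 'b.txt', '/data', 'out.txt')

for fn in (concat_files_cheat, concat_files_zip):
    out = os.path.join(ARGS[3], ARGS[4])
    if os.path.exists(out):
        os.remove(out)  # start each run from a clean slate
    start = time.time()
    fn(*ARGS)
    print(fn.__name__, '%.1f seconds' % (time.time() - start))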