My problem is this. I have one source file with 3000 rows and 8 space-delimited columns. Importantly, the first column is a number from 1 to 22. So, applying divide-and-conquer, I split the source file into 22 subfiles based on the value of the first column.
I also have 15 studies, and each study produces one result file. Since the result files are too large, I applied divide-and-conquer again and split each of the 15 result files into 22 subfiles.
The file structure is as follows:

Original_file:
    split_1
    split_2
    split_3
    ...
    split_22

Studies:
    study1
        split_1, split_2, ..., split_22
    study2
        split_1, split_2, ..., split_22
    ...
    study15
        split_1, split_2, ..., split_22
This way we pay a small up-front cost, but all these split files will be deleted at the end, so it doesn't really matter.
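For reference, the initial split by the first column can be sketched like this (a minimal sketch: the sample data, directory, and file names here are placeholders, not the real files):

```python
import os
import tempfile

# Placeholder working directory and a tiny sample source file;
# the real source has 3000 rows and 8 space-delimited columns.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "source.txt")
with open(source, "w") as f:
    f.write("1 rs10 a b\n2 rs20 c d\n1 rs11 e f\n")

# One output handle per distinct value of the first column (1..22).
handles = {}
with open(source) as src:
    for line in src:
        key = line.split()[0]  # first column decides the subfile
        if key not in handles:
            handles[key] = open(os.path.join(workdir, "split_%s.txt" % key), "w")
        handles[key].write(line)
for h in handles.values():
    h.close()

print(sorted(os.listdir(workdir)))
```

This streams the source once and never holds more than one row in memory, so the split itself stays cheap.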
Algorithm:

for i in range(1, 23):
    for j in range(1, 16):
        compare split_i of the original file with split_i of study j
        if the value in a specific column matches:
            build a list of the needed columns from both rows,
            join it with ' '.join(list), and write the result to the outfile
Python:

import os

folders = ["study" + str(n) for n in range(1, 16)]  # study1 ... study15

with open("Effects_final.txt", "w") as outfile:
    for i in range(1, 23):
        small_file = "split_" + str(i) + ".txt"
        with open(small_file, "r") as sf:
            for sline in sf:
                sf_parts = sline.split()  # split() also strips the trailing newline
                for folder in folders:
                    file_to_compare_with = os.path.join(folder, "split_" + str(i) + ".txt")
                    with open(file_to_compare_with, "r") as cf:
                        for cline in cf:
                            cf_parts = cline.split()
                            if cf_parts[0] == sf_parts[1]:
                                outfile.write(" ".join(cf_parts + sf_parts) + "\n")
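Note that the inner loop above re-reads every study split once per row of the source split, which makes the whole thing quadratic. Indexing each study split into a dictionary first means every file is read exactly once. A sketch under the same layout assumptions (the directory, file names, and sample rows here are placeholders):

```python
import os
import tempfile

# Placeholder layout mirroring the structure above: one source split
# and one study folder containing its matching split.
workdir = tempfile.mkdtemp()
os.mkdir(os.path.join(workdir, "study1"))
with open(os.path.join(workdir, "split_1.txt"), "w") as f:
    f.write("x rs10 0.5\ny rs99 0.7\n")
with open(os.path.join(workdir, "study1", "split_1.txt"), "w") as f:
    f.write("rs10 eff1\nrs55 eff2\n")

out_lines = []
for folder in ["study1"]:
    # Read the study split once, keyed by its first column.
    index = {}
    with open(os.path.join(workdir, folder, "split_1.txt")) as cf:
        for cline in cf:
            cf_parts = cline.split()
            index[cf_parts[0]] = cf_parts
    # Then stream the source split and look up matches in O(1).
    with open(os.path.join(workdir, "split_1.txt")) as sf:
        for sline in sf:
            sf_parts = sline.split()
            match = index.get(sf_parts[1])  # same columns as the loop above
            if match:
                out_lines.append(" ".join(match + sf_parts))

print(out_lines)
```

With this shape, each (source split, study split) pair costs one pass over each file instead of one pass over the study split per source row, which is usually the difference between minutes and days at these sizes.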