The most efficient way to compare multiple files in Python

My problem is this. I have one file with 3000 rows and 8 columns (space-delimited). The important point is that the first column is a number from 1 to 22. Following a divide-and-conquer approach, I split the original file into 22 subfiles according to the value of the first column.

I also have some result files: 15 sources, each with one result file. Each result file is too large, so I applied divide and conquer again and split each of the 15 results into 22 subfiles.
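A minimal sketch of the splitting step described above, assuming whitespace-separated rows and output names of the form `split_<value>.txt`. The function name `split_by_first_column` and its parameters are hypothetical, chosen only for illustration.

```python
import os


def split_by_first_column(src_path, out_dir, prefix="split_"):
    """Write each row of src_path to <out_dir>/<prefix><first column>.txt.

    Returns the sorted list of first-column values that were seen.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}  # first-column value -> open output file
    try:
        with open(src_path, "r") as src:
            for line in src:
                parts = line.split()
                if not parts:
                    continue  # ignore blank lines
                key = parts[0]
                if key not in handles:
                    out_path = os.path.join(out_dir, prefix + key + ".txt")
                    handles[key] = open(out_path, "w")
                # keep the row unchanged, but make sure it ends with a newline
                handles[key].write(line if line.endswith("\n") else line + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

With at most 22 distinct values in the first column, keeping one output handle open per value stays well within normal file-descriptor limits, and the source file is read only once.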

The file structure is as follows:

Original_file                Studies
    split_1                      study1
                                     split_1, split_2, ...
    split_2                      study2
                                     split_1, split_2, ...
    split_3                      ...
    ...                          study15
                                     split_1, split_2, ...
    split_22

Doing this costs a little up front, and all of these split files are deleted at the end, so it doesn't really matter.

Algorithm:
    for i in range(1, 23):        # splits 1..22
        for j in range(1, 16):    # studies 1..15
            compare (split_i of the original file) with split_i of the jth study
            if the value in a specific column matches:
                build a list of the needed columns from both files, join it with ' '.join(list), and write the result to outfile

? 300 1,5 .

Python:

import os

folders = ['study1', 'study2', ..., 'study15']
with open("Effects_final.txt", "w") as outfile:
    for split_no in range(1, 23):  # renamed from 'chr' to avoid shadowing the built-in
        small_file = "split_" + str(split_no) + ".txt"
        with open(small_file, 'r') as sf:
            for sline in sf:  # rows of the small file
                sf_parts = sline.split()  # split() also drops the trailing newline
                for f in folders:
                    # the comparison file is opened and scanned again for every row
                    file_to_compare_with = os.path.join(f, "split_" + str(split_no) + ".txt")
                    with open(file_to_compare_with, 'r') as cf:  # comparison file
                        for cline in cf:
                            cf_parts = cline.split()
                            if cf_parts[0] == sf_parts[1]:
                                outfile.write(' '.join(cf_parts + sf_parts) + '\n')
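For a rough sense of scale, the nested loops above can be counted in terms of how often a study file is opened and read from start to end. This is only back-of-the-envelope arithmetic using the numbers stated in the question (3000 rows, 15 studies, 22 splits); the variable names are made up for illustration.

```python
ROWS_IN_ORIGINAL = 3000   # rows in the original file (from the question)
STUDIES = 15              # study folders
SPLITS = 22               # split files per study

# Nested version: every row of the original reopens one file per study.
nested_file_reads = ROWS_IN_ORIGINAL * STUDIES

# Lower bound for comparison: each study split is read exactly once.
single_pass_file_reads = SPLITS * STUDIES

print(nested_file_reads)        # 45000
print(single_pass_file_reads)   # 330
```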

4 , , , , . ...

Answer: use a dictionary for the lookups instead of rescanning the files:

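import os

SMALL_KEY = 1  # id column in the original split (sf_parts[1] in the question)
COMP_KEY = 0   # id column in the study files (cf_parts[0] in the question)

with open("output_file", 'w') as outfile:
    for i in range(1, 23):
        split_name = "split_" + str(i) + ".txt"
        dict1 = {}  # use a dictionary to map ids to rows of the initial file
        with open(split_name, 'r') as split:
            next(split, None)  # skip the header (drop this line if there is none)
            for line in split:
                line_list = line.split()
                if line_list:
                    dict1[line_list[SMALL_KEY]] = line_list  # last row wins on repeated ids

        for f in folders:  # the list of study folders from the question
            compare_dict = {}  # rebuilt per study so studies do not overwrite each other
            with open(os.path.join(f, split_name), 'r') as comp:
                next(comp, None)  # skip the header
                for cline in comp:
                    cparts = cline.split()
                    if cparts:
                        compare_dict[cparts[COMP_KEY]] = cparts
            for key in dict1:
                if key in compare_dict:
                    outfile.write(' '.join(compare_dict[key] + dict1[key]) + '\n')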
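The dictionary idea above can also be applied by indexing only the small split and streaming each study file once, so the large files are never held in memory. The sketch below is one possible variant, not the answerer's exact method: the function name `join_on_key` and its parameters are hypothetical, the default key columns (1 and 0) are taken from the question's code, and no header line is skipped. It keeps every row when an id repeats in the small file.

```python
from collections import defaultdict


def join_on_key(small_path, big_paths, out_path, small_key=1, big_key=0):
    """Write 'big row + small row' for every pair of rows whose keys match.

    Returns the number of lines written.
    """
    # Index the small file once; a key may occur on several rows.
    rows_by_key = defaultdict(list)
    with open(small_path, "r") as small:
        for line in small:
            parts = line.split()
            if len(parts) > small_key:
                rows_by_key[parts[small_key]].append(parts)

    written = 0
    with open(out_path, "w") as out:
        for big_path in big_paths:
            # Stream each large file a single time, row by row.
            with open(big_path, "r") as big:
                for line in big:
                    parts = line.split()
                    if len(parts) <= big_key:
                        continue  # blank or too-short row
                    # .get() avoids adding empty entries to the defaultdict
                    for small_parts in rows_by_key.get(parts[big_key], ()):
                        out.write(" ".join(parts + small_parts) + "\n")
                        written += 1
    return written
```

For split number i, this would be called once with the original file's `split_<i>.txt` as `small_path` and the matching split from each of the 15 study folders as `big_paths`; opening the output file in append mode instead would let all 22 calls share a single result file.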

~ 10 . , . , , , !

Source: https://habr.com/ru/post/1660189/

