Faster way of finding strings in a large file using Python

I have two text files to process. Here is my situation:

  • These two files are extremely large: one is 1.21 GB and the other 1.1 GB. Each contains about 30 million lines of Chinese text.
  • Each line in each file is unique.
  • I do not need to modify these files; once downloaded, they will not change.

The fact is that one of these files is damaged. Let me call it N5. Every line of N5 should look like this: 'a5 b5 c5 d5 e5\tf5'

Instead, its lines look like this: 'a5b5 c5 d5 e5\tf5'

I am trying to restore it from the other file, which I will call N4. Its lines look like this: 'a4 b4 c4 d4\tf4'

I am trying to use N4 to split the fused 'a5b5' in N5, which can have three outcomes (see the sketch after this list):

  • 'a4 b4 c4 d4' is equal to 'a5 b5 c5 d5'
  • 'a4 b4 c4 d4' is equal to 'b5 c5 d5 e5'
  • There is no match in N4 for the N5 line.
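
A minimal sketch of the two successful cases, with made-up tokens, just to show the string mechanics:

# case 1: a matching N4 line supplies the first two words of the fused token
fused = 'a5b5'
a4, b4 = 'a5', 'b5'              # first two words of the N4 line
assert a4 + b4 == fused          # restore as 'a5 b5'

# case 2: the N4 line's first word sits at the end of the fused token
fused = 'b5c5'
a4 = 'c5'                        # first word of the N4 line
idx = fused.find(a4)
print(fused[:idx], fused[idx:])  # -> b5 c5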

In cases 1 and 2, I get an answer. In case 3, however, searching all of N4 for a single line takes about 140 seconds.

Right now I store N4 and N5 in lists, and below is the code I use to compare them.

# test data
N4 = ['a1 b1 c1 e1\t3', 'a2 b2 c2 e2\t2', 'c3 e3 f3 g3\t3']
N5 = ['a1b1 c1 e1 f1\t2', 'a2b c2 e2 f2\t1', 'b3c3 e3 f3 g3\t3']

# result storage
list_result = []
list_result_no_none = []

counter_none = 0

list_len = len(N4)

for item_n5 in N5:
    counter_list_len = 0
    list_str_2 = item_n5.split(' ')
    list_str_2_2 = list_str_2[3].split('\t')
    str_list_str_2_0 = list_str_2[0]
    for item_n4 in N4:
        list_str_1 = item_n4.split(' ')
        list_str_1_2 = list_str_1[3].split('\t')

        # case 1: the N4 words match the start of the N5 line
        if (list_str_1[0] + list_str_1[1] == list_str_2[0] and
                list_str_1[2] == list_str_2[1] and
                list_str_1_2[0] == list_str_2[2] and
                list_str_1_2[1] >= list_str_2_2[1]):
            restored = (list_str_1[0] + ' ' + list_str_1[1] + ' ' + list_str_1[2]
                        + ' ' + list_str_1_2[0] + ' ' + list_str_2[3])
            list_result.append(restored)
            list_result_no_none.append(restored)
            break

        # case 2: the N4 words match the end of the N5 line
        elif (list_str_1[0] in list_str_2[0] and
                list_str_1[1] == list_str_2[1] and
                list_str_1[2] == list_str_2[2] and
                list_str_1_2[0] == list_str_2_2[0] and
                list_str_1_2[1] >= list_str_2_2[1]):
            idx = list_str_2[0].find(list_str_1[0])
            restored = (str_list_str_2_0[:idx] + ' ' + str_list_str_2_0[idx:]
                        + ' ' + list_str_1[1] + ' ' + list_str_1[2] + ' ' + list_str_2[3])
            list_result.append(restored)
            list_result_no_none.append(restored)
            break

        # no match anywhere in N4
        else:
            counter_list_len += 1
            if counter_list_len == list_len:
                list_result.append('none ' + item_n5)
                counter_none += 1

print(list_result)
print(list_result_no_none)
print("Percentage of not found: %.2f" % ((100*(counter_none/len(N5)))) + '%')

It works on this small scale, but with the real files it takes ages.

I am new to Python and have little experience with other programming languages, so I apologize if my question seems silly. I am also not a native English speaker, so please excuse my poor English.

1 answer

First, a cleaned-up version of your algorithm. Parse each file with a generator so every line is split exactly once, and keep N4 as a list of word lists. The search itself still scans all of N4 for every line of N5, though:

def iter_file(filename):
    # yield each line as a list of words; the final 'word\tcount'
    # token is split on the tab as well
    with open(filename) as inp:
        for line in inp:
            line = line.split(' ')
            yield line[:-1] + line[-1].split('\t')

def do_correction(n4, n5):
    n4 = list(n4)  # materialize the generator: N4 is rescanned for every N5 line

    for words_n5 in n5:
        for words_n4 in n4:

            # case 1: N4's four words match N5's first four (split a5b5 as a4+b4)
            if (words_n4[0]+words_n4[1] == words_n5[0] and
                words_n4[2] == words_n5[1] and
                words_n4[3] == words_n5[2] and
                words_n4[4] >= words_n5[4]):  # N4's count must be >= N5's count
                yield words_n4[:-1] + words_n5[3:]
                break

            # case 2: N4's four words match N5's last four (split a5b5 before a4)
            elif (words_n4[0] in words_n5[0] and
                words_n4[1] == words_n5[1] and
                words_n4[2] == words_n5[2] and
                words_n4[3] == words_n5[3] and
                words_n4[4] >= words_n5[4]):
                idx = words_n5[0].find(words_n4[0])
                yield [words_n5[0][:idx], words_n5[0][idx:]] + words_n5[1:]
                break
        else: # not found
            yield ['none'] + words_n5

with open('corrected', 'w') as output:
    for words in do_correction(iter_file('N4'), iter_file('N5')):
        output.write('%s\t%s' %(' '.join(words[:-1]), words[-1]))
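
Note the for ... else on the inner loop: the else branch runs only when the loop finishes without hitting break, which replaces your manual counter_list_len bookkeeping. A tiny standalone illustration:

for x in ['a', 'b']:
    if x == 'z':
        break
else:
    print('no match')  # printed, because the loop never hit break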

That is still slow, because every N5 line (and especially every miss) scans all of N4. To make it fast, index N4 in a dictionary keyed on the words that must match exactly; each N5 line then needs only two cheap lookups instead of a full scan:

from collections import defaultdict

def iter_file(filename):
    with open(filename) as inp:
        for line in inp:
            line = line.split(' ')
            yield line[:-1] + line[-1].split('\t')

def do_correction(n4, n5):
    # index N4 by its 3rd and 4th words, which must match exactly in both cases
    n4_dict = defaultdict(list)
    for words_n4 in n4:
        n4_dict[words_n4[2], words_n4[3]].append(words_n4)

    for words_n5 in n5:
        # case 1: candidates are N4 lines whose 3rd and 4th words
        # equal N5's 2nd and 3rd words
        words_n4 = next(
            (words_n4 for words_n4 in n4_dict[words_n5[1], words_n5[2]]
                if (words_n4[0]+words_n4[1] == words_n5[0] and
                words_n4[4] >= words_n5[4])),
            None)
        if words_n4:
            yield words_n4[:-1] + words_n5[3:]
        else:
            # case 2: candidates are N4 lines whose 3rd and 4th words
            # equal N5's 3rd and 4th words
            words_n4 = next(
                (words_n4 for words_n4 in n4_dict[words_n5[2], words_n5[3]]
                    if (words_n4[0] in words_n5[0] and
                    words_n4[1] == words_n5[1] and
                    words_n4[4] >= words_n5[4])),
                None)
            if words_n4:
                idx = words_n5[0].find(words_n4[0])
                yield [words_n5[0][:idx], words_n5[0][idx:]] + words_n5[1:]
            else: # not found
                yield ['none'] + words_n5

with open('corrected', 'w') as output:
    for words in do_correction(iter_file('N4'), iter_file('N5')):
        output.write('%s\t%s' %(' '.join(words[:-1]), words[-1]))
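
To try it without the real files, you can feed pre-split in-memory lines straight into do_correction; split_line here is a hypothetical helper that does per line what iter_file does per file:

def split_line(line):
    words = line.split(' ')
    return words[:-1] + words[-1].split('\t')

N4 = ['a1 b1 c1 e1\t3', 'a2 b2 c2 e2\t2', 'c3 e3 f3 g3\t3']
N5 = ['a1b1 c1 e1 f1\t2', 'a2b c2 e2 f2\t1', 'b3c3 e3 f3 g3\t3']

for words in do_correction(map(split_line, N4), map(split_line, N5)):
    print(words)
# e.g. ['a1', 'b1', 'c1', 'e1', 'f1', '2'] for a restored line,
# and ['none', 'a2b', 'c2', 'e2', 'f2', '1'] where nothing matched

Building the dictionary is a single pass over N4; after that, each N5 line checks only the handful of N4 lines that share the key words, instead of all 30 million.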