I have two text files to process. Here is my situation:
- These two files are extremely large, one 1.21 GB and the other 1.1 GB. Each of them contains about 30 million lines of Chinese text.
- Each line in each file is unique.
- I do not need to modify these files; once they are downloaded, they will not change.
The problem is that one of these files is damaged. I'll call it N5. Every line of N5 should look like this: 'a5 b5 c5 d5 e5\tf5'
Instead, the first two words are fused together: 'a5b5 c5 d5 e5\tf5'
I'm trying to restore it from the other file, which I'll call N4. Its lines look like this: 'a4 b4 c4 d4\tf4'
I am trying to use N4 to split the fused a5b5 in N5, which can have three outcomes:
- 'a4 b4 c4 d4' is equal to 'a5 b5 c5 d5'
- 'a4 b4 c4 d4' is equal to 'b5 c5 d5 e5'
- There is no match in N4 for the N5 line.
In cases 1 and 2, I get an answer. In case 3, however, the search has to go through all of N4, which takes about 140 seconds.
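To put that number in perspective, here is a rough back-of-the-envelope calculation. It assumes the 140 seconds is the cost of one full scan of N4 for a single unmatched N5 line, and it takes the pessimistic case where every N5 line is unmatched; the real share of unmatched lines is unknown, so this is only an upper bound. The function name is just for illustration.

```python
SECONDS_PER_FULL_SCAN = 140   # measured time for one unmatched N5 line
N5_LINES = 30_000_000         # approximate number of lines in N5


def worst_case_years(lines, seconds_per_scan):
    """Upper bound on run time if every N5 line needs a full scan of N4."""
    seconds_per_year = 365 * 24 * 3600
    return lines * seconds_per_scan / seconds_per_year


print(f"{worst_case_years(N5_LINES, SECONDS_PER_FULL_SCAN):.0f} years")  # prints "133 years"
```

Even if only a small fraction of the lines were unmatched, the total would still be far too long.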
Currently I store N4 and N5 in lists. Below is my code to compare them.
N4 = ['a1 b1 c1 e1\t3', 'a2 b2 c2 e2\t2', 'c3 e3 f3 g3\t3']
N5 = ['a1b1 c1 e1 f1\t2', 'a2b c2 e2 f2\t1', 'b3c3 e3 f3 g3\t3']
list_result = []          # every N5 line, either repaired or prefixed with 'none'
list_result_no_none = []  # only the repaired lines
counter_none = 0
list_len = len(N4)
for n5_item in N5:
    counter_list_len = 0
    list_str_2 = n5_item.split(' ')           # [a5b5, c5, d5, 'e5\tcount']
    list_str_2_2 = list_str_2[3].split('\t')  # [e5, count]
    str_list_str_2_0 = list_str_2[0]          # the fused token a5b5
    for n4_item in N4:  # separate name so the outer loop variable is not overwritten
        list_str_1 = n4_item.split(' ')           # [a4, b4, c4, 'd4\tcount']
        list_str_1_2 = list_str_1[3].split('\t')  # [d4, count]
        # Compare the counts as integers; as strings, '10' would sort below '9'.
        count_ok = int(list_str_1_2[1]) >= int(list_str_2_2[1])
        # Case 1: a4 + b4 is the fused token, and c4, d4 match the next two words.
        if (list_str_1[0] + list_str_1[1] == list_str_2[0]
                and list_str_1[2] == list_str_2[1]
                and list_str_1_2[0] == list_str_2[2]
                and count_ok):
            repaired = ' '.join([list_str_1[0], list_str_1[1], list_str_1[2],
                                 list_str_1_2[0], list_str_2[3]])
            list_result.append(repaired)
            list_result_no_none.append(repaired)
            break
        # Case 2: a4 occurs inside the fused token, and b4, c4, d4 match the remaining words.
        elif (list_str_1[0] in list_str_2[0]
                and list_str_1[1] == list_str_2[1]
                and list_str_1[2] == list_str_2[2]
                and list_str_1_2[0] == list_str_2_2[0]
                and count_ok):
            split_at = str_list_str_2_0.find(list_str_1[0])
            repaired = ' '.join([str_list_str_2_0[:split_at], str_list_str_2_0[split_at:],
                                 list_str_1[1], list_str_1[2], list_str_2[3]])
            list_result.append(repaired)
            list_result_no_none.append(repaired)
            break
        else:
            counter_list_len += 1
            # Case 3: every N4 line has been checked without finding a match.
            if counter_list_len == list_len:
                list_result.append(' '.join(['none'] + list_str_2[:4]))
                counter_none += 1
print(list_result)
print(list_result_no_none)
print("Percentage of not found: %.2f%%" % (100 * counter_none / len(N5)))
It works on a small scale, but on the real files it takes ages.
I am new to Python and have little experience with other programming languages, so I'm sorry if my question looks silly. I am also not a native speaker, so I apologize for my poor English.
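One direction that might avoid the inner loop is to index N4 once in dictionaries, so that each N5 line needs only a couple of hash lookups instead of a scan. The following is only a sketch under several assumptions, not a tested solution for the real data: the names `parse`, `build_indexes` and `repair` are made up for illustration, every line is assumed to contain exactly four space-separated tokens followed by a tab and an integer count, and in case 2 it requires a4 to be a proper suffix of the fused token, which is stricter than the substring test in my code.

```python
from collections import defaultdict


def parse(line):
    """Split 'w1 w2 w3 w4\tcount' into (list of four tokens, count as int)."""
    text, count = line.rstrip("\n").split("\t")
    return text.split(" "), int(count)


def build_indexes(n4_lines):
    """Index N4 once so that each N5 line needs only dictionary lookups."""
    fused = defaultdict(list)  # (a4 + b4, c4, d4) -> [(a4, b4, count4), ...]
    tail = defaultdict(list)   # (b4, c4, d4)      -> [(a4, count4), ...]
    for line in n4_lines:
        (a4, b4, c4, d4), count4 = parse(line)
        fused[(a4 + b4, c4, d4)].append((a4, b4, count4))
        tail[(b4, c4, d4)].append((a4, count4))
    return fused, tail


def repair(n5_line, fused, tail):
    """Return the repaired N5 line, or None if N4 has no match."""
    (t0, t1, t2, t3), count5 = parse(n5_line)
    # Case 1: the N4 line equals the first four words of the intact N5 line.
    for a4, b4, count4 in fused.get((t0, t1, t2), ()):
        if count4 >= count5:
            return f"{a4} {b4} {t1} {t2} {t3}\t{count5}"
    # Case 2: the N4 line equals the last four words of the intact N5 line.
    for a4, count4 in tail.get((t1, t2, t3), ()):
        if count4 >= count5 and t0.endswith(a4) and len(a4) < len(t0):
            cut = len(t0) - len(a4)
            return f"{t0[:cut]} {t0[cut:]} {t1} {t2} {t3}\t{count5}"
    return None  # Case 3: no match in N4


N4 = ['a1 b1 c1 e1\t3', 'a2 b2 c2 e2\t2', 'c3 e3 f3 g3\t3']
N5 = ['a1b1 c1 e1 f1\t2', 'a2b c2 e2 f2\t1', 'b3c3 e3 f3 g3\t3']
fused_index, tail_index = build_indexes(N4)
results = [repair(line, fused_index, tail_index) for line in N5]
print(results)
```

On the sample lists this gives the same repaired lines as the nested loops, with `None` for the unmatched line. Whether two dictionaries built from 30 million lines fit in memory on my machine is something I have not checked.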