I am trying to find an efficient way to read a very large text file (about 2,000,000 lines). About 90% of these lines are in three-column format and store a sparse matrix.
Here is what I did. First, I process the first 10% of the file:
import fileinput
import shlex

i = 1        # current line number
cpt = 0      # number of header lines scanned so far
skip = 0     # flag: do not parse the very first line
finnum = 0   # flag: the end of the header has been reached
vec = []     # non-zero integers collected from the header

for line in fileinput.input("MY_TEXT_FILE.TXT"):
    if i == 1:
        skip = 1
    if finnum == 0 and skip == 0:
        # collect the non-zero integers from this header line
        tline = shlex.split(line)
        ind_loc = 0
        while ind_loc < len(tline):
            if int(tline[ind_loc]) != 0:
                vec.append(int(tline[ind_loc]))
            ind_loc = ind_loc + 1
    if finnum == 1 and skip == 0:
        # first line past the header: pause so the message can be seen, then stop
        print('finnum = 1')
        h = input()
        break
    if ' 0' in line:
        # a line containing ' 0' marks the end of the header
        finnum = 1
    if skip == 0:
        i = i + 1
    else:
        skip = 0
        i = i + 1
    cpt = cpt + 1
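(For reference, the same header scan can be written more compactly. This is only a sketch, assuming, as above, that the header ends at the first line containing ' 0' and that this marker never appears on the very first line; cpt ends up the same as in the loop above:)

import shlex

vec = []   # non-zero integers collected from the header
cpt = 1    # header lines seen so far; the skipped first line counts

with open("MY_TEXT_FILE.TXT") as f:
    next(f)   # skip the first line, like the skip flag above
    for line in f:
        cpt = cpt + 1
        # collect the non-zero integers from this header line
        vec.extend(int(tok) for tok in shlex.split(line) if int(tok) != 0)
        if ' 0' in line:   # assumed end-of-header marker
            break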
Then I read the remaining 90% into a list:
matrix = []
with open('MY_TEXT_FILE.TXT') as f:
    # skip the cpt header lines counted above
    for i in range(cpt):
        next(f)
    for line in f:
        matrix.append(line)
This reads the text file very quickly and with low memory consumption. The disadvantage is that matrix is a list of raw strings, each of which looks something like this:
>>> matrix[23]
' 5 11 8.320234929063493E-008\n'
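(As an aside, the header skip in the snippet above can also be written with itertools.islice, which consumes the first cpt lines inside the iterator; an equivalent sketch:)

from itertools import islice

with open('MY_TEXT_FILE.TXT') as f:
    matrix = list(islice(f, cpt, None))   # drop the cpt header lines, keep the rest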
Parsing these strings with shlex.split afterwards turns out to be slow.
Is there a faster way to convert matrix into three numeric arrays?
At the moment I do it like this, splitting every row into three lists:
A = [0] * len(matrix)
B = [0] * len(matrix)
C = [0] * len(matrix)
for i in range(len(matrix)):
    line = shlex.split(matrix[i])
    A[i] = float(line[0])
    B[i] = float(line[1])
    C[i] = float(line[2])
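Since the rows are plain whitespace-separated numbers with no quoting to respect, I suspect plain str.split would be considerably cheaper than shlex.split here. A minimal sketch of the same conversion, assuming every row really has exactly three columns:

A = [0.0] * len(matrix)
B = [0.0] * len(matrix)
C = [0.0] * len(matrix)
for i, row in enumerate(matrix):
    a, b, c = row.split()   # splits on any run of whitespace
    A[i] = float(a)
    B[i] = float(b)
    C[i] = float(c)

If a NumPy dependency is acceptable, something like A, B, C = numpy.loadtxt('MY_TEXT_FILE.TXT', skiprows=cpt, unpack=True) should produce the same three arrays in one call.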