I am trying to find the best way to read and process lines from a very large file. Here I just try
for line in f:
Part of my script is as follows:
import gzip

o = gzip.open(file2, 'w')
LIST = []
f = gzip.open(file1, 'r')
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        # average of the character codes, excluding the trailing newline (ord 10)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        if ave1 < 84:
            # drop the whole 4-line group
            del LIST[-4:]
output1 = o.writelines(LIST)
My file1 is about 10 GB, and when I run the script the memory usage just climbs to 15 GB without producing any output. Does that mean the interpreter is still trying to read the entire file into memory first? If so, this is really no different from using readlines().
However, in the post "Various ways of reading big data in python", Shrika told me: "for line in f treats the file object f as an iterable, which automatically uses buffered input and memory management, so you don't have to worry about big files."
But obviously, I still need to worry about large files. I'm really confused. THX
Edit: every 4 lines belong to one group. The goal is to do a calculation on every fourth line and, based on that calculation, decide whether to keep those 4 lines. So writing out the kept lines is my goal.
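To make that concrete, here is a minimal sketch of what I think I want (not my actual script): buffer only the current 4-line group and write it out as soon as it passes the check, instead of accumulating everything in LIST. It assumes the same file1/file2 and the same 84 cutoff as above, and Python 2 style strings (since I call ord() on the characters of line). Is this the right direction?

import gzip

# Sketch only: process the file in 4-line groups, keeping at most one
# group in memory at a time.
with gzip.open(file1, 'r') as f, gzip.open(file2, 'w') as o:
    group = []                      # holds the current 4-line group
    for i, line in enumerate(f):
        group.append(line)
        if i % 4 == 3:              # fourth line of the group: run the check
            b1 = [ord(x) for x in line]
            ave1 = (sum(b1) - 10) / float(len(line) - 1)
            if ave1 >= 84:          # keep the group only at or above the cutoff
                o.writelines(group)
            group = []              # start the next group either way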