The "for line in file object" method for reading files

I am trying to find the best way to read and process lines from a very large file. Here I simply use:

for line in f: 

Part of my script is as follows:

 import gzip

 o = gzip.open(file2, 'w')
 f = gzip.open(file1, 'r')
 LIST = []
 for i, line in enumerate(f):
     if i % 4 != 3:
         LIST.append(line)
     else:
         LIST.append(line)
         b1 = [ord(x) for x in line]
         ave1 = (sum(b1) - 10) / float(len(line) - 1)
         if ave1 < 84:
             del LIST[-4:]
 output1 = o.writelines(LIST)

My file1 is about 10 GB, and when I run the script the memory usage climbs to 15 GB without producing any output. Does this mean the computer is still trying to read the entire file into memory first? If so, this is really no different from using readlines().

However, in the post Various ways of reading big data in python, Shrika told me: "for line in f treats the file object f as an iterable, which automatically uses buffered input and memory management, so you don't have to worry about big files."

But obviously I still need to worry about large files here. I'm really confused. Thanks.

Edit: every 4 lines in the file form one group. The goal is to do a calculation on every fourth line and, based on that calculation, decide whether to keep all 4 lines of the group. Writing the kept lines to the output is the goal.

+1
5 answers

It looks like at the end of this script you take all the lines you have read into memory and only then write them to the file in one go. Perhaps you can try this process instead:

  • Read the lines of the group into memory (the first 3 lines).
  • On line 4, append the line and do the calculation.
  • If the calculation gives what you are looking for, write the values collected so far to your file.
  • Either way, start a fresh collection afterward.

I have not tried this, but it might look something like this:

 import gzip

 o = gzip.open(file2, 'w')
 f = gzip.open(file1, 'r')
 LIST = []
 for i, line in enumerate(f):
     if i % 4 != 3:
         LIST.append(line)
     else:
         LIST.append(line)
         b1 = [ord(x) for x in line]
         ave1 = (sum(b1) - 10) / float(len(line) - 1)
         # If we've found what we want, save the group to the file
         if ave1 >= 84:
             o.writelines(LIST)
         # Release the values in the list by starting a clean list to work with
         LIST = []

EDIT: On reflection, since your file is so large this might not be the fastest method because of how many separate writes you would make to the file, but it might be worth exploring anyway.
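For what it's worth, here is an untested sketch of the same idea using itertools.islice to pull exactly four lines per group, which avoids the enumerate bookkeeping entirely (it assumes the same Python 2 style gzip usage as the question):

 import gzip
 from itertools import islice

 # Pull exactly four lines per group; nothing accumulates between groups.
 with gzip.open(file1, 'r') as f, gzip.open(file2, 'w') as o:
     while True:
         group = list(islice(f, 4))  # next four lines, or fewer at EOF
         if len(group) < 4:
             break
         line = group[3]             # the fourth line carries the score
         ave1 = (sum(ord(x) for x in line) - 10) / float(len(line) - 1)
         if ave1 >= 84:
             o.writelines(group)     # keep the whole group
         # otherwise the group is simply dropped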

+1

The reason memory usage keeps increasing even though you iterate with enumerate is that you call LIST.append(line). That accumulates all the lines of the file in the list, and obviously all of it sits in memory. You need to find a way not to accumulate lines like that: read, process, and move on to the next.

Another way you could do this is to read your file in chunks (in fact, reading one line at a time is a special case of this: one chunk == one line): read a small part of the file, process it, then read the next chunk, and so on. I would still argue that this is the best way to read files in Python, large or small:

 with open(...) as f:
     for line in f:
         <do something with line>

The with statement handles opening and closing the file, including when an exception is raised in the inner block. for line in f treats the file object f as an iterable, which automatically uses buffered input and memory management, so you don't have to worry about large files.
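Applied to the gzipped file from the question, the same pattern looks like this (a minimal sketch; process_line stands in for whatever you do with each line):

 import gzip

 # Only the current line (plus a small internal buffer) is ever in memory.
 with gzip.open(file1, 'r') as f:
     for line in f:
         process_line(line)  # placeholder: handle one line, keep nothing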

+4

Since you add every line to the list LIST and only occasionally delete some lines from it, LIST grows longer and longer. All the lines you store in LIST take up memory. Do not keep lines in a list unless you want them to occupy memory.

Also, your script doesn't seem to output anything anywhere, so the point of it all is not very clear.

0

You already know what your problem is from the other comments and answers, but let me spell it out anyway.

You read only one line at a time into memory, but you then keep a significant portion of the file in memory by appending those lines to a list.

To avoid this, if your algorithm were more complicated, you would need to store intermediate data in the file system or in a database (on disk) for later lookup.

From what I can see, though, you can easily write the output incrementally. That is, you are currently using one list to hold both valid lines that will be written to the output and temporary lines that you may delete at some point. To be memory-efficient, write the lines from your temporary list as soon as you know they are valid output.

In short, use your list only for the temporary data needed to perform your calculation, and as soon as a group of lines is ready, write it to disk and delete it from main memory (in Python, that means keeping no more references to it).
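As an untested sketch of that idea, a generator can hold at most one four-line group at a time and hand valid groups straight to the writer (threshold and score formula taken from the question):

 import gzip

 def valid_groups(f):
     # Yield each four-line group whose fourth line passes the check.
     group = []
     for line in f:
         group.append(line)
         if len(group) == 4:
             quality = group[3]
             ave1 = (sum(ord(x) for x in quality) - 10) / float(len(quality) - 1)
             if ave1 >= 84:
                 yield group
             group = []  # drop the reference either way

 with gzip.open(file1, 'r') as f, gzip.open(file2, 'w') as o:
     for group in valid_groups(f):
         o.writelines(group)  # written immediately, nothing accumulates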

0

If you do not use the with statement, you must close the file handles yourself:

 o.close()
 f.close()
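Or, as a sketch, let with do the closing for you, even when the loop raises an exception:

 import gzip

 # Both files are closed automatically when the block exits.
 with gzip.open(file1, 'r') as f, gzip.open(file2, 'w') as o:
     for line in f:
         pass  # read, filter, and write here as in the other answers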
0

Source: https://habr.com/ru/post/903461/

