In Python, can I iterate over a large text file using buffering and get the correct file position at the same time?

I am trying to find some keywords in a large text file (~232 GB). I want to use buffering for speed, and I also want to record the starting positions of the lines containing these keywords.

I have seen many posts discussing similar problems. However, the buffered solutions (using the file object as an iterator) cannot give the correct file position, and the solutions that do give the correct file position usually just use f.readline(), which does not use buffering.

The only answer I have seen that can do both is this:

    # Read in the file once and build a list of line offsets
    line_offset = []
    offset = 0
    for line in file:
        line_offset.append(offset)
        offset += len(line)
    file.seek(0)

    # Now, to skip to line n (with the first line being line 0), just do
    file.seek(line_offset[n])

However, I am not sure whether the offset += len(line) operation costs extra time. Is there a more direct way to do this?

UPDATE:

I did some timing, and it seems that .readline() is much slower than using the file object as an iterator, on Python 2.7.3. I used the following code:

    #!/usr/bin/python
    from timeit import timeit

    MAX_LINES = 10000000

    # use file object as iterator
    def read_iter():
        with open('tweets.txt', 'r') as f:
            lino = 0
            for line in f:
                lino += 1
                if lino == MAX_LINES:
                    break

    # use .readline()
    def read_readline():
        with open('tweets.txt', 'r') as f:
            lino = 0
            for line in iter(f.readline, ''):
                lino += 1
                if lino == MAX_LINES:
                    break

    # use offset += len(line) to simulate f.tell() under binary mode
    def read_iter_tell():
        offset = 0
        with open('tweets.txt', 'rb') as f:
            lino = 0
            for line in f:
                lino += 1
                offset += len(line)
                if lino == MAX_LINES:
                    break

    # use f.tell() with .readline()
    def read_readline_tell():
        with open('tweets.txt', 'rb') as f:
            lino = 0
            for line in iter(f.readline, ''):
                lino += 1
                offset = f.tell()
                if lino == MAX_LINES:
                    break

    print("iter: %f" % timeit("read_iter()", number=1, setup="from __main__ import read_iter"))
    print("readline: %f" % timeit("read_readline()", number=1, setup="from __main__ import read_readline"))
    print("iter_tell: %f" % timeit("read_iter_tell()", number=1, setup="from __main__ import read_iter_tell"))
    print("readline_tell: %f" % timeit("read_readline_tell()", number=1, setup="from __main__ import read_readline_tell"))

And the result looks like this:

    iter: 5.079951
    readline: 37.333189
    iter_tell: 5.775822
    readline_tell: 38.629598
1 answer

What's wrong with using .readline()?

The snippet you selected is not a valid pattern for files opened in text mode. It may happen to work on Linux systems, but not on Windows. On Windows, the only positions you can legitimately .seek() to in a text file are:

  • 0 (the beginning of the file).

  • The end of the file.

  • A position previously returned by f.tell().

You cannot compute text-mode file positions yourself in any portable way.

So use .readline() and/or .read(), together with .tell(). Problem solved ;-)
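As a minimal sketch of that recommendation (the file name tweets.txt comes from the question; the keywords are placeholders), recording the starting offset of every line containing a keyword could look like this:

    # Record the byte offset of each line containing a keyword, using
    # only .readline() and .tell(), which is portable even in text mode.
    keywords = ('foo', 'bar')            # placeholder keywords
    hits = []                            # (offset, line) pairs

    with open('tweets.txt', 'r') as f:
        while True:
            offset = f.tell()            # position BEFORE reading the line
            line = f.readline()
            if not line:                 # empty string means end of file
                break
            if any(k in line for k in keywords):
                hits.append((offset, line))

    # Later, f.seek(hits[i][0]) jumps back to the start of the i-th match.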

About buffering: whether buffering is used has nothing to do with how you access the file; it has entirely to do with how the file is opened. Buffering is an implementation detail. In particular, f.readline() certainly is buffered under the covers (unless you explicitly disable buffering in your open() call), but in a way that's invisible to you. The problems with using a file object as an iterator come from an additional layer of buffering added by the file iterator's implementation (what the file.next() docs call a "hidden read-ahead buffer").
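A quick way to see that hidden buffer in action (a small demo added here for illustration; sample.txt is any placeholder file with a few short lines):

    # Mixing iteration with .tell() is unreliable.
    with open('sample.txt', 'r') as f:
        for line in f:
            # Python 2: prints a position far past the first line, because
            # the iterator has already slurped a large chunk into its
            # hidden read-ahead buffer.
            # Python 3: raises OSError ("telling position disabled by
            # next() call") rather than report a misleading position.
            print(f.tell())
            break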

To answer your other question, the cost of:

    offset += len(line)

is trivial; but, as noted before, that "solution" has real problems.

Short course: don't get fancy prematurely. Do the simplest thing that works (like .readline() + .tell()), and start worrying about it only if that proves inadequate.

More details

In fact, there are several layers of buffering going on. The hardware in your disk drive has memory buffers. On top of that, your operating system maintains memory buffers too, and usually tries to be "smart" when you access a file in a uniform pattern, asking the drive to "read ahead" disk blocks in the direction you're reading, beyond the blocks you have already asked for.

CPython's I/O is built on top of the platform's C I/O libraries. The C libraries have their own memory buffers. For Python's f.tell() to "work right", CPython has to use the C libraries in the ways C dictates.

Now there's nothing special about a "line" (well, not on any of the major operating systems). A "line" is a software concept, usually meaning "up to and including the next \n byte (Linux), \r byte (some Mac flavors), or \r\n byte pair (Windows)". The OS and C levels generally know nothing about "lines"; they just work with streams of bytes.

Under the covers, Python's .readline() essentially "reads" one byte at a time until it sees the platform's end-of-line byte sequence (\n, \r, or \r\n). I put "reads" in quotes because there's usually no disk access involved; it's usually just software at various levels copying bytes out of memory buffers. When a disk access is involved, it's thousands of times slower.

By doing this "one byte at a time", the C-level libraries maintain correct results for f.tell(). But at a cost: there may be layers of function calls for each byte obtained.

Python's file iterator instead reads a chunk of bytes at a time into its own memory buffer. "How much" doesn't matter ;-) What matters is that it asks the C library to copy over multiple bytes at a time, and then CPython scans its own memory buffer for end-of-line sequences. This slashes the number of function calls needed. But at a different kind of cost: the C library's idea of where we are in the file reflects the number of bytes read into the file iterator's memory buffer, which has nothing to do with how many bytes the user's Python program has consumed from that buffer.
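To make that concrete, here's a rough sketch of the same chunking idea done by hand (my own illustration, not CPython's actual implementation): read fixed-size binary chunks, scan them for newlines yourself, and compute each line's true byte offset from the chunk's starting position.

    # Hand-rolled chunked line reader that tracks true byte offsets.
    def lines_with_offsets(path, chunk_size=1 << 16):
        with open(path, 'rb') as f:
            base = 0          # file offset of the first byte of `leftover`
            leftover = b''    # partial line carried between chunks
            while True:
                chunk = f.read(chunk_size)   # one C-level call, many bytes
                if not chunk:
                    if leftover:             # file didn't end with \n
                        yield base, leftover
                    return
                data = leftover + chunk
                start = 0
                while True:
                    nl = data.find(b'\n', start)
                    if nl == -1:
                        break
                    yield base + start, data[start:nl + 1]
                    start = nl + 1
                base += start
                leftover = data[start:]

Each yielded pair is (byte offset, line), so a later f.seek(offset) lands exactly at the start of that line.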

So yes, for line in file: is usually the fastest way to go through an entire text file line by line.

Does it matter? The only way to know for sure is to time it on real data. If you're reading a 200+ GB file, you'll spend enormously more time doing physical disk reads than the various software layers spend searching for end-of-line byte sequences.

If it does turn out to matter, and your data and OS are such that you can open the file in binary mode and still get correct results, then the snippet you found gives the best of both worlds (the fastest line iteration, and correct byte positions for later .seek()'ing).
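Applied to the original keyword problem, that looks something like this sketch (binary mode, so the offsets are real byte positions; the file name and keywords are placeholders again):

    # Best of both worlds: fast iteration plus correct byte offsets.
    # Valid only because the file is opened in binary mode.
    keywords = (b'foo', b'bar')          # placeholder keywords, as bytes
    matches = []                         # byte offsets of matching lines

    offset = 0
    with open('tweets.txt', 'rb') as f:
        for line in f:
            if any(k in line for k in keywords):
                matches.append(offset)
            offset += len(line)

    # Later: f.seek(matches[i]) jumps to the start of the i-th match.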


Source: https://habr.com/ru/post/956337/

