What is wrong with using .readline()?
The snippet you found is not valid for files opened in text mode. It should work fine on Linux systems, but not on Windows. On Windows, the only way to get back to a previous position in a text-mode file is to seek() to one of the following: the start of the file (position 0), the end of the file, or a position previously returned by f.tell().
You cannot compute text-mode file positions in any portable way.
So use .readline() and/or .read(), together with .tell(). Problem solved ;-)
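For concreteness, here is a minimal sketch of that approach (the filename is made up for illustration; the point is that every position later handed to seek() originally came from tell()):

# Sketch: read a text file line by line, remembering where each line starts.
# Every stored position comes from f.tell(), so it is safe to hand back to
# f.seek() later, even on Windows in text mode.
positions = []
with open("data.txt") as f:
    while True:
        pos = f.tell()          # position before reading the next line
        line = f.readline()
        if not line:            # empty string means end of file
            break
        positions.append(pos)
        # ... process line here ...

# Later: jump straight back to the start of, say, the third line
# (assuming the file has at least three lines).
with open("data.txt") as f:
    f.seek(positions[2])
    print(f.readline())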
About buffering: whether buffering is used has nothing to do with whether a file can be positioned; it only affects how quickly you can get at the file's data. Buffering is an implementation detail. In particular, f.readline() will certainly be buffered under the covers (unless you explicitly turn buffering off in your open() call), but in a way that is invisible to you. The problems you run into when using a file object as an iterator come from an extra layer of buffering added by the implementation of the file iterator (which the file.next() docs call "a hidden read-ahead buffer").
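To make that concrete: the sketch below shows what happens, as far as I know, if you mix the file iterator with .tell() (on modern CPython 3 the text layer refuses outright; the old Python 2 file object instead silently reported the position of the hidden read-ahead buffer rather than of the line you were looking at). The filename is again made up.

with open("data.txt") as f:
    for line in f:
        try:
            print(f.tell())
        except OSError as exc:
            # e.g. "telling position disabled by next() call"
            print("cannot tell() while iterating:", exc)
            break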
To answer your other question, the expense of:
offset += len(line)
is trivial, but, as noted earlier, that “solution” has real problems.
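I don't know the exact snippet you found, but the pattern under discussion is roughly the following sketch, and the comments show where it goes wrong in text mode:

offset = 0
with open("data.txt") as f:     # text mode
    for line in f:
        start = offset          # supposed position where this line begins
        offset += len(line)     # the cheap part: one length and one addition
        # Problem: len(line) counts decoded characters after newline
        # translation, so on Windows (where \r\n is collapsed to \n) or with
        # multi-byte encodings, `start` is not a value you can safely pass
        # to f.seek().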
Short course: don't get prematurely tricky. Do the simplest thing that works (like .readline() + .tell()), and start worrying only if that proves to be inadequate.
More details
There are actually several layers of buffering going on. Down in the hardware, your disk drive has memory buffers inside it. Above that, your operating system maintains memory buffers too, and typically tries to be “smart” when you access a file in a uniform pattern, asking the drive to “read ahead” disk blocks in the direction you are reading, beyond the blocks you have already asked for.
CPython's I/O builds on top of the platform's C I/O libraries. The C libraries have their own memory buffers. For Python's f.tell() to “work right”, CPython has to use the C libraries in the ways C dictates.
Now there is nothing special about a “line” in any of this (well, not on any major operating system). A “line” is a software concept, typically meaning “up to and including the next \n byte (Linux), \r byte (some Mac flavors), or \r\n byte pair (Windows)”. The hardware, OS, and C buffers typically know nothing about “lines”; they just work with a stream of bytes.
Under the covers, Python's .readline() essentially “reads” one byte at a time until it sees the end-of-line byte sequence (\n, \r, or \r\n). I put “reads” in quotes because there is usually no disk access involved; it is usually just software at the various levels copying bytes out of their memory buffers. When a disk access is involved, it is typically thousands of times slower.
By doing this “one byte at a time”, the C-level libraries maintain correct results for f.tell(). But at a cost: there may be a layer of function calls for every byte obtained.
Python's file iterator, by contrast, “reads” chunks of bytes at a time into its own memory buffer. How many doesn't matter ;-) What matters is that it asks the C library to copy over multiple bytes at a time, and then CPython searches its own memory buffer for end-of-line sequences. This slashes the number of function calls needed. But at a different kind of cost: the C library's idea of where we are in the file reflects the number of bytes read into the file iterator's memory buffer, which has nothing in particular to do with the number of bytes the user's Python program has retrieved from that buffer.
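You can see the two levels of bookkeeping disagree by peeking at the raw file object underneath a buffered binary file (f.raw is a CPython io implementation detail, so treat this purely as an illustration):

with open("data.txt", "rb") as f:
    f.readline()                                           # ask for just one line
    print("bytes handed to the program:", f.tell())        # length of that line
    print("bytes pulled into the buffer:", f.raw.tell())   # typically a full 8 KiB chunk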
So yes, for line in file: is usually the fastest way to go through an entire text file line by line.
Does it matter? The only way to know for sure is to time it on real data. If you are reading a 200+ GB file, you will spend thousands of times longer doing physical disk reads than the various software layers spend searching for end-of-line byte sequences.
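If you do want to check, a rough timing harness along these lines is enough (file name invented, numbers entirely machine- and data-dependent):

import time

def iterate(path):
    with open(path) as f:
        for line in f:
            pass                      # fastest line-by-line pass

def readline_and_tell(path):
    with open(path) as f:
        while True:
            pos = f.tell()            # the extra bookkeeping being questioned
            line = f.readline()
            if not line:
                break

for func in (iterate, readline_and_tell):
    start = time.perf_counter()
    func("big.txt")
    print(func.__name__, time.perf_counter() - start)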
If it turns out that it does matter, and your data and OS are such that you can open the file in binary mode and still get correct results, then the code snippet you found will give the best of both worlds: the fastest line iteration, and correct byte positions for later .seek()'ing.
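A sketch of what that “best of both worlds” version might look like, assuming your data really is safe to treat as raw bytes (you only need byte offsets, and you don't mind lines arriving as bytes objects):

offsets = []
offset = 0
with open("data.txt", "rb") as f:      # binary mode: no newline translation
    for line in f:                     # still the fast iterator
        offsets.append(offset)         # exact byte offset where this line starts
        offset += len(line)            # len() of a bytes object is a true byte count

# Later, seek straight to the start of the last line
# (assuming the file is non-empty).
with open("data.txt", "rb") as f:
    f.seek(offsets[-1])
    print(f.readline())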