You don't need regular expressions for this; you can do it with just find and slicing.
In any case, the trivial solution is to read the entire file into memory, then find and slice the resulting str / bytes object.
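As a rough sketch of that in-memory approach, assuming the data uses the ":path,<count><payload>" layout shown in the test data further down (the function name find_paths_in_memory is made up for illustration):

```python
def find_paths_in_memory(data):
    """Yield the payload following each b':path,<count>' marker in a bytes object."""
    marker = b':path,'
    pos = 0
    while True:
        start = data.find(marker, pos)
        if start == -1:
            return
        digits_start = start + len(marker)
        digits_end = digits_start
        # Advance over the decimal count that follows the marker.
        while digits_end < len(data) and data[digits_end:digits_end + 1].isdigit():
            digits_end += 1
        if digits_end == digits_start:
            # Marker without a count: skip it and keep searching.
            pos = digits_start
            continue
        count = int(data[digits_start:digits_end])
        yield data[digits_end:digits_end + count]
        pos = digits_end + count
```

With a real file you would get data from open(fname, 'rb').read(), which is exactly the step that stops being practical for very large files.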
But this does not work if you cannot (or do not want to) read the entire file into memory.
Fortunately, if you can count on your files being well under 2 GB, or you only need to work in 64-bit Python, and you are on a reasonable platform (POSIX, modern Windows, etc.), you can mmap the file into memory. An mmap object has a subset of the same methods as strings, so you can just pretend you have a string, exactly as if you had read the entire file into memory, and count on Python and the OS to make it work with reasonable efficiency.
Depending on your version of Python, re might not be able to scan an mmap as if it were a string, or it might work but be slow, or it might work fine. So you can try this first, and if it doesn't raise an exception or run much slower than you expected, you're done:
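    import mmap
    import re

    def findpaths(fname):
        with open(fname, 'rb') as f:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            # The file is opened in binary mode, so on Python 3 the pattern must be bytes.
            for match in re.finditer(br':path,([0-9]+)', m):
                # group(1) is the payload length; the payload starts right after the match.
                yield m[match.end():match.end()+int(match.group(1))]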
(This is the same as BrtH's answer, just using mmap instead of a string and reorganized into a generator instead of a list, although of course you could do the last part by simply replacing the square brackets with parentheses.)
If you are using an older (or non-CPython?) version of Python that cannot (efficiently) run re over an mmap, this is a little more complicated:
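    import itertools
    import mmap

    def nextdigits(s, start):
        # Collect the run of ASCII digits beginning at s[start].
        # One-byte slices are bytes objects for both bytes and mmap inputs.
        chars = (s[i:i+1] for i in range(start, len(s)))
        return b''.join(itertools.takewhile(bytes.isdigit, chars))

    def findpaths(fname):
        with open(fname, 'rb') as f:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            i = 0
            while True:
                n = m.find(b':path,', i)
                if n == -1:
                    return
                countstr = nextdigits(m, n+6)
                count = int(countstr)
                n += 6 + len(countstr)  # n now points at the first payload byte
                yield m[n:n+count]
                i = n + count  # resume searching right after the payload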
This is probably not the fastest way to write a nextdigits function. I'm not sure whether that really matters (time it and see), but if it does, other possibilities are to slice out m[n+6:n+A_BIG_ENOUGH_NUMBER] and work on that, or write a custom loop, or ... On the other hand, if this is a bottleneck, you may get much more benefit from switching to an interpreter with a JIT (PyPy, Jython, or IronPython) ...
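For what it's worth, the "slice out a big enough chunk" idea might look something like the sketch below. The name nextdigits_windowed and the MAX_COUNT_DIGITS bound are assumptions of mine, not anything from the question; pick a bound that fits your data.

```python
MAX_COUNT_DIGITS = 20  # assumed upper bound on how many digits a count can have

def nextdigits_windowed(s, start):
    """Return the run of ASCII digits at s[start:], looking at a small window only."""
    # Slicing a bytes or mmap object gives a small bytes object we can work on cheaply.
    window = s[start:start + MAX_COUNT_DIGITS]
    # Strip the leading digits; whatever was removed is the digit run we want.
    rest = window.lstrip(b'0123456789')
    return window[:len(window) - len(rest)]
```

This does one slice and one lstrip per marker instead of a Python-level step per digit, at the cost of silently truncating any count longer than the window.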
In my tests, I split things up: findpaths takes a string-like object, and the caller does the with open and mmap part and just passes m to findpaths; I haven't done that here only for brevity.
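That split might look roughly like this. It is a sketch built on the regex approach; paths_from_file is a hypothetical name for the caller side, and it assumes a Python whose re can scan an mmap.

```python
import mmap
import re

def findpaths(buf):
    """Yield payloads from any bytes-like object (bytes, mmap, ...)."""
    for match in re.finditer(br':path,([0-9]+)', buf):
        end = match.end()
        yield buf[end:end + int(match.group(1))]

def paths_from_file(fname):
    """Caller side: open and mmap the file, then hand the mapping to findpaths."""
    with open(fname, 'rb') as f:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        return list(findpaths(m))
```

The nice part of this layout is that findpaths can be exercised directly on a bytes literal, with no file involved.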
In any case, I tested both versions using the following data:
BLAH:path,3abcBLAH:path,10abcdefghijklmnBLAH:path,3abc:path,0:path,3abc
And the result was (the fourth line is empty, from the zero-length :path,0 entry):

    abc
    abcdefghij
    abc
    
    abc

Which I think is right?
If my earlier version made it spin at 100% CPU, I would guess that I wasn't incrementing i correctly in the loop, which is the most common reason you get that behavior in a tight loop. In any case, if you can reproduce it with the current version, please post the data.