Searching for a regex in a large file using Python

I am trying to find the "path" token in a file and then read the digits that follow it as an integer N (so for ":path,123" I find the token, then read the integer 123). Then I read the N characters between the current position and pos + 123 (saving them in a list or something similar). Then I search for the next "path" match and repeat the process.

I need a function something like:

    def fregseek(FILE, current_seek, /regex/):
        .
        .
        value_found = ?  # result of reading the next N chars after ":path,[0-9]+"
        .
        .
        return next_start_seek, value_found

A line can have any number of ':path,' matches, and the token can also occur inside the run of characters whose length is given after the ','. I wrote a dirty pile of garbage that reads line by line; for each line it chomps the first N characters given by the match, then keeps processing the rest of the line until it is consumed, then reads the next line, and so on.

This is terrible. I don't want to split a potentially huge file into lines when what I really need is a search (especially since newlines don't matter here, so adding an extra processing step just because lines are easy to extract from files is absurd).

So, that is the problem I would like to solve: find a match, read the value, continue from the end of that value, look for the next match, and so on until the file is exhausted.
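To make the interface concrete, here is an untested sketch of roughly what I imagine (the name fregseek, the return convention, and the read of the whole remainder are all just placeholders, not requirements):

```python
import re

# Hypothetical sketch: scan from `current_seek` for ":path,<count>",
# return the seek position just after the value, plus the value itself.
def fregseek(f, current_seek, pattern=re.compile(rb':path,([0-9]+)')):
    f.seek(current_seek)
    data = f.read()  # a real version would read bounded chunks, not the rest
    m = pattern.search(data)
    if m is None:
        return None, None  # no more matches: file exhausted
    count = int(m.group(1))
    value = data[m.end():m.end() + count]
    next_start_seek = current_seek + m.end() + count
    return next_start_seek, value
```

The caller would then loop: start at position 0, feed next_start_seek back in each time, and stop when it comes back as None.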

If anyone can help me, I will be happy to hear from them :)

I would like to avoid non-standard libraries if possible. I would also prefer short code, but that is the least of my concerns (speed and memory consumption are the important factors); I just don't want 50 extra lines of code to load some library for a small bit of functionality I could inline myself if I only knew what it was.

I would prefer Python, but if Perl beats Python here I will use Perl instead; I am also open to clever sed/awk/bash scripts etc., as long as they are not terribly slow.

Thank you very much in advance.

2 answers

If you do not actually need regular expressions, you can do this with just find and slicing.

In any case, the trivial solution is to read the entire file into memory and use find and slicing on the resulting str/bytes object.

But that does not work if you cannot (or do not want to) read the entire file into memory.

Fortunately, if you can count on your files being < 2 GB, or you only need to work on 64-bit Python and you are on a reasonable platform (POSIX, modern Windows, etc.), you can mmap the file instead. An mmap object supports a subset of the same methods as strings, so you can mostly pretend you have a string, just as if you had read the entire file into memory, while counting on Python and the OS to make it work with reasonable efficiency.
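For instance, here is a quick demonstration that an mmap supports find and slicing just like a bytes object (the scratch file is created only so there is something to map):

```python
import mmap
import os
import tempfile

# Write a small scratch file so there is something to map.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'BLAH:path,3abc')

with open(path, 'rb') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    n = m.find(b':path,')   # same API as bytes.find
    header = m[n:n + 6]     # slicing works too
    m.close()
os.remove(path)
```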

Depending on your version of Python, re may or may not be able to scan an mmap as if it were a string: it might fail, work but be slow, or work fine. So you can try this first, and if it does not throw an exception or run much slower than you expected, you are done:

    import mmap
    import re

    def findpaths(fname):
        with open(fname, 'rb') as f:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            # bytes pattern: a file opened in 'rb' maps to bytes-like data
            for match in re.finditer(rb':path,([0-9]+)', m):
                yield m[match.end():match.end() + int(match.group(1))]

(This is the same as BrtH's answer, just using mmap instead of a string, and reorganized into a generator instead of a list; of course you can always get a list back with list(findpaths(fname)).)

If you are using an older (or non-CPython?) version of Python that cannot (efficiently) run re over an mmap, it is a little more complicated:

    import itertools
    import mmap

    def nextdigits(s, start):
        # collect the run of ASCII digit bytes starting at `start`
        # (iterating a mmap yields ints in Python 3, hence the byte-range test)
        return bytes(itertools.takewhile(lambda b: 0x30 <= b <= 0x39,
                                         itertools.islice(s, start, None)))

    def findpaths(fname):
        with open(fname, 'rb') as f:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            i = 0
            while True:
                n = m.find(b':path,', i)
                if n == -1:
                    return
                countstr = nextdigits(m, n + 6)
                count = int(countstr)
                n += 6 + len(countstr)  # n now points at the start of the value
                yield m[n:n + count]
                i = n + count           # resume right after the value

This is probably not the fastest way to write the nextdigits function. I'm not sure it really matters (time it to find out), but if it does, other possibilities are to slice out m[n+6:n+A_BIG_ENOUGH_NUMBER] and run a regex over that, or to write an explicit loop, or ... On the other hand, if this is the bottleneck, you would probably get much more benefit from switching to a JIT-based interpreter (PyPy, Jython, or IronPython) ...
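As a sketch of the "slice out m[n+6:n+A_BIG_ENOUGH_NUMBER]" alternative just mentioned (the bound of 20 digits is an assumption; pick whatever bound you trust for your data):

```python
import re

A_BIG_ENOUGH_NUMBER = 20  # assumed upper bound on how many digits a count can have
_DIGITS = re.compile(rb'[0-9]*')

def nextdigits(m, start):
    # Slice a bounded window out of the mmap/bytes object and regex it;
    # re.match anchors at the start of the window, so this grabs the digit run.
    window = m[start:start + A_BIG_ENOUGH_NUMBER]
    return _DIGITS.match(window).group(0)
```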

In my tests I actually split things up differently: findpaths takes a string-like object, and the caller does the open and mmap bit and just passes m to findpaths; I have not done it that way here purely for brevity.

In any case, I tested both versions on the following data:

 BLAH:path,3abcBLAH:path,10abcdefghijklmnBLAH:path,3abc:path,0:path,3abc 

And the output was:

    abc
    abcdefghij
    abc
    abc

(plus an empty value for the ':path,0' match), which I believe is correct.

If my earlier version spun at 100% CPU, I would guess I was not incrementing i correctly in the loop; that is the most common cause of that behavior in a tight search loop. In any case, if you can reproduce it with the current version, post the data.


You can do this in almost one line of Python:

    import re

    with open('filename.txt') as f:
        text = f.read()

    results = [text[i[0]:i[0] + i[1]]
               for i in ((m.end(), int(m.group(1)))
                         for m in re.finditer(':path,([0-9]+)', text))]

Note: untested ...
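For what it's worth, here is the same expression checked against the sample data from the other answer, inlined as a string instead of read from a file:

```python
import re

# Same list comprehension, run over the other answer's sample data.
text = 'BLAH:path,3abcBLAH:path,10abcdefghijklmnBLAH:path,3abc:path,0:path,3abc'
results = [text[i[0]:i[0] + i[1]]
           for i in ((m.end(), int(m.group(1)))
                     for m in re.finditer(':path,([0-9]+)', text))]
# results == ['abc', 'abcdefghij', 'abc', '', 'abc']
```

Note that the ':path,0' match yields an empty string, which is easy to miss in printed output.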


Source: https://habr.com/ru/post/1436477/
