Findall / finditer in a stream?

Is there a way to get the functionality of re.findall, or better yet re.finditer, applied to a stream (i.e. a file descriptor open for reading)?

Please note that I do not assume that the pattern to be matched is entirely contained in one input line (i.e. multi-line patterns are allowed). I also do not assume a maximum length for a match.

Granted, at this level of generality you can specify a regular expression that forces the engine to examine the entire string (for example, r'(?sm).*'), which of course means reading the whole file into memory. But for now I'm not interested in that worst case; after all, it is entirely possible to write multi-line regular expressions that do not require reading the entire file into memory.

Is it possible to access the underlying automaton (or whatever is used internally) of a compiled regular expression, so that a stream of characters can be fed to it?

Thanks!

Edit: added clarifications regarding multi-line patterns and match lengths in response to the answers by Tim Pietzcker and rplnt.

+4
2 answers

This is possible if you know that a regex match will never span a newline.

Then you can just do

 for line in file:
     result = re.finditer(regex, line)
     # do something...

If matches can span multiple lines, you need to read the whole file into memory. Otherwise, how would you know whether your match is already complete, whether content further ahead could still extend it, or whether a match only fails because the file hasn't been read far enough?
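To illustrate this limitation, here is a minimal demonstration (my own example, not from the answer) of a newline-spanning pattern that line-by-line iteration can never find, because each line is matched in isolation:

```python
import io
import re

text = "foo\nbar\nfoo\nbar\n"
pattern = re.compile(r"foo\nbar")

# Line-by-line: no single line contains the newline-spanning match.
line_hits = [m for line in io.StringIO(text) for m in pattern.finditer(line)]

# Whole string: both occurrences are found.
whole_hits = pattern.findall(text)

print(len(line_hits))   # 0
print(len(whole_hits))  # 2
```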

Edit:

In theory this could be done: the regex engine would have to check whether, at any point during a match attempt, it has reached the end of the currently read part of the stream, and if so, read more (possibly up to EOF). But Python's engine does not do this.

Edit 2:

I looked at re.py in the Python stdlib and its related modules. The actual regex object, including its .match() method and the others, is created in a C extension. So you cannot reach in and hook into it to make it handle streams as well, short of editing the C sources and building your own version of Python.
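One stdlib workaround worth noting (my addition, not part of this answer): re can search a bytes-like buffer, and mmap objects qualify. Memory-mapping the file lets the OS page it in on demand instead of Python reading the whole file eagerly, so multi-line patterns work without an explicit full read. A small sketch:

```python
import mmap
import os
import re
import tempfile

# Write a small sample file; in practice this would be your large input.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"header\nfoo\nbar\nfooter\n")
    path = f.name

with open(path, "rb") as f:
    # Map the file read-only; re searches the buffer without loading it all.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # A newline-spanning bytes pattern works directly on the mmap.
        matches = re.findall(rb"foo\nbar", mm)

os.unlink(path)
print(matches)  # [b'foo\nbar']
```

Note that the pattern must be a bytes pattern, and the file must be non-empty for mmap to succeed.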

+3

It could be implemented for regexes with a known maximum match length: either no + / *, or ones where you know the maximum number of repetitions. If you know that bound, you can read the file in chunks, match against them, and yield the results. You would also run the regex over an overlapping fragment, which covers the case where a match exists but would otherwise be cut off at a chunk boundary.

some pseudo code (python):

 overlap_tail = ''
 matched = {}
 for chunk in file.stream(chunk_size):
     # calculate chunk_start
     for result in finditer(match, overlap_tail + chunk):
         if chunk_start + result.start() not in matched:
             yield result
             matched[chunk_start + result.start()] = result
     # delete old results from dict
     overlap_tail = chunk[-max_re_len:]

Just an idea, but I hope you see what I'm getting at. You would also need to think through the corner cases around how the file (stream) can end. But I think this can be done, provided the maximum match length is bounded and known.
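Fleshing out the pseudocode above into a runnable sketch (the function name stream_finditer and its parameters are mine; the key assumption, as in the answer, is that max_match_len bounds every match):

```python
import io
import re

def stream_finditer(stream, pattern, chunk_size=4096, max_match_len=64):
    """Yield (absolute_start, matched_text) from a text stream in chunks.

    Assumes no match is longer than max_match_len characters.
    """
    regex = re.compile(pattern)
    overlap_tail = ""
    chunk_start = 0      # absolute offset of the current chunk in the stream
    seen_starts = set()  # absolute start offsets already yielded
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = overlap_tail + chunk
        window_start = chunk_start - len(overlap_tail)
        for m in regex.finditer(window):
            abs_start = window_start + m.start()
            # A match inside the overlap may be seen twice; dedupe by offset.
            if abs_start not in seen_starts:
                seen_starts.add(abs_start)
                yield abs_start, m.group()
        # Keep only the last max_match_len chars to catch boundary matches.
        overlap_tail = window[-max_match_len:]
        chunk_start += len(chunk)
        # Forget offsets that can no longer reappear in future windows.
        seen_starts = {s for s in seen_starts if s >= chunk_start - max_match_len}

data = "foo\nbar xx foo\nbar"
hits = list(stream_finditer(io.StringIO(data), r"foo\nbar",
                            chunk_size=5, max_match_len=8))
print(hits)  # [(0, 'foo\nbar'), (11, 'foo\nbar')]
```

A match that straddles a chunk boundary is caught on the next iteration, once the overlap plus the new chunk contains it in full; matches seen in both windows are deduplicated by their absolute start offset.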

+2

Source: https://habr.com/ru/post/1402219/
