Gradual matching of regular expressions in streaming data in Python

I have data streaming in continuously over multiple TCP sockets. For each socket, I have a different regular expression from which I need to pull matches. For example, one might match numbers in the format ##.# followed by the letter f:

r = re.compile(rb'([0-9][0-9]\.[0-9])f') 

Another might match numbers in the format ### preceded by the letter Q:

 r = re.compile(rb'Q([0-9][0-9][0-9])') 
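For concreteness, here is how those two patterns behave against some sample byte strings (the sample data is made up for illustration):

```python
import re

# The two example patterns from the question.
float_re = re.compile(rb'([0-9][0-9]\.[0-9])f')
q_re = re.compile(rb'Q([0-9][0-9][0-9])')

print(float_re.search(b'temp 42.7f ok').groups())  # (b'42.7',)
print(q_re.search(b'...Q123...').groups())         # (b'123',)
```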

In reality, the expressions can be of arbitrary length and complexity, and they are read from configuration files, so they are not known in advance and cannot be hardcoded.

When new data arrives, I append it to a bytearray() buffer (here called self.buffer). Then I call a function like this (with self.r being the compiled regular expression):

    def advance(self):
        m = self.r.search(self.buffer)
        # No match. Return.
        if m is None:
            return None
        # Match. Advance the buffer and return the matched groups.
        self.buffer = self.buffer[m.end():]
        return m.groups()

If there is no match, it returns None. If there is a match, it returns the matched groups and discards the buffer up to the end of the match, leaving itself ready to be called again.
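To make the setup runnable end to end, here is a minimal self-contained sketch of that scheme; the StreamMatcher class and its feed() method are my own naming for illustration, not from the original code:

```python
import re

class StreamMatcher:
    """Minimal sketch of the buffering scheme described above."""

    def __init__(self, pattern: bytes):
        self.r = re.compile(pattern)
        self.buffer = bytearray()

    def feed(self, data: bytes):
        # New data from the socket is appended to the buffer.
        self.buffer += data

    def advance(self):
        m = self.r.search(self.buffer)
        if m is None:
            return None
        self.buffer = self.buffer[m.end():]
        return m.groups()

sm = StreamMatcher(rb'([0-9][0-9]\.[0-9])f')
sm.feed(b'12.')                # not enough data yet
print(sm.advance())            # None
sm.feed(b'4f and more')        # completes the match across the chunk boundary
print(sm.advance())            # (b'12.4',)
```

Note that the match straddles the two feed() calls, which is exactly why the buffer cannot simply be discarded between calls.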

However, this method is not particularly efficient. The problem is that the buffer has to be rescanned from the very beginning every time new data arrives, until a match is found. That can mean thousands of scans over millions of characters before a match is found, while the unmatched start of the buffer keeps growing.

I can't simply discard the contents of the buffer before a match is found, because the match may need the last few bytes of the buffer (or even the entire buffer): once more data arrives, the end of the current buffer may turn out to be the beginning of a match.

How can I rewrite my advance function so that it safely discards the portions of the buffer that could never contain the beginning of a match, so that the entire buffer does not need to be rescanned?

One possibility: is there an alternative to search that returns something other than None when the reason no match was found is that the end of the string was reached? And if so, can I get the starting position of the potential match? That would let me discard the buffer up to that point.

Another possibility: some kind of library smart enough to rewrite arbitrary regular expressions so that they can match incrementally, in a well-defined way, over truncated strings.

I'm open to other approaches, but they need to work with arbitrary regular expressions, not just simple ones like those above. Ideally they would also avoid scanning the buffer twice (once to find the actual or potential match and once to decide what to discard).

1 answer

The third-party regex module (not re) offers support for partial matching, which is a partial solution. (Lookbehinds, the ^ anchor, zero-width matches, and \b / \B all break in subtle or not-so-subtle ways when you try to drop the beginning of the window and continue the search. Given the edge cases I've thought of so far, I wouldn't be surprised if there are more.)

If you pass partial=True to regex.match, regex.search, regex.fullmatch, or regex.finditer, then in addition to reporting regular, complete matches, they also report things that don't match yet, but could if the string were extended:

    In [8]: import regex

    In [9]: regex.search(r'1234', '', partial=True)
    Out[9]: <regex.Match object; span=(0, 0), match='', partial=True>

    In [10]: regex.search(r'1234', '12', partial=True)
    Out[10]: <regex.Match object; span=(0, 2), match='12', partial=True>

    In [11]: regex.search(r'1234', '12 123', partial=True)
    Out[11]: <regex.Match object; span=(3, 6), match='123', partial=True>

    In [12]: regex.search(r'1234', '1234 123', partial=True)
    Out[12]: <regex.Match object; span=(0, 4), match='1234'>

You can tell whether the match was partial or complete from the match object's partial attribute:

    In [13]: regex.search(r'1234', '12 123', partial=True).partial
    Out[13]: True

    In [14]: regex.search(r'1234', '1234 123', partial=True).partial
    Out[14]: False

It will report a match as partial if more data could change the result of the match:

    In [21]: regex.search(r'.*', 'asdf', partial=True)
    Out[21]: <regex.Match object; span=(0, 4), match='asdf', partial=True>

    In [22]: regex.search(r'ham(?: and eggs)?', 'ham', partial=True)
    Out[22]: <regex.Match object; span=(0, 3), match='ham', partial=True>

or if more data could cause something that currently matches to no longer match:

    In [23]: regex.search(r'1(?!234)', '1', partial=True)
    Out[23]: <regex.Match object; span=(0, 1), match='1', partial=True>

    In [24]: regex.search(r'1(?!234)', '13', partial=True)
    Out[24]: <regex.Match object; span=(0, 1), match='1'>

When you reach the end of the data stream, you should stop passing partial=True, so that regex knows there is no more data coming and partial matches no longer hide complete matches.

With the partial-match information, you can discard everything before the start of the partial match and know that none of the discarded data could be part of the eventual match... but lookbehinds might need the discarded data, so it would take messy extra work to support lookbehinds if you use them. ^ also gets confused when the start of the string changes, \b / \B won't know whether a word character sat at the end of the discarded data, and it would be hard to get correct zero-width-match behavior for whatever definition of "correct" you pick. I suspect some other advanced regex features may also interact strangely when you discard data this way; regex has a lot of features.
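Putting this together, here is a sketch of how the questioner's advance() might be reworked around partial matching. The class wrapper, the feed() method, and the eof parameter are my own framing, and it assumes the pattern avoids lookbehinds, ^, \b / \B, and the other features flagged above:

```python
import regex

class StreamMatcher:
    def __init__(self, pattern: bytes):
        self.r = regex.compile(pattern)
        self.buffer = b''

    def feed(self, data: bytes):
        self.buffer += data

    def advance(self, eof=False):
        # While the stream is open, allow partial matches; at
        # end-of-stream, disable them so they can no longer hide
        # complete matches.
        m = self.r.search(self.buffer, partial=not eof)
        if m is None:
            # Nothing in the buffer can ever start a match: drop it all.
            self.buffer = b''
            return None
        if m.partial:
            # A match may be forming at m.start(); keep only that tail
            # and wait for more data.
            self.buffer = self.buffer[m.start():]
            return None
        # Complete match: consume it and return the groups.
        self.buffer = self.buffer[m.end():]
        return m.groups()

sm = StreamMatcher(rb'([0-9][0-9]\.[0-9])f')
sm.feed(b'garbage 12.')
print(sm.advance())   # None, but the buffer shrinks to b'12.'
sm.feed(b'4f tail')
print(sm.advance())   # (b'12.4',)
```

A call that returns None after a partial match has already shrunk the buffer to the earliest position where a match could still begin, so later searches never rescan the discarded prefix.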


Source: https://habr.com/ru/post/1265936/

