Python multi-line matching

Question


I have read all the articles I could find, and even understood some of them, but as someone new to Python I am still a little lost and hoping for help :)

I'm working on a script to parse items of interest from a specific application log file. Each line starts with a timestamp that I can match, and I can define two things to identify what I want to capture: some partial content, and a line that marks the end of what I want to extract.

My problem is multi-line entries. In most cases each log record ends with a newline, but some records contain SQL, which can itself contain newlines and therefore spills over onto additional lines in the log.

So, in a simple case, I can have this:

[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)

It all looks like one line that I can match with this:

 re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2}).*(milliseconds)')

However, in some cases there may be line breaks in the SQL, and I still want to capture the whole entry (and possibly replace the line breaks with spaces). I am currently reading the file one line at a time, which obviously won't work for this...

Do I need to process the entire file in one go? The files are usually around 20 MB. How can I read the entire file and iterate over it, looking for single or multi-line blocks?
How can I write a multi-line regex that matches an item whether it sits on one line or is spread over several lines?

My overall goal is to parameterize this so that I can use it to retrieve log entries that match different start-line patterns (always at the beginning of a line), an end-line pattern (where I want to stop capturing), and a value between them that acts as an identifier.

Thanks in advance for your help!

Chris.

    import sys, getopt, os, re

    sourceFolder = 'C:/MaxLogs'
    logFileName = sourceFolder + "/Test.log"
    lines = []
    print "--- START ----"
    lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
    lineContains = re.compile('.*BMXAA6720W.*')
    lineEndsWith = re.compile('(?:.*milliseconds.*)')
    lines = []
    with open(logFileName, 'r') as f:
        for line in f:
            if lineStartsWith.match(line) and lineContains.match(line):
                if lineEndsWith.match(line) :
                    print 'Full Line Found'
                    print line
                    print "- Record Separator -"
                else:
                    print 'Partial Line Found'
                    print line
                    print "- Record Separator -"
    print "--- DONE ----"

As the next step, for a partial line, I will continue reading until I find a line matching lineEndsWith and collect the lines into one block.

I am not an expert, so suggestions are always welcome!

UPDATE: I got it working, thanks to all the answers that pointed me in the right direction. I understand that this is ugly, and that I need to clean up my if/elif mess and make it more efficient, but IT'S WORKING! Thanks for the help.

    import sys, getopt, os, re

    sourceFolder = 'C:/MaxLogs'
    logFileName = sourceFolder + "/Test.log"
    print "--- START ----"
    lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
    lineContains = re.compile('.*BMXAA6720W.*')
    lineEndsWith = re.compile('(?:.*milliseconds.*)')
    lines = []
    multiLine = False
    with open(logFileName, 'r') as f:
        for line in f:
            if lineStartsWith.match(line) and lineContains.match(line) and lineEndsWith.match(line):
                lines.append(line.replace("\n", " "))
            elif lineStartsWith.match(line) and lineContains.match(line) and not multiLine:
                #Found the start of a multi-line entry
                multiLineString = line
                multiLine = True
            elif multiLine and not lineEndsWith.match(line):
                multiLineString = multiLineString + line
            elif multiLine and lineEndsWith.match(line):
                multiLineString = multiLineString + line
                multiLineString = multiLineString.replace("\n", " ")
                lines.append(multiLineString)
                multiLine = False
    for line in lines:
        print line
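One way the if/elif chain above could be tidied up, as a sketch only (written for Python 3, unlike the code above). The function name `collect_entries`, the shortened patterns, and the sample lines are illustrative assumptions rather than part of the original script. It takes the three patterns as parameters and joins continuation lines with a single space.

```python
import re

def collect_entries(lines, starts_with, contains, ends_with):
    """Yield log entries as single strings, joining continuation lines with spaces."""
    parts = None  # None means we are not currently inside an entry
    for line in lines:
        if parts is None:
            # Only a line with both the start pattern and the marker opens an entry.
            if not (starts_with.match(line) and contains.search(line)):
                continue
            parts = []
        parts.append(line.rstrip('\n'))
        if ends_with.search(line):
            # The end marker closes the entry, whether it spans one line or many.
            yield ' '.join(parts)
            parts = None

# Hypothetical, simplified patterns and sample data for illustration.
starts_with = re.compile(r'\[\d{1,2}/\d{1,2}/\d{2} ')
contains = re.compile(r'BMXAA6720W')
ends_with = re.compile(r'milliseconds')

log = [
    "[8/21/13 11:30:33:557 PDT] BMXAA6720W select 1 (execution took 5 milliseconds)\n",
    "[8/21/13 11:30:34:000 PDT] some other message\n",
    "[8/21/13 11:30:35:000 PDT] BMXAA6720W select *\n",
    "from item\n",
    "order by itemnum (execution took 7 milliseconds)\n",
]
entries = list(collect_entries(log, starts_with, contains, ends_with))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list, so the file is still read one line at a time.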


python regex

Chris Aug 28 '13 at 17:30


2 answers

abarnert · Answer 1 · 2013-08-28T17:56:31+0000

Do I need to process the entire file in one go? They usually have a size of 20 MB. How can I read the entire file and iterate through it to search for single or multi-line blocks?

There are two options.

You can read the file chunk by chunk, making sure to attach any "leftover" bits at the end of each chunk to the start of the next, and search each chunk. Of course, you will have to work out what counts as "leftover" by looking at your data format and what your regular expression can match, and in theory a single leftover could span several chunks...

Or you could just mmap the file. An mmap acts like bytes (or like str in Python 2.x) and leaves it to the OS to page blocks in and out as needed. Unless you are trying to deal with absolutely huge files (gigabytes on 32-bit, even more on 64-bit), this is trivial and efficient:
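A minimal sketch of the chunked approach, assuming every record starts at the beginning of a line with a `[` timestamp and ends at the first `milliseconds)` after it. The name `iter_records`, the pattern, the chunk size, and the sample text are all made up for illustration. Here the leftover is simply everything after the last complete match in the buffer, so it keeps growing until a match is found.

```python
import io
import re

# Hypothetical pattern: "[<date> " at the start of a line, then the shortest
# run of text (newlines included) up to "milliseconds)".
RECORD = re.compile(r'^\[\d{1,2}/\d{1,2}/\d{2} .*?milliseconds\)',
                    re.DOTALL | re.MULTILINE)

def iter_records(fileobj, pattern=RECORD, chunk_size=64 * 1024):
    """Yield every match of pattern, reading fileobj chunk by chunk."""
    buf = ''
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        end = 0
        for match in pattern.finditer(buf):
            yield match.group(0)
            end = match.end()
        # Keep everything after the last complete match: it may be the
        # beginning of a record that only finishes in a later chunk.
        buf = buf[end:]

# A tiny chunk size forces records to straddle chunk boundaries.
sample = io.StringIO(
    "[8/21/13 11:30:33:557 PDT] BMXAA6720W select *\n"
    "from item (execution took 2083 milliseconds)\n"
    "[8/21/13 11:30:34:102 PDT] BMXAA6720W select 1 (execution took 3 milliseconds)\n"
)
records = list(iter_records(sample, chunk_size=16))
```

For a real log, the file would be opened in text mode and passed in place of the `StringIO` object.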

    with open('bigfile', 'rb') as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
            for match in compiled_re.finditer(m):
                do_stuff(match)

In older versions of Python, mmap is not a context manager, so you need to wrap contextlib.closing around it (or just call close explicitly if you prefer).
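A sketch of the `contextlib.closing` variant, written for Python 3. One thing to watch there: an mmap is bytes-like, so the pattern has to be compiled from a bytes literal. The helper name `find_in_file` and the pattern are assumptions for illustration, and note that mapping an empty file raises `ValueError`.

```python
import contextlib
import mmap
import re

# Hypothetical bytes pattern; a str pattern cannot search an mmap in Python 3.
RECORD = re.compile(rb'\[(.*?)\].*?milliseconds\)', re.DOTALL)

def find_in_file(path, pattern=RECORD):
    """Return every match of pattern in the file at path, decoded to str."""
    with open(path, 'rb') as f:
        # closing() calls m.close() on exit, so this also works where mmap
        # objects are not context managers themselves.
        with contextlib.closing(
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
            # Build the list before the map is closed.
            return [match.group(0).decode('utf-8', 'replace')
                    for match in pattern.finditer(m)]
```

The matches are collected inside the `with` block because the mapped data is no longer accessible once the map has been closed.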

How can I write a multi-line RegEx that matches either the entire item on one line or is spread across multiple lines?

You can use the DOTALL flag, which makes . match newlines. Alternatively, you can use the MULTILINE flag and put in the appropriate $ and/or ^ characters, but that makes simple cases much harder, and it is rarely necessary. Here is an example with DOTALL (using a simpler regex to make it more obvious):

    >>> s1 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)"""
    >>> s2 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM'))
    ... order by itemnum asc (execution took 2083 milliseconds)"""
    >>> r = re.compile(r'\[(.*?)\].*?milliseconds\)', re.DOTALL)
    >>> r.findall(s1)
    ['8/21/13 11:30:33:557 PDT']
    >>> r.findall(s2)
    ['8/21/13 11:30:33:557 PDT']

As you can see, the second .*? matches a newline as easily as a space.

If you're just trying to treat a newline as whitespace, you don't need either flag; '\s' already matches newlines.

For instance:

    >>> s1 = 'abc def\nghi\n'
    >>> s2 = 'abc\ndef\nghi\n'
    >>> r = re.compile(r'abc\s+def')
    >>> r.findall(s1)
    ['abc def']
    >>> r.findall(s2)
    ['abc\ndef']

Chrismit · Answer 2 · 2013-08-28T18:01:39+0000

You can read the entire file into a single string and then use re.split to get a list of all the entries, separated by their timestamps. Here is an example:

    f = open(...)
    allLines = ''.join(f.readlines())
    entries = re.split(regex, allLines)
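A sketch of what the split could look like, assuming Python 3.7 or later, where re.split also splits on zero-width matches. Splitting on a lookahead rather than on the timestamp itself keeps each timestamp attached to its entry. The names `ENTRY_START` and `split_entries`, the pattern, and the sample text are illustrative assumptions.

```python
import re

# Zero-width split point: the start of any line that begins with a timestamp.
ENTRY_START = re.compile(r'^(?=\[\d{1,2}/\d{1,2}/\d{2} )', re.MULTILINE)

def split_entries(text):
    """Split log text into entries, each flattened onto a single line."""
    pieces = ENTRY_START.split(text)
    # Drop the empty piece before the first timestamp and replace the line
    # breaks inside each entry with spaces.
    return [' '.join(p.splitlines()) for p in pieces if p.strip()]

text = (
    "[8/21/13 11:30:33:557 PDT] first (execution took 5 milliseconds)\n"
    "[8/21/13 11:30:35:000 PDT] select *\n"
    "from item (execution took 7 milliseconds)\n"
)
entries = split_entries(text)
```

Each entry can then be tested against the "contains" and "ends with" patterns individually.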
