I have read all the articles I could find, and even understood some of them, but as someone new to Python I'm still a little lost and hoping for help :)
I'm working on a script to parse items of interest from a specific application log file. Each line starts with a timestamp that I can match, and I can define two things to identify what I want to capture: some partial content, and a line that marks the end of what I want to extract.
My problem is multi-line entries. In most cases each log entry sits on a single line ending with a newline, but some entries contain SQL, which can include newlines and therefore spreads the entry over several lines in the log.
So, in a simple case, I can have this:
[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)
That is all on one line, which I can match with this:
re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2}).*(milliseconds)')
However, in some cases there may be line breaks in the SQL, and I still want to capture the whole entry (and possibly replace the line breaks with spaces). I am currently reading the file one line at a time, which obviously won't work for those entries ...
- Do I need to process the entire file in one go? The files are usually around 20 MB. How can I read the entire file and iterate over it, looking for single-line or multi-line blocks?
- How can I write a multi-line regex that matches an entry whether it sits entirely on one line or is spread over several lines?
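For reference, one way to approach both questions is to read the whole file into a string and apply a single regex with the `re.MULTILINE` and `re.DOTALL` flags. The sketch below is only an illustration under some assumptions: the timestamp pattern is a simplified `m/d/yy` check rather than the stricter one above, the message ID is assumed to appear on the first line of the entry, the entry is assumed to end at the first `milliseconds)`, and `ENTRY`, `find_entries` and `sample` are made-up names.

```python
import re

# Assumptions (not from the log format itself): simplified timestamp check,
# message ID on the entry's first line, entry ends at the first "milliseconds)".
ENTRY = re.compile(
    r"^\[\d{1,2}/\d{1,2}/\d{2} "   # timestamp opener; ^ matches at each line start
    r"[^\n]*BMXAA6720W"            # message ID somewhere on that same line
    r".*?milliseconds\)",          # lazily continue, across newlines, to the end marker
    re.MULTILINE | re.DOTALL,
)


def find_entries(text):
    """Return each matching entry with its line breaks replaced by spaces."""
    return [" ".join(m.group(0).splitlines()) for m in ENTRY.finditer(text)]


# Made-up sample data in the shape of the log shown above.
sample = (
    "[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O BMXAA6720W - select *\n"
    "from item\n"
    "where itemnum = 'A' (execution took 2083 milliseconds)\n"
    "[8/21/13 11:30:34:000 PDT] 00000488 SystemOut O some other message\n"
    "[8/21/13 11:30:35:000 PDT] 00000489 SystemOut O BMXAA6720W - select 1 "
    "(execution took 5 milliseconds)\n"
)
entries = find_entries(sample)
```

For a real file, `text` would come from something like `open(logFileName).read()`; a file of about 20 MB should normally fit in memory without trouble. One caveat of this sketch: if a matching entry never reaches `milliseconds)`, the lazy `.*?` keeps going into later entries, so a line-by-line approach may be more robust.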
My overall goal is to parameterize this so that I can use it to retrieve log entries based on different start-of-entry patterns (always at the beginning of a line), an end pattern (where I want to stop capturing), and a value between them that acts as an identifier.
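To make that goal concrete, here is a rough sketch of what such a parameterized extractor could look like when it works line by line. The function name `extract_entries` and its parameters `start_pattern`, `identifier` and `end_pattern` are hypothetical, and the patterns in the usage part are simplified assumptions rather than the exact ones used elsewhere in this post.

```python
import re


def extract_entries(lines, start_pattern, identifier, end_pattern):
    """Yield entries that begin with start_pattern, contain identifier on
    their first line, and end on the first line matching end_pattern.
    Line breaks inside an entry are replaced by spaces."""
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    buffer = None  # None means "not currently inside an entry of interest"
    for line in lines:
        if start_re.match(line):
            # A new entry starts here; an unfinished previous entry is dropped.
            buffer = [line.rstrip("\r\n")] if identifier in line else None
        elif buffer is not None:
            # Continuation line of the entry being collected.
            buffer.append(line.rstrip("\r\n"))
        if buffer is not None and end_re.search(line):
            yield " ".join(buffer)
            buffer = None


# Made-up sample lines; a real call would pass an open file object instead.
sample_lines = [
    "[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O BMXAA6720W - select *\n",
    "from item\n",
    "where itemnum = 'A' (execution took 2083 milliseconds)\n",
    "[8/21/13 11:30:34:000 PDT] 00000488 SystemOut O some other message\n",
    "[8/21/13 11:30:35:000 PDT] 00000489 SystemOut O BMXAA6720W - select 1 (execution took 5 milliseconds)\n",
]
found = list(extract_entries(
    sample_lines,
    start_pattern=r"\[\d{1,2}/\d{1,2}/\d{2} ",
    identifier="BMXAA6720W",
    end_pattern=r"milliseconds",
))
```

Because it is a generator over lines, this never needs the whole file in memory, and changing what gets extracted is just a matter of passing different arguments.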
Thanks in advance for your help!
Chris.
import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
lines = []
print "--- START ----"
lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')

lines = []
with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line):
            if lineEndsWith.match(line) :
                print 'Full Line Found'
                print line
                print "- Record Separator -"
            else:
                print 'Partial Line Found'
                print line
                print "- Record Separator -"

print "--- DONE ----"
As the next step, when I hit a partial line, I will keep reading until I find a line that matches lineEndsWith and collect those lines into one block.
I am not an expert, so suggestions are always welcome!
UPDATE: I got it working, thanks to all the answers that pointed me in the right direction. I know this is ugly, and I need to clean up my if/elif mess and make it more efficient, but IT'S WORKING! Thanks for the help.
import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
print "--- START ----"
lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')

lines = []
multiLine = False
with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line) and lineEndsWith.match(line):
            lines.append(line.replace("\n", " "))
        elif lineStartsWith.match(line) and lineContains.match(line) and not multiLine:
            #Found the start of a multi-line entry
            multiLineString = line
            multiLine = True
        elif multiLine and not lineEndsWith.match(line):
            multiLineString = multiLineString + line
        elif multiLine and lineEndsWith.match(line):
            multiLineString = multiLineString + line
            multiLineString = multiLineString.replace("\n", " ")
            lines.append(multiLineString)
            multiLine = False

for line in lines:
    print line
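One possible way to tidy the if/elif chain is to split the job in two: first group physical lines into logical records (a record starts at a timestamp and runs until the next one), then filter the records. The sketch below is only an illustration of that idea. `TIMESTAMP`, `group_records` and `matching_records` are hypothetical names, the timestamp check is a simplified assumption, and it relies on every record starting with a timestamp, as described above.

```python
import re

# Simplified, assumed timestamp check: "[m/d/yy " at the start of a line.
TIMESTAMP = re.compile(r"\[\d{1,2}/\d{1,2}/\d{2} ")


def group_records(lines):
    """Yield one string per log record; continuation lines (those without a
    leading timestamp) are joined to the record before them with spaces."""
    current = []
    for line in lines:
        if TIMESTAMP.match(line) and current:
            yield " ".join(current)
            current = []
        current.append(line.rstrip("\r\n"))
    if current:
        yield " ".join(current)


def matching_records(lines, required=("BMXAA6720W", "milliseconds")):
    """Return records that start with a timestamp and contain every string
    listed in required."""
    return [
        record
        for record in group_records(lines)
        if TIMESTAMP.match(record) and all(text in record for text in required)
    ]


# Made-up sample lines; a real call would pass an open file object instead.
sample_lines = [
    "[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O BMXAA6720W - select *\n",
    "from item\n",
    "where itemnum = 'A' (execution took 2083 milliseconds)\n",
    "[8/21/13 11:30:34:000 PDT] 00000488 SystemOut O some other message\n",
    "[8/21/13 11:30:35:000 PDT] 00000489 SystemOut O BMXAA6720W - select 1 (execution took 5 milliseconds)\n",
]
records = matching_records(sample_lines)
```

With the grouping separated out, the multiLine flag disappears, and a record that never reaches the end marker is simply filtered out instead of swallowing the lines that follow it.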