We need to analyze several log files and compute some statistics on the log entries found (such as the number of occurrences of certain messages, spikes in occurrences, etc.). The problem is writing a log parser that handles several log formats and lets me add a new log format with very little work.
To keep things simple, I am only looking at logs that look roughly like this:
[11/17/11 14:07:14:030 EST] MyXmlParser E Premature end of file
So each log entry contains a timestamp, an originator (of the log message), a level, and the log message. One important detail is that a message can span more than one line (e.g. a stack trace). Another example of a log entry might be:
17-11-2011 14:07:14 ERROR MyXmlParser - Premature end of file
I am looking for a good way to specify the log format, as well as the most suitable technology for implementing the parser. Regular expressions come to mind, but I think they will struggle with cases such as multi-line messages (e.g. a stack trace).
In fact, writing a parser for even one specific log format does not sound so easy once I consider multi-line messages. How would you go about parsing these files?
Ideally, I could specify something like this as the log format:
[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE
or
%TIMESTAMP %LEVEL %ORIGIN - %MESSAGE
Obviously, I would have to assign the right converter to each field so that its value is parsed correctly (for example, the timestamp).
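To make the idea concrete, here is a rough sketch of the kind of thing I have in mind, not a finished design. It turns a format string into a regular expression with one named group per placeholder. The class name `LogFormat`, the per-field regex map, and the rule that any line not matching the format is a continuation of the previous message are all my own assumptions. Converters are left out: every field comes back as a raw string.

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical sketch: compiles a template such as
// "[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE" into a regex with named groups.
final class LogFormat {
    // Finds placeholders such as %TIMESTAMP in the template.
    private static final Pattern PLACEHOLDER = Pattern.compile("%([A-Z]+)");

    private final Pattern linePattern;
    private final List<String> fields = new ArrayList<>();

    // fieldRegexes lets a format override the regex used for a field,
    // e.g. a timestamp that contains spaces.
    LogFormat(String template, Map<String, String> fieldRegexes) {
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is quoted so that
            // characters like '[' are matched verbatim.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            String name = m.group(1);
            fields.add(name);
            // Assumed defaults: MESSAGE takes the rest of the line,
            // every other field is a single whitespace-free token.
            String fallback = name.equals("MESSAGE") ? ".*" : "\\S+";
            String fieldRegex = fieldRegexes.getOrDefault(name, fallback);
            regex.append("(?<").append(name).append(">")
                 .append(fieldRegex).append(")");
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));
        linePattern = Pattern.compile(regex.toString());
    }

    // Groups lines into entries. A line that does not match the format
    // is treated as a continuation (e.g. a stack trace line) and is
    // appended to the previous entry's MESSAGE.
    List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> entries = new ArrayList<>();
        Map<String, String> current = null;
        for (String line : lines) {
            Matcher m = linePattern.matcher(line);
            if (m.matches()) {
                current = new LinkedHashMap<>();
                for (String field : fields) {
                    current.put(field, m.group(field));
                }
                entries.add(current);
            } else if (current != null) {
                current.merge("MESSAGE", line, (a, b) -> a + "\n" + b);
            }
            // Lines before the first matching entry are silently dropped.
        }
        return entries;
    }
}
```

With this, the first format would be `new LogFormat("[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE", Collections.singletonMap("TIMESTAMP", "[^\\]]+"))`, and the second would use the other template with a stricter timestamp regex such as `\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}`. I am not sure whether the "non-matching line is a continuation" rule is robust enough, or how best to plug in the converters.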
Can someone give me some good ideas on how to implement this in a reliable and modular way (I use Java)?