Incremental but complete parsing with PyParsing?

I use PyParsing to parse fairly large text files with a C-like format (braces, semicolons, and all that).

PyParsing works just fine, but because of the size of my files it is slow and consumes a very large amount of memory.

Because of this, I want to try an incremental parsing approach, in which I would parse the top-level elements of the source file one by one. The pyparsing scanString method seems like an obvious way to do this. However, I want to make sure that there is no invalid/unchecked text in the gaps between the sections matched by scanString, and I cannot find a good way to do this.

Here is a simplified example that shows the problem I am facing:

    sample = """f1(1,2,3);
    f2_no_args( );
    # comment out: foo(4,5,6);
    bar(7,8);
    this should be an error;
    baz(9,10);
    """

    from pyparsing import *

    COMMENT = Suppress('#' + restOfLine)
    SEMI, COMMA, LPAREN, RPAREN = map(Suppress, ';,()')
    ident = Word(alphas, alphanums + "_")
    integer = Word(nums + "+-", nums)
    statement = (ident("fn") + LPAREN
                 + Group(Optional(delimitedList(integer)))("arguments")
                 + RPAREN + SEMI)
    p = statement.ignore(COMMENT)

    for res, start, end in p.scanString(sample):
        print("***** (%d,%d)" % (start, end))
        print(res.dump())

Output:

    ***** (0,10)
    ['f1', ['1', '2', '3']]
    - arguments: ['1', '2', '3']
    - fn: f1
    ***** (11,25)
    ['f2_no_args', []]
    - arguments: []
    - fn: f2_no_args
    ***** (53,62)
    ['bar', ['7', '8']]
    - arguments: ['7', '8']
    - fn: bar
    ***** (88,98)
    ['baz', ['9', '10']]
    - arguments: ['9', '10']
    - fn: baz

The ranges returned by scanString have gaps due to unparsed text between them ((0,10), (11,25), (53,62), (88,98)). Two of these gaps contain only whitespace or comments, which should not cause an error, but one of them (this should be an error;) contains text that I do not want to silently skip.
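To make the problem concrete, here is one way the gap check could be done by hand (a sketch; the scan_with_gap_check helper and the filler parser are my own names, not pyparsing API): re-parse the text between consecutive scanString matches with a parser that accepts only whitespace and comments, and raise on anything else.

```python
from pyparsing import (Word, alphas, alphanums, nums, Suppress, Group,
                       Optional, delimitedList, restOfLine, ZeroOrMore,
                       ParseException)

sample = """f1(1,2,3);
f2_no_args( );
# comment out: foo(4,5,6);
bar(7,8);
this should be an error;
baz(9,10);
"""

COMMENT = Suppress('#' + restOfLine)
SEMI, LPAREN, RPAREN = map(Suppress, ';()')
ident = Word(alphas, alphanums + "_")
integer = Word(nums + "+-", nums)
statement = (ident("fn") + LPAREN
             + Group(Optional(delimitedList(integer)))("arguments")
             + RPAREN + SEMI).ignore(COMMENT)

# a "filler" parser that accepts only whitespace and comments
filler = ZeroOrMore(COMMENT)

def scan_with_gap_check(parser, text):
    """Yield (tokens, start, end) like scanString, but raise ParseException
    if the text between two matches is not just whitespace/comments."""
    last_end = 0
    for tokens, start, end in parser.scanString(text):
        filler.parseString(text[last_end:start], parseAll=True)
        yield tokens, start, end
        last_end = end
    filler.parseString(text[last_end:], parseAll=True)

results = []
try:
    for tokens, start, end in scan_with_gap_check(statement, sample):
        results.append(tokens.fn)
except ParseException:
    pass  # stops at the gap containing "this should be an error;"

print(results)  # f1, f2_no_args and bar parse; the bad gap aborts the scan
```

This works, but it re-scans every gap and silently loses everything after the first bad gap, which is why a built-in way to do this would be preferable.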

Is there a way to use pyparsing to parse a file incrementally, while still verifying that the entire input conforms to the specified parser grammar?

1 answer

I came up with what seems like a pretty decent solution after a brief discussion on the PyParsing users mailing list.

I adapted the ParserElement.parseString method into a parseConsumeString generator that does what I want. This version calls ParserElement._parse followed by ParserElement.preParse repeatedly, consuming one top-level element at a time.

Here is the code that monkey-patches ParserElement with the parseConsumeString method:

    from pyparsing import ParseBaseException, ParserElement

    def parseConsumeString(self, instring, parseAll=True, yieldLoc=False):
        '''Generator version of parseString which does not try to parse the
        whole string at once.

        Should be called with a top-level parser that could parse the entire
        string if called repeatedly on the remaining pieces.

        Instead of:
            ZeroOrMore(TopLevel).parseString(s, ...)
        Use:
            TopLevel.parseConsumeString(s, ...)

        If yieldLoc==True, it will yield a tuple of (tokens, startloc, endloc).
        If False, it will yield only tokens (like parseString).

        If parseAll==True, it will raise an error as soon as a parse error is
        encountered. If False, it will return as soon as a parse error is
        encountered (possibly before yielding any tokens).'''
        if not self.streamlined:
            self.streamline()
            #~ self.saveAsList = True
        for e in self.ignoreExprs:
            e.streamline()
        if not self.keepTabs:
            instring = instring.expandtabs()
        try:
            sloc = loc = 0
            while loc < len(instring):
                # keeping the cache (if in use) across loop iterations wastes
                # memory (we can't backtrack outside of the loop)
                ParserElement.resetCache()
                loc, tokens = self._parse(instring, loc)
                if yieldLoc:
                    yield tokens, sloc, loc
                else:
                    yield tokens
                sloc = loc = self.preParse(instring, loc)
        except ParseBaseException as exc:
            if not parseAll:
                return
            elif ParserElement.verbose_stacktrace:
                raise
            else:
                # catch and re-raise the exception from here, clearing out
                # pyparsing's internal stack trace
                raise exc

    def monkey_patch():
        ParserElement.parseConsumeString = parseConsumeString

Note that I also moved the call to ParserElement.resetCache into each iteration of the loop. Since it is impossible to backtrack out of the loop, there is no need to keep the cache across iterations. This greatly reduces memory consumption when PyParsing's packrat caching feature is in use. In my tests with a 10 MiB input file, peak memory consumption drops from ~6 GiB to ~100 MiB, and parsing runs 15-20% faster.
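For context, the packrat cache referred to above is opt-in: pyparsing memoizes parse attempts only after ParserElement.enablePackrat() has been called, which is why resetCache matters for memory use. A minimal sketch (the toy grammar is just for illustration):

```python
from pyparsing import ParserElement, Word, nums

# enable packrat memoization globally, before any parsing is done;
# this is the cache that resetCache() clears between loop iterations
ParserElement.enablePackrat()

integer = Word(nums)
tokens = integer.parseString("12345")
print(list(tokens))
```

With packrat enabled, repeated parse attempts at the same location are served from the cache, which speeds up heavily backtracking grammars at the cost of memory.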


Source: https://habr.com/ru/post/977283/
