Is it possible to parse a large file using ANTLR?

Can I instruct ANTLR not to load the entire file into memory? Can it apply the rules one by one and generate the topmost list of nodes sequentially, as the file is read? And is it possible to somehow drop nodes that have already been analyzed?

+6
2 answers

Yes, you can use:

  • UnbufferedCharStream for your character stream (passed to lexer)
  • UnbufferedTokenStream for your token stream (passed to the parser)
    • The implementation of this token stream does not differentiate between token channels, so be sure to use ->skip instead of ->channel(HIDDEN) as the command in lexer rules whose tokens should not reach the parser.
  • Be sure to call setBuildParseTree(false) on your parser, or a giant parse tree will be created for the entire file. (A wiring sketch follows this list.)
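
A minimal sketch of how these pieces fit together, assuming generated classes named MyLexer and MyParser with a start rule file (all hypothetical names). Note that with an unbuffered char stream the tokens need to copy their text out (CommonTokenFactory(true)), since the stream discards characters once they have been consumed:

    import java.io.FileInputStream;

    import org.antlr.v4.runtime.*;

    public class LargeFileDemo {
        public static void main(String[] args) throws Exception {
            // Unbuffered char stream: characters are read from the file
            // on demand rather than loaded up front.
            CharStream input = new UnbufferedCharStream(new FileInputStream(args[0]));

            MyLexer lexer = new MyLexer(input);
            // Copy token text out of the char stream; the unbuffered
            // stream releases characters after they are consumed.
            lexer.setTokenFactory(new CommonTokenFactory(true));

            TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lexer);
            MyParser parser = new MyParser(tokens);

            // No giant tree for the whole file; react to the input with
            // rule actions or a custom listener instead.
            parser.setBuildParseTree(false);
            parser.file();
        }
    }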

Edit, with some additional commentary:

  • I put a fair amount of work into making sure that UnbufferedCharStream and UnbufferedTokenStream operate in the most "sane" manner possible, particularly with regard to the mark, release, seek, and getText methods. My goal was to preserve as much of the functionality of those methods as possible without compromising the stream's ability to release unused memory.
  • ANTLR 4 features unbounded lookahead. If your grammar requires lookahead to EOF to make a decision, then you cannot avoid loading the entire input into memory; you will have to take great care to avoid that situation when writing your grammar.
+11

There is a wiki page on the antlr.org site that addresses your question; I can't seem to find it right now.

In essence, the lexer reads data through a standard InputStream interface, specifically ANTLRInputStream.java. The typical implementation is ANTLRFileStream.java, which preemptively reads the entire input file into memory. Your job is to write your own buffered version, say "ANTLRBufferedFileStream.java", that reads from the source file as needed. Or, simply set a standard BufferedInputStream/FileInputStream as the data source for the ANTLRInputStream.
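
A minimal sketch of the simpler route (the file name from args[0] is a placeholder). One caveat to hedge this with: the stock ANTLRInputStream(InputStream) constructor still accumulates the characters it reads in an internal buffer, so a truly streaming setup needs either the custom buffered class described above or the UnbufferedCharStream approach from the first answer.

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;

    import org.antlr.v4.runtime.ANTLRInputStream;

    public class BufferedSourceDemo {
        public static void main(String[] args) throws Exception {
            // Wrap a standard buffered file stream as the data source.
            // Note: this constructor still reads the stream's contents
            // into an internal character buffer.
            ANTLRInputStream input = new ANTLRInputStream(
                    new BufferedInputStream(new FileInputStream(args[0])));
            // ... pass `input` to your generated lexer as usual
        }
    }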

The one caveat is that Antlr4 may use unbounded lookahead. That is unlikely to be a problem for a reasonably sized buffer in normal operation; it is more likely when the parser tries to recover from an error. Antlr4 allows the error recovery strategy to be tailored, so the problem is manageable (one possible adaptation is sketched below).
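
One possible adaptation, sketched against the parser object from the first answer's example (the start rule file is again a placeholder): installing the runtime's BailErrorStrategy, which gives up at the first syntax error instead of scanning ahead to resynchronize.

    import org.antlr.v4.runtime.BailErrorStrategy;
    import org.antlr.v4.runtime.misc.ParseCancellationException;

    // Abort at the first syntax error rather than attempting recovery,
    // which could otherwise demand deep lookahead into the buffered input.
    parser.setErrorHandler(new BailErrorStrategy());
    try {
        parser.file();
    } catch (ParseCancellationException e) {
        // Report the failure; no further lookahead is performed.
    }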

Additional Information:

Essentially, Antlr implements a pull parser. When you invoke the top-level parser rule, the parser requests tokens from the lexer, which in turn requests character data from the input stream. The parser/lexer interface is implemented by a buffered token stream, nominally BufferedTokenStream.
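
For contrast, a sketch of the default fully buffered pipeline (again with hypothetical MyLexer/MyParser classes and start rule file): the parser still pulls tokens on demand, but ANTLRFileStream and CommonTokenStream keep everything they have seen in memory.

    import org.antlr.v4.runtime.*;

    public class BufferedPipelineDemo {
        public static void main(String[] args) throws Exception {
            CharStream input = new ANTLRFileStream(args[0]); // whole file read up front
            MyLexer lexer = new MyLexer(input);
            // CommonTokenStream is the stock BufferedTokenStream subclass.
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            MyParser parser = new MyParser(tokens);
            parser.file(); // tokens are pulled from the lexer as the rule executes
        }
    }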

A parse tree is little more than a tree structure over these tokens. Well, quite a bit more, but not in terms of data size. Each token is an INT value, typically backed by the fragment of the input data stream that matched the token definition. The lexer itself does not require keeping a full copy of the lexed input character stream in memory, and the token text fragments could be nulled out. The critical memory requirement for the lexer is the lookahead scan over the input character stream, given a buffered file input stream.

Depending on your needs, the in-memory parse tree can be small even with a 100GB+ input file.

To help further, you would need to explain what exactly you are trying to accomplish with Antlr and what defines your minimum critical memory requirement. That would guide which additional strategies to recommend. For example, if the source data lends itself to it, you could use multiple lexer/parser runs, each time selecting a different portion of the source data to process in the lexer. Compared to the file reads and database writes, even with fast disks, the Antlr execution will likely be barely noticeable.

+3
