Retrieving Information Using BNF Grammars

I would like to extract information from a body of text and be able to query it.

The structure of this body of text is described by a BNF grammar (or a variant of it), and the information to extract will be specified at runtime (the query syntax does not matter for now).

So the requirements are actually simple:

  • Take some structured text
  • Parse it into an in-memory structure (a tree, say) using the grammar
  • Run queries to select some parts of it.

To illustrate, suppose we have the following grammar (in a custom BNF-like format):

 <digit>    ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 <id>       ::= 15 * <digit>
 <hex>      ::= 10 * (<digit> | a | b | c | d | e | f)
 <anything> ::= <digit> | ... (all characters)
 <match>    ::= <id> (" " <hex>)*
 <nomatch>  ::= "." <anything>*
 <line>     ::= (<match> | <nomatch> | "") [<CR>] <LF>
 <text>     ::= <line>+

Such a grammar would match a text like:

 012345678901234
 012345678901234 abcdef0123
 Neither the previous line nor this one would match

Then I would like to list all the <id> elements that appear inside <match> rules, for example using an XPath-like syntax:

 match//id 

which would return the list of matching ids.
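To make the intent concrete, here is a small self-contained C++ sketch of the kind of in-memory structure and query I have in mind. The node layout and the select() helper are my own illustration, not the API of any existing library; the tree is hand-built to stand in for what the parser would produce from the first two sample lines.

  #include <iostream>
  #include <memory>
  #include <string>
  #include <vector>

  // One node of the in-memory parse tree: the rule that produced it, the
  // matched text (for leaves), and its children.
  struct Node {
      std::string rule;
      std::string value;
      std::vector<std::shared_ptr<Node>> kids;
  };

  // Collect the values of every <wanted> node that is a descendant of an
  // <under> node -- the moral equivalent of the XPath-ish query "match//id".
  void select(const Node& n, const std::string& under, const std::string& wanted,
              bool inside, std::vector<std::string>& out) {
      if (inside && n.rule == wanted) out.push_back(n.value);
      const bool now_inside = inside || n.rule == under;
      for (const auto& k : n.kids) select(*k, under, wanted, now_inside, out);
  }

  int main() {
      // Hand-built tree standing in for the parse of the two matching lines.
      auto id1 = std::make_shared<Node>(Node{"id", "012345678901234", {}});
      auto id2 = std::make_shared<Node>(Node{"id", "012345678901234", {}});
      auto hex = std::make_shared<Node>(Node{"hex", "abcdef0123", {}});
      auto m1  = std::make_shared<Node>(Node{"match", "", {id1}});
      auto m2  = std::make_shared<Node>(Node{"match", "", {id2, hex}});
      Node text{"text", "", {m1, m2}};

      std::vector<std::string> ids;
      select(text, "match", "id", false, ids);
      for (const auto& v : ids) std::cout << v << '\n';   // the "match//id" result
  }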


This sounds relatively easy, except that I have two big limitations:

  • the BNF grammar must be read at runtime (from a string / vector-like structure)
  • the queries must be read at runtime as well.

Some notes:

  • the grammar will not change often, so a "compilation" step that builds an in-memory structure is acceptable (and probably necessary to achieve good speed); see the sketch after this list
  • decent speed is wanted; bonus points for collecting the wanted parts on the fly while parsing
  • bonus points for the ability to hook in callbacks to resolve ambiguities (a database lookup may sometimes be needed to obtain the required information, for example)
  • bonus points for multi-part grammars (to favor modularity and reuse of grammar elements)
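For what it is worth, the "compiled" form I picture is nothing more than a map from rule names to expression trees. This is only a sketch of such a structure, my own assumption rather than any particular library's representation:

  #include <map>
  #include <memory>
  #include <string>
  #include <variant>
  #include <vector>

  struct Expr;
  using ExprPtr = std::shared_ptr<Expr>;

  struct Literal  { std::string text; };                // "..." terminals
  struct RuleRef  { std::string name; };                // <rule> references
  struct Sequence { std::vector<ExprPtr> items; };      // a b c
  struct Choice   { std::vector<ExprPtr> options; };    // a | b | c
  struct Repeat   { ExprPtr item; int min; int max; };  // *, +, "15 * <digit>"

  struct Expr { std::variant<Literal, RuleRef, Sequence, Choice, Repeat> node; };

  // The whole grammar: each "<name> ::= ..." rule becomes one map entry.
  using Grammar = std::map<std::string, ExprPtr>;

  int main() {
      Grammar g;
      // <id> ::= 15 * <digit>
      g["id"] = std::make_shared<Expr>(Expr{
          Repeat{std::make_shared<Expr>(Expr{RuleRef{"digit"}}), 15, 15}});
  }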

I know about lex/yacc and flex/bison, for example, but they seem to only generate C/C++ source code, which is not what I am after.

Do you know of a reliable library (preferably free and open source) that can turn a BNF grammar into a parser on the fly, and use that parser to build structured output in memory from the text?

EDIT: I am open to alternatives. The idea at this point was that regular expressions might do the trick; however, given the complexity of the grammars involved, this would get ugly quickly, and maintaining the regular expressions would be a pretty terrifying task. Also, by separating the grammar from the extraction, I hope to reuse the same grammar for different extraction needs, rather than maintaining slightly different regular expressions each time.
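For example, even the tiny grammar above already turns into the following kind of pattern with std::regex. This is just a sketch of that alternative, not something I want to maintain once the real grammars come in:

  #include <iostream>
  #include <regex>
  #include <string>
  #include <vector>

  int main() {
      const std::vector<std::string> lines = {
          "012345678901234",
          "012345678901234 abcdef0123",
          "Neither the previous line nor this one would match",
      };

      // <match> ::= <id> (" " <hex>)*  -- 15 digits, then optional hex groups.
      const std::regex match_rule(R"(^(\d{15})( [0-9a-f]{10})*$)");

      std::smatch m;
      for (const std::string& line : lines)
          if (std::regex_match(line, m, match_rule))
              std::cout << m[1] << '\n';   // the captured <id>
  }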

2 answers

I have my own solution that transforms grammar source text into an in-memory representation. The result is a pure data structure, so any code can make use of it. I also have a C++ class that actually implements the parser; rule handlers are implemented as virtual methods.

The main difference between our solution and yacc/bison is that no C/C++ code is generated, which means the grammar can be reloaded without recompiling the application. The grammar can be annotated with application identifiers that are then used in the rule handler code.
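As a rough illustration of the shape of that design (the names here are made up for the example, not the actual implementation):

  #include <string>
  #include <vector>

  // Output of the grammar "compiler": a plain data structure any code can use.
  struct Rule {
      std::string name;         // e.g. "match"
      std::string annotation;   // application identifier attached in the grammar
      // ... right-hand side stored as an expression tree ...
  };
  using CompiledGrammar = std::vector<Rule>;

  // The parser; clients derive from it and override the rule handlers.
  class Parser {
  public:
      explicit Parser(const CompiledGrammar& g) : grammar_(g) {}
      virtual ~Parser() = default;

      void parse(const std::string& text) {
          (void)text;  // would walk grammar_ over the text, calling onRule()
      }

  protected:
      // Called whenever a rule is recognized; the annotation tells the
      // application-specific handler what to do with the matched text.
      virtual void onRule(const Rule& rule, const std::string& matched) = 0;

  private:
      const CompiledGrammar& grammar_;
  };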


The GOLD Parsing System produces a LALR parse table which, AFAIK, is loaded at run time. I believe it has a parsing engine written in C++, so it should be easy to integrate.

You would read in your grammar, spawn a subprocess to have the GOLD parser generator build the table, and then call your linked-in GOLD parsing engine to load the table and parse the text.
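Roughly like this; note that the builder command line, the file names and the table format below are placeholders I have not verified against the actual GOLD tool or engine API:

  #include <cstdlib>
  #include <fstream>
  #include <string>

  // 1. Write out the grammar that was read at runtime; 2. have the
  // (hypothetical command-line) builder turn it into a parse table; the
  // linked-in engine would then load that table. The engine API is not shown.
  bool build_parse_table(const std::string& grammar_source) {
      std::ofstream("my_grammar.grm") << grammar_source;
      return std::system("gold-builder my_grammar.grm my_grammar.cgt") == 0;
  }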

I do not know how you would attach actions to reductions, which you will probably want to do; I do not have much experience with GOLD. Good luck.

