I would like to extract information from the body of the text and be able to request it.
The structure of this body of text will be determined by the grammar (or variant) of the BNF, and the information to be extracted will be indicated at runtime (the query syntax does not matter at the moment).
So the requirements are simple, in fact:
- Get some structured text text
- Download it in form vulnerabilities using grammar to parse it
- Run the query to select some parts of it.
To illustrate the example, suppose we have such a grammar (in a custom BNF format):
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 <id> ::= 15 * digit <hex> ::= 10 * (<digit> | a | b | c | d | e | f) <anything> ::= <digit> | .... (all characters) <match> ::= <id> (" " <hex>)* <nomatch> ::= "." <anything>* <line> ::= (<match> | <nomatch> | "") [<CR>] <LF> <text> ::= <line>+
Why such a text will correspond to:
012345678901234 012345678901234 abcdef0123 Nor the previous line nor this one would match
And then I would like to list all the tags that appear in the rule, for example, using XPath syntax:
match
which will return the list.
This sounds relatively easy, except that I have two big limitations:
- BNF grammar should be read at runtime (from a string / vector, like a structure)
- requests will be read at runtime.
Some notes:
- the grammar should not change often, so the step of "compiling" to create a structure in memory is acceptable (and perhaps necessary to achieve good speed).
- entity speed, bonus points for collecting on the fly from the wanted parts
- bonus points for the ability to have callbacks to eliminate ambiguity (sometimes access to the database may be necessary for the necessary information about the need, for example)
- bonus points for multi-part grammars (in favor of modularity and reuse of grammar elements)
I know, for example, lex / yacc and flex / bison, however they seem to only generate C / C ++ code, which is not what I care for.
Do you know about a reliable library (preferably free and open) that can convert BNF grammar to the parser on the fly and create structured output in memory from text using this parser
EDITOR: I am open to alternatives. The idea at this point was that perhaps regular expressions could allow this prey, however, given the complexity of the grammars involved, this could become ugly quickly, and thus maintaining regular expressions would be a pretty terrifying task. Also, by separating grammars and extraction, I hope I can reuse the same grammar for different extraction needs, rather than with slightly different regular expressions each time.