Extracting specific tags from arbitrary plain text

I want to parse text comments and look for specific tags inside them. The tags I'm looking for look like this:

<name#1234>

Here "name" is a string matching [a-z]+ (taken from a fixed list) and "1234" is a number matching [0-9]+. These tags can occur in a string zero or more times and can be surrounded by arbitrary other text. For example, all of the following lines are valid:

"Hello <foo#56> world!"
"<bar#1>!"
"1 &lt; 2"
"+<baz#99>+<squid#0> and also<baz#99>.\n\nBy the way, maybe <foo#9876>"

The following lines are not valid:

"1 < 2"
"<foo>"
"<bar#>"
"Hello <notinfixedlist#1234>"

The last one is invalid because "notinfixedlist" is not one of the supported names.

I can easily parse this with a simple regular expression, for example (named groups omitted):

<[a-z]+#\d+>

or directly specifying a fixed list:

<(foo|bar|baz|squid)#\d+>

but I would like to use antlr for several reasons:

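  • Invalid input, such as a stray "<" or ">", should be rejected; literal angle brackets have to be written as "&lt;" and "&gt;".
  • Other tag formats (for example "{foo+666}" or "[[@1234]]") may need to be supported as well.
  • ... antlr4 ...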
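
For comparison, here is a minimal Python sketch of the regex baseline with the fixed list, checked against the example lines above. The function names `extract_tags` and `is_valid` are hypothetical, and the validity rule (no "<" or ">" may remain once well-formed tags are removed) is an assumption inferred from the examples rather than something stated outright.

```python
import re

# Fixed list of tag names, taken from the regex above.
TAG_NAMES = ("foo", "bar", "baz", "squid")
TAG_RE = re.compile(r"<(?P<name>%s)#(?P<id>[0-9]+)>" % "|".join(TAG_NAMES))


def extract_tags(text):
    """Return (name, id) pairs for every well-formed tag in text."""
    return [(m.group("name"), int(m.group("id"))) for m in TAG_RE.finditer(text)]


def is_valid(text):
    """Assumed rule: after removing well-formed tags, no '<' or '>' may remain."""
    remainder = TAG_RE.sub("", text)
    return "<" not in remainder and ">" not in remainder
```

This covers the examples, but it has no natural place for additional tag syntaxes or for useful error reporting, which is where a grammar becomes attractive.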

How can I do this with antlr4?

Here is what I have so far:

grammar Tags;

parse 
    : ( tag | text )*
    ;

tag 
    : '<' fixedlist '#' ID '>'
    ;

fixedlist 
    : 'foo' 
    | 'bar' 
    | 'baz' 
    | 'squid';

text 
    : ~('<' | '>')+
    ;

ID
    : [0-9]+
    ;

Is this the right approach?

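ANTLR 4 supports lexer modes, which fit this kind of input well. Modes are only available in lexer grammars, so the grammar has to be split into a separate lexer and parser.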

// File: TagsParser.g4
parser grammar TagsParser ;

options {
    tokenVocab = TagsLexer ;
}

parse   : ( tag | text )* EOF ;
tag     : LANGLE fixedlist GRIDLET ID RANGLE ;
text    : . ;
fixedlist
    : FOO
    | BAR
    | BAZ
    | SQUID
    ;

// File: TagsLexer.g4
lexer grammar TagsLexer ;

LANGLE  : '<' -> pushMode(tag) ;
TEXT    : . ;

mode tag ;
    RANGLE  : '>' -> popMode ;

    FOO     : 'foo' ;
    BAR     : 'bar' ;
    BAZ     : 'baz' ;
    SQUID   : 'squid' ;
    GRIDLET : '#' ;
    ID      : [0-9]+ ;

    NONTAG  : . -> popMode ;

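The text rule matches any single token, so anything that does not form a complete tag is treated as text. To reject such input rather than accept it as text, restrict text to the tokens you want to allow.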
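
To see what the two modes do to the token stream, here is a hand-written Python emulation of the lexer above. It is a sketch, not ANTLR-generated code: the names `tokenize` and `find_tags` are hypothetical, a boolean stands in for the mode stack (enough here, since LANGLE only exists in the default mode), and `find_tags` only imitates the parser's choice between tag and text.

```python
import re

# Rules per mode in grammar order. As in an ANTLR lexer, the longest match
# wins and ties go to the rule listed first. '.' matches any character.
DEFAULT_RULES = [("LANGLE", r"<"), ("TEXT", r".")]
TAG_RULES = [
    ("RANGLE", r">"), ("FOO", r"foo"), ("BAR", r"bar"), ("BAZ", r"baz"),
    ("SQUID", r"squid"), ("GRIDLET", r"#"), ("ID", r"[0-9]+"), ("NONTAG", r"."),
]
COMPILED = {
    False: [(n, re.compile(p, re.DOTALL)) for n, p in DEFAULT_RULES],
    True: [(n, re.compile(p, re.DOTALL)) for n, p in TAG_RULES],
}
NAME_TOKENS = {"FOO", "BAR", "BAZ", "SQUID"}


def tokenize(text):
    """Return (type, lexeme) pairs, switching modes like pushMode/popMode."""
    in_tag, pos, tokens = False, 0, []
    while pos < len(text):
        best = None
        for name, regex in COMPILED[in_tag]:
            m = regex.match(text, pos)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        tokens.append(best)
        pos += len(best[1])
        if best[0] == "LANGLE":
            in_tag = True            # pushMode(tag)
        elif best[0] in ("RANGLE", "NONTAG"):
            in_tag = False           # popMode
    return tokens


def find_tags(text):
    """Collect (name, id) for each LANGLE name GRIDLET ID RANGLE sequence."""
    toks, tags, i = tokenize(text), [], 0
    while i < len(toks):
        kinds = [t[0] for t in toks[i:i + 5]]
        if (len(kinds) == 5 and kinds[0] == "LANGLE" and kinds[1] in NAME_TOKENS
                and kinds[2:] == ["GRIDLET", "ID", "RANGLE"]):
            tags.append((toks[i + 1][1], int(toks[i + 3][1])))
            i += 5
        else:
            i += 1                   # like `text : . ;`, consume one token
    return tags
```

For "Hello <notinfixedlist#1234>" the emulated lexer emits LANGLE, then NONTAG for the "n", which pops back to the default mode, so the rest of the line comes out as TEXT tokens and no tag is found.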

Source: https://habr.com/ru/post/1653066/