Using Haskell Parsec to parse regex in a single pass

Original Description

I'm trying to do some testing with a custom regular expression engine, but I'm tired of writing NFA manually, so I tried to make the parser with little success. Usually, when people analyze regular expressions, they create several intermediate structures that ultimately transform into the final machine. For my simple definition of NFA, I believe that parsing can actually be done in one pass, although I have not yet determined (a) why it really cannot or (b) how to do it, although my parser can analyze very simple statements.

State objects (simplified) are defined as follows [1]:

type Tag = Int data State a = Literal Tag a (State a) | Split (State a) (State a) | OpenGroup Tag (State a) | CloseGroup Tag (State a) | Accept -- end of expression like "abc" | Final -- end of expression like "abc$" 

Tags must contain an instance of Show and Eq, although the last NFA may contain loops. For example, to model an expression

 -- "^a+(b*)c$" 

I can use [2]

 c = Literal 3 'c' $ Final 1 b = OpenGroup 1 $ Literal 2 'b' bs bs = Split b $ CloseGroup 1 c expr = Literal 1 'a' $ Split expr bs 

I created a stack-based functional analyzer for this grammar (without group tags) by porting the C implementation of the Thompson NFA implementation in Haskell, but this requires two passes [3] to build and will require a third to the end in the structure described above.

So, to build this structure through Parsec, I read a post on this site about the recursive construction of a List-like structure and came up with the following:

 import Control.Applicative import Text.Parsec hiding (many, optional, (<|>)) import Text.ExpressionEngine.Types import qualified Text.ExpressionEngine.Types as T type ParserState = Int type ExpParser = Parsec String ParserState type ExpParserS a = ExpParser (T.State a) parseExpression :: String -> T.State Char parseExpression e = case runParser p 1 ee of Left err -> error $ show err Right r -> r where p = p_rec_many p_char $ p_end 1 p_rec_many :: ExpParser (T.State a -> T.State a) -> ExpParserS a -> ExpParserS a p_rec_many pe = many' where many' = p_some <|> e p_some = p <*> many' p_end :: Int -> ExpParserS a p_end n = (Final n <$ char '$') <|> (Accept n <$ eof) step_index :: ExpParser Int step_index = do index <- getState updateState succ return index p_char = do c <- noneOf "^.[$()|*+?{\\" i <- step_index return $ Literal ic 

And this is enough for parsing strings such as "ab" and "abc $" [4].

Problem

The problem arises when I go to the next step: parsing '|' for or applications. How this should work is a line, for example:

 -- "(a|b|c)$" 

should create the following structure:

 final = Final 1 c = Literal 3 'c' final b = Split (Literal 2 'b' final) c a = Split (Literal 1 'a' final) b 

So, this means that the parser that will be building, or the operators must accept the alternative expression that comes after it and pass it to all branches (I do not believe that changing Split to take a list changes anything instead, because each record should still receive the following expression). My attempt:

 p_regex :: T.State Char -> ExpParser (T.State Char) p_regex e = do first <- p_rec_many p_char $ pure e (Split first <$> (char '|' *> p_regex e)) <|> return first 

And the main parser will change to:

 parseExpression :: String -> T.State Char parseExpression e = case runParser p 1 ee of Left err -> error $ show err Right r -> r where p = p_regex <*> p_end 1 

But this does not allow us to introduce verification [5]. I would expect this to be correct, because p_regex must have an inline object (State a) that was added to it, and creating literal chains with p_rec_many also seems to work this way.

Maybe I should use buildExpressionTable? This may help in this particular problem, because I could make ('$' <|> eof) the highest priority. I started trying, but I canโ€™t imagine how I will handle things like star plus and question mark operators, since they all must be referenced by themselves.

(EDIT: I tried again using buildExpressionTable, and it seems to me that it's too simplistic for what I want to do. It cannot handle ordered postfix statements (like "?"), And my plan for creating it is "'$' < |> eof "the highest priority will not work either because it will be attached only to the last parsed term, and not to the entire line. Even if I could do this, the statement will be applied back: it should be the last parsed term and passed to the previous one member, the more I work with it, the more Iโ€™m interested, not a trace is to simply turn the expression string before parsing.)

Question

So what am I doing wrong? I am sure there is a way to do what I am trying to do, I just could not figure it out so far. Thank you for your time.

Footnote

[1] If you want to know exactly what I actually use, it can be found here .

[2] The idea behind Open / CloseGroup tags is to track group matches when NFA starts. The placement in the specified expression may not be entirely correct, but this method will work correctly if the detected CloseGroup tags create a capture group only if the corresponding OpenGroup is found (i.e., in the above example, we will only create a capture if at least one 'b').

All other tag construction is correct, and I tested that this NFA matches strings as expected.

[3] The implementation of Thompson is described here , and my port can be seen here . This is ideal for a subset of the NFA, but in the resulting structure, each subsequent state will be wrapped in Just. This is because I use Nothing to indicate dangling pointers, and a later step will be fixed in the next next step. I could convert this structure to the above by translating all records (states) into records (states), but this will be the third pass. This implementation already requires a first pass to convert the regular expression to postfix notation.

[4] Result

 Literal 1 'a' (Literal 2 'b' (Accept 1)) 

and

 Literal 1 'a' (Literal 2 'b' (Literal 3 'c' (Final 1))) 

respectively.

[5]

 Couldn't match expected type `a0 -> b0' with actual type `ExpParser (T.State Char)' Expected type: T.State Char -> a0 -> b0 Actual type: T.State Char -> ExpParser (T.State Char) In the first argument of `(<*>)', namely `p_regex' In the expression: p_regex <*> p_end 1 
+6
source share
2 answers

So, the question was mainly: given the recursive data structure (defined in the question), how can I create a parser that will build my expression in one pass. My initial attempt was kind of โ€œappliedโ€ in nature. I was able to create a recursive structure if there was no conditional branching. But branching is required to parse the regex, so my approach will not work for or operators.

To solve this, I needed to have some kind of state. A good way to transfer state to a functional language is to use a partially applied function. I already had a base for this in that the signature for p_char higher:

 p_char :: ExpParser (T.State Char -> T.State Char) 

So I need to combine them, these are combinators that combine several functions (T.State Char -> T.State Char) . Thus, with this understanding, the sequence becomes:

 p_many1 :: ExpParser (T.State Char -> T.State Char) -> ExpParser (T.State Char -> T.State Char) p_many1 p = do f <- p (p_many1 p >>= return . (f .)) <|> return f 

Now for the or operator we need something that takes an expression like "a | b | c" and creates a function like:

 \e -> Split (Literal 1 'a' e) (Split (Literal 2 'b' e) (Literal 3 'c' e)) 

To do this, we can use this:

 p_splitBy1 :: ExpParser (T.State Char -> T.State Char) -> Parsec String ParserState Char -> ExpParser (T.State Char -> T.State Char) p_splitBy1 p sep = do f <- p (sep >> p_splitBy1 p sep >>= return . (\f' e -> Split (fe) (f' e))) <|> return f 

And it really creates the structure that I need. Therefore, if someone else encounters a similar problem in the future, perhaps this question / answer may be helpful.

+1
source

You probably are not getting a lot of answers, because this is a huge question that requires reading a huge paper before anyone can think about writing an answer.

Having said that, coincidentally with ugliness, I'm just trying to build the NFA from regexes this week myself .; -)


OK, so the immediate problem is

 Couldn't match expected type `x -> y` with actual type `Parser`. 

In English, this means that somewhere you have a function instead of a parser. A quick glance at your code shows what you wrote

 where p = p_regex <*> p_end 1 

But p_regex takes 1 argument and you did not send it. This is why your code does not check type.


OK, so step back, what is your actual problem? Do you want to parse the regular expression in the NFA, but the document wants you to convert the regular expression to postfix notation, then parse it, then build the NFA?

It seems like it should be possible. When I implemented this, I had parsing and generating NFA in separate steps, but I could check if the parser was working and that the NFA was working separately. But it looks like it should be possible. Parsec allows you to have custom state, so you can use it as a stack to store NFA fragments. (Or explicitly pass it if you want).

If you need a more accurate answer, you may have to trim it to a smaller, more focused question.

+5
source

Source: https://habr.com/ru/post/944758/


All Articles