Original Description
I'm trying to do some testing with a custom regular expression engine, but I'm tired of writing NFA manually, so I tried to make the parser with little success. Usually, when people analyze regular expressions, they create several intermediate structures that ultimately transform into the final machine. For my simple definition of NFA, I believe that parsing can actually be done in one pass, although I have not yet determined (a) why it really cannot or (b) how to do it, although my parser can analyze very simple statements.
State objects (simplified) are defined as follows [1]:
type Tag = Int data State a = Literal Tag a (State a) | Split (State a) (State a) | OpenGroup Tag (State a) | CloseGroup Tag (State a) | Accept -- end of expression like "abc" | Final -- end of expression like "abc$"
Tags must contain an instance of Show and Eq, although the last NFA may contain loops. For example, to model an expression
I can use [2]
c = Literal 3 'c' $ Final 1 b = OpenGroup 1 $ Literal 2 'b' bs bs = Split b $ CloseGroup 1 c expr = Literal 1 'a' $ Split expr bs
I created a stack-based functional analyzer for this grammar (without group tags) by porting the C implementation of the Thompson NFA implementation in Haskell, but this requires two passes [3] to build and will require a third to the end in the structure described above.
So, to build this structure through Parsec, I read a post on this site about the recursive construction of a List-like structure and came up with the following:
import Control.Applicative import Text.Parsec hiding (many, optional, (<|>)) import Text.ExpressionEngine.Types import qualified Text.ExpressionEngine.Types as T type ParserState = Int type ExpParser = Parsec String ParserState type ExpParserS a = ExpParser (T.State a) parseExpression :: String -> T.State Char parseExpression e = case runParser p 1 ee of Left err -> error $ show err Right r -> r where p = p_rec_many p_char $ p_end 1 p_rec_many :: ExpParser (T.State a -> T.State a) -> ExpParserS a -> ExpParserS a p_rec_many pe = many' where many' = p_some <|> e p_some = p <*> many' p_end :: Int -> ExpParserS a p_end n = (Final n <$ char '$') <|> (Accept n <$ eof) step_index :: ExpParser Int step_index = do index <- getState updateState succ return index p_char = do c <- noneOf "^.[$()|*+?{\\" i <- step_index return $ Literal ic
And this is enough for parsing strings such as "ab" and "abc $" [4].
Problem
The problem arises when I go to the next step: parsing '|' for or applications. How this should work is a line, for example:
should create the following structure:
final = Final 1 c = Literal 3 'c' final b = Split (Literal 2 'b' final) c a = Split (Literal 1 'a' final) b
So, this means that the parser that will be building, or the operators must accept the alternative expression that comes after it and pass it to all branches (I do not believe that changing Split to take a list changes anything instead, because each record should still receive the following expression). My attempt:
p_regex :: T.State Char -> ExpParser (T.State Char) p_regex e = do first <- p_rec_many p_char $ pure e (Split first <$> (char '|' *> p_regex e)) <|> return first
And the main parser will change to:
parseExpression :: String -> T.State Char parseExpression e = case runParser p 1 ee of Left err -> error $ show err Right r -> r where p = p_regex <*> p_end 1
But this does not allow us to introduce verification [5]. I would expect this to be correct, because p_regex must have an inline object (State a) that was added to it, and creating literal chains with p_rec_many also seems to work this way.
Maybe I should use buildExpressionTable? This may help in this particular problem, because I could make ('$' <|> eof) the highest priority. I started trying, but I canโt imagine how I will handle things like star plus and question mark operators, since they all must be referenced by themselves.
(EDIT: I tried again using buildExpressionTable, and it seems to me that it's too simplistic for what I want to do. It cannot handle ordered postfix statements (like "?"), And my plan for creating it is "'$' < |> eof "the highest priority will not work either because it will be attached only to the last parsed term, and not to the entire line. Even if I could do this, the statement will be applied back: it should be the last parsed term and passed to the previous one member, the more I work with it, the more Iโm interested, not a trace is to simply turn the expression string before parsing.)
Question
So what am I doing wrong? I am sure there is a way to do what I am trying to do, I just could not figure it out so far. Thank you for your time.
Footnote
[1] If you want to know exactly what I actually use, it can be found here .
[2] The idea behind Open / CloseGroup tags is to track group matches when NFA starts. The placement in the specified expression may not be entirely correct, but this method will work correctly if the detected CloseGroup tags create a capture group only if the corresponding OpenGroup is found (i.e., in the above example, we will only create a capture if at least one 'b').
All other tag construction is correct, and I tested that this NFA matches strings as expected.
[3] The implementation of Thompson is described here , and my port can be seen here . This is ideal for a subset of the NFA, but in the resulting structure, each subsequent state will be wrapped in Just. This is because I use Nothing to indicate dangling pointers, and a later step will be fixed in the next next step. I could convert this structure to the above by translating all records (states) into records (states), but this will be the third pass. This implementation already requires a first pass to convert the regular expression to postfix notation.
[4] Result
Literal 1 'a' (Literal 2 'b' (Accept 1))
and
Literal 1 'a' (Literal 2 'b' (Literal 3 'c' (Final 1)))
respectively.
[5]
Couldn't match expected type `a0 -> b0' with actual type `ExpParser (T.State Char)' Expected type: T.State Char -> a0 -> b0 Actual type: T.State Char -> ExpParser (T.State Char) In the first argument of `(<*>)', namely `p_regex' In the expression: p_regex <*> p_end 1