Parsing tokens into an abstract syntax tree
Ok take the grammar
Dig ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Int ::= Dig | Dig Int Var ::= a | b | ... z | A | B | C | ... | Z Expr ::= Int | - Expr | + Expr Expr | * Expr Expr | Var | let Var = Expr in Expr
This is a nice easy grammar, because from the first sign you can say what kind of suppression it will be. (If there was something more complex, for example + , going between numbers, or - used to subtract as and also negation, you will need a trick with a list of successes explained in Functional Parsers .)
Suppose you have an example input:
rawinput = "- 6 + 45 let x = - 5 in * xx"
What I understand from the grammar is "(- 6 (+ 45 (let x=-5 in (* xx))))" , and Iโll assume that you have designated it as
tokenised_input' = ["-","6","+","4","5","let","x","=","-","5","in","*","x","x"]
which matches the grammar, but you may have received
tokenised_input = ["-","6","+","45","let","x","=","-","5","in","*","x","x"]
which best suits your AST sample. I find it good practice to name your AST after a bit of your grammar, so I'm going to go and replace
data AST = Leaf Int | Sum AST AST | Min AST | ...
with
data Expr = E_Int Int | E_Neg Expr | E_Sum Expr Expr | E_Prod Expr Expr | E_Var Char | E_Let {letvar::Char,letequal:: Expr,letin::Expr} deriving Show
I called the E_Let bits to make it clearer what they represent.
Writing a parsing function
You can use isDigit by adding import Data.Char (isDigit) to help:
expr :: [String] -> (Expr,[String]) expr [] = error "unexpected end of input" expr (s:ss) | all isDigit s = (E_Int (read s),ss) | s == "-" = let (e,ss') = expr ss in (E_Neg e,ss') | s == "+" = (E_Sum e e',ss'') where (e,ss') = expr ss (e',ss'') = expr ss' -- more cases
Hop! There are too many obstacles hiding the meaning, and we will write the same code for E_Prod and much worse for E_Let . Let it figure it out!
The standard way to deal with this is to write some combinators; instead of the tedious thread of input [String] through our definition, define ways to mess with the output of the parsers (map) and combine several parsers into one (elevator).
Clear code 1: map
First we need to define pmap , our own equivalent of the map function, so that we can make pmap E_Neg (expr1 ss) instead of let (e,ss') = expr1 ss in (E_Neg e,ss')
pmap :: (a -> b) -> ([String] -> (a,[String])) -> ([String] -> (b,[String]))
nonono, I canโt even read it! We need a synonym like:
type Parser a = [String] -> (a,[String]) pmap :: (a -> b) -> Parser a -> Parser b pmap fp = \ss -> let (a,ss') = p ss in (fa,ss')
But actually it would be better if I did
data Parser a = Par [String] -> (a,[String])
so i can do
instance Functor Parser where fmap f (Par p) = Par (pmap fp)
I will leave it for you to understand if you want.
Clear code 2: combining two parsers
We also need to deal with a situation where we have two parsers to run, and we want to combine their results using a function. This is called function cancellation for parsers.
liftP2 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c liftP2 f p1 p2 = \ss0 -> let (a,ss1) = p1 ss0 (b,ss2) = p2 ss1 in (fab,ss2)
or maybe even three parsers:
liftP3 :: (a -> b -> c -> d) -> Parser a -> Parser b -> Parser c -> Parser d
I will let you think about how to do this. In the let statement, you'll need liftP5 to parse sections of the let statement, raising a function that ignores "=" and "in" . You can do
equals_ :: Parser () equals_ [] = error "equals_: expected = but got end of input" equals_ ("=":ss) = ((),ss) equals_ (s:ss) = error $ "equals_: expected = but got "++s
and a couple more to help with this.
Actually, pmap can also be called liftP1 , but the map is the traditional name for this kind of thing.
Rewritten with good combinators
Now we are ready to clear expr :
expr :: [String] -> (Expr,[String]) expr [] = error "unexpected end of input" expr (s:ss) | all isDigit s = (E_Int (read s),ss) | s == "-" = pmap E_Neg expr ss | s == "+" = liftP2 E_Sum expr expr ss -- more cases
Everything will be fine. Indeed, everything is in order. But liftP5 will be a little longer and messy.
Taking cleaning further is an ultra-pleasant applicative way
Applicative functors are the way to go. Remember that I suggested refactoring like
data Parser a = Par [String] -> (a,[String])
so you can make it an instance of Functor ? Maybe you donโt want to, because all you got is the new fmap name for the perfectly working pmap and you have to deal with all of these Par constructors cluttering your code. Perhaps this will make you rethink, though; we can import Control.Applicative , then using the data declaration we can define <*> , which is then and uses <$> instead of pmap , with *> value <*>-but-forget-the-result-of-the-left-hand-side so you write
expr (s:ss) | s == "let" = E_Let <$> var *> equals_ <*> expr <*> in_ *> expr
Which is very similar to a grammar definition, so it's easy to write code that works for the first time. This is how I like to write parsers. Actually, this is how I like to write terrible things. You would only need to define fmap , <*> and pure , all simple and long repeating liftP3 , liftP4 , etc.
Read about applicative functors. They are wonderful.
Tips for creating an applicator Parser: pure does not change the list. <*> is similar to liftP2 , but the function does not come from outside, it is derived from p1 .