Parsec: "combinator 'many' is applied to a parser that accepts an empty string" error

Question


I am trying to write a parser using Parsec that will parse literate Haskell files, such as:

    The classic 'Hello, world' program.
    \begin{code}
    main = putStrLn "Hello, world"
    \end{code}
    More text.

I wrote the following, based on the examples in Real World Haskell (RWH):

    import Text.ParserCombinators.Parsec

    main = do
        contents <- readFile "hello.lhs"
        let results = parseLiterate contents
        print results

    data Element = Text String | Haskell String
        deriving (Show)

    parseLiterate :: String -> Either ParseError [Element]
    parseLiterate input = parse literateFile "(unknown)" input

    literateFile = many codeOrProse

    codeOrProse = code <|> prose

    code = do
        eol
        string "\\begin{code}"
        eol
        content <- many anyChar
        eol
        string "\\end{code}"
        eol
        return $ Haskell content

    prose = do
        content <- many anyChar
        return $ Text content

    eol =   try (string "\n\r")
        <|> try (string "\r\n")
        <|> string "\n"
        <|> string "\r"
        <?> "end of line"

I expected this to produce something like:

 [Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]

(including spaces, etc.).

This compiles fine, but when I run it, I get an error:

 *** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string

Can someone shed some light on this and maybe help with a solution?

+6

haskell parsec

stusmith Oct 13 '11 at 12:16


3 answers

I have not tested it, but:

many anyChar can match an empty string
Therefore, prose may match an empty string
Therefore codeOrProse may match an empty string
Consequently, literateFile (many codeOrProse) could loop forever, matching the empty string infinitely many times
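The chain of reasoning above can be reproduced in a few lines. This is a minimal sketch, assuming a Parsec version that performs this check at run time (as the exception in the question suggests); the names emptyOk and loopy are made up for illustration and do not come from the question.

```haskell
import Control.Exception (ErrorCall, evaluate, try)
import Text.ParserCombinators.Parsec

-- Succeeds on any input, including the empty string.
emptyOk :: Parser String
emptyOk = many anyChar

-- 'many' applied to a parser that can succeed without consuming input.
-- Parsec refuses to loop here and calls 'error' instead.
loopy :: Parser [String]
loopy = many emptyOk

main :: IO ()
main = do
  -- Force the parse result inside 'try' so the exception can be caught.
  outcome <- try (evaluate (parse loopy "" "abc"))
  case outcome of
    Left err     -> putStrLn ("exception: " ++ show (err :: ErrorCall))
    Right result -> print result
```

The first run of emptyOk consumes "abc"; the second run succeeds on the remaining empty input, which is the situation the error message describes.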

Changing prose to use many1 anyChar, so that it must consume at least one character, can fix this problem.

(I am not very familiar with Parsec, but how would prose know how many characters it should match? It might consume the entire input, never giving the code parser a chance to look for the start of a new code segment. Alternatively, it might match only one character per call, making many / many1 useless.)
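A small sketch of what the many1 change does, assuming prose is reduced to returning a plain String (the names proseOnly and proseList are hypothetical). It shows that the exception goes away, and also that the concern raised above is real: many1 anyChar swallows the whole input.

```haskell
import Text.ParserCombinators.Parsec

-- Must consume at least one character, so it fails (without consuming)
-- on empty input instead of succeeding with "".
proseOnly :: Parser String
proseOnly = many1 anyChar

-- Safe to repeat: each successful iteration consumes input.
proseList :: Parser [String]
proseList = many proseOnly

main :: IO ()
main = print (parse proseList "" "prose \\begin{code} x")
```

The result is a single element holding the entire input, including the \begin{code} tag, so a code parser tried afterwards would never see it.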

+5

sth Oct 13 '11 at 12:29


For reference, here is another version that I came up with (slightly expanded to handle other cases):

    import Text.ParserCombinators.Parsec

    main = do
        contents <- readFile "test.tex"
        let results = parseLiterate contents
        print results

    data Element = Text String | Haskell String | Section String
        deriving (Show)

    parseLiterate :: String -> Either ParseError [Element]
    parseLiterate input = parse literateFile "(unknown)" input

    literateFile = do
        es <- many elements
        eof
        return es

    elements = try section
           <|> try quotedBackslash
           <|> try code
           <|> prose

    code = do
        string "\\begin{code}"
        c <- anyChar `manyTill` try (string "\\end{code}")
        return $ Haskell c

    quotedBackslash = do
        string "\\\\"
        return $ Text "\\\\"

    prose = do
        t <- many1 (noneOf "\\")
        return $ Text t

    section = do
        string "\\section{"
        content <- many1 (noneOf "}")
        char '}'
        return $ Section content

0

stusmith Oct 14 '11 at 20:20


bzn · Accepted Answer · 2011-10-13T12:44:28+0000

As sth pointed out, many anyChar is the problem. But not only in prose, also in code. The problem with code is that content <- many anyChar will consume everything: the newlines and the \end{code} tag.
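To illustrate why the greedy version of code can never succeed, here is a minimal sketch comparing it with a bounded alternative. The names greedyBody, boundedBody and run are hypothetical, and manyTill is just one standard Parsec combinator that bounds the match; it is shown as an assumption about a possible fix, not as the only one.

```haskell
import Text.ParserCombinators.Parsec

-- Greedy: 'many anyChar' runs to the end of the input, so by the time
-- 'string' is tried there is nothing left to match the closing tag.
greedyBody :: Parser String
greedyBody = do
  body <- many anyChar
  _ <- string "\\end{code}"
  return body

-- Bounded: stop as soon as the closing tag parses. 'try' makes a failed
-- tag attempt backtrack so 'anyChar' can take the next character instead.
boundedBody :: Parser String
boundedBody = manyTill anyChar (try (string "\\end{code}"))

-- Hypothetical helper that hides the ParseError details.
run :: Parser String -> String -> Maybe String
run p input = either (const Nothing) Just (parse p "" input)

main :: IO ()
main = do
  let input = "main = return ()\n\\end{code}"
  print (run greedyBody input)   -- Nothing
  print (run boundedBody input)  -- Just "main = return ()\n"
```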

So you need some way to tell prose and code apart. An easy (but perhaps too naive) way to do this is to look for a backslash:

    literateFile = many codeOrProse <* eof

    code = do
        string "\\begin{code}"
        content <- many $ noneOf "\\"
        string "\\end{code}"
        return $ Haskell content

    prose = do
        content <- many1 $ noneOf "\\"
        return $ Text content

Now you don't quite get the desired result, because the Haskell part will also contain newlines, but you can filter these out quite easily (given a function filterNewlines, you could write `content <- filterNewlines <$> (many $ noneOf "\\")`).
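The answer leaves filterNewlines undefined. The sketch below assumes the simplest reading, a function that drops every carriage return and line feed; that definition is a guess for illustration, not something given in the answer. The rest follows the backslash-based parser above, written with fmap and do-notation so that no extra imports are needed.

```haskell
import Text.ParserCombinators.Parsec

data Element = Text String | Haskell String
  deriving (Show, Eq)

-- Hypothetical definition: remove all CR and LF characters.
filterNewlines :: String -> String
filterNewlines = filter (`notElem` "\r\n")

code :: Parser Element
code = do
  _ <- string "\\begin{code}"
  content <- fmap filterNewlines (many (noneOf "\\"))
  _ <- string "\\end{code}"
  return (Haskell content)

-- many1 ensures prose never succeeds on empty input.
prose :: Parser Element
prose = fmap Text (many1 (noneOf "\\"))

literateFile :: Parser [Element]
literateFile = do
  elements <- many (code <|> prose)
  eof
  return elements

main :: IO ()
main = print (parse literateFile ""
  "Intro\n\\begin{code}\nmain = return ()\n\\end{code}\nOutro")
```

With this input the code element comes out as Haskell "main = return ()", while the surrounding Text elements keep their newlines. As the answer warns, the approach is naive: any backslash inside the Haskell code (a lambda, an escape in a string literal) ends the code block early and makes the parse fail.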

Edit

Ok, I think I found a solution (requires a newer version of Parsec, because of lookAhead):

    import Text.ParserCombinators.Parsec
    import Control.Applicative hiding (many, (<|>))

    main = do
        contents <- readFile "hello.lhs"
        let results = parseLiterate contents
        print results

    data Element = Text String | Haskell String
        deriving (Show)

    parseLiterate :: String -> Either ParseError [Element]
    parseLiterate input = parse literateFile "" input

    literateFile = many codeOrProse

    codeOrProse = code <|> prose

    code = do
        string "\\begin{code}\n"
        c <- untilP (string "\\end{code}\n")
        string "\\end{code}\n"
        return $ Haskell c

    prose = do
        t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
        return $ Text t

    untilP p = do
        s <- many $ noneOf "\n"
        newline
        s' <- try (lookAhead p >> return "") <|> untilP p
        return $ s ++ s'

untilP p parses one line, then checks whether the beginning of the next line can be parsed successfully by p. If so, it returns the empty string; otherwise it continues with the next line. lookAhead is required because otherwise the begin/end tags would be consumed and code would not be able to recognize them.
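The role of lookAhead can be seen in isolation. In this sketch (the names endTag, peekThenConsume and consumeTwice are made up for illustration) the first parser peeks at the tag and can still parse it afterwards, while the second one uses the tag up on the first attempt.

```haskell
import Text.ParserCombinators.Parsec

endTag :: Parser String
endTag = string "\\end{code}\n"

-- 'lookAhead' runs the parser but leaves the input untouched on success,
-- so the same tag is still there to be parsed for real.
peekThenConsume :: Parser (String, String)
peekThenConsume = do
  peeked <- lookAhead endTag
  consumed <- endTag
  return (peeked, consumed)

-- Without 'lookAhead' the first parse consumes the tag and the second fails.
consumeTwice :: Parser (String, String)
consumeTwice = do
  first <- endTag
  second <- endTag
  return (first, second)

main :: IO ()
main = do
  let input = "\\end{code}\n"
  print (either (const Nothing) Just (parse peekThenConsume "" input))
  print (either (const Nothing) Just (parse consumeTwice "" input))
```

The first line printed is a Just holding the tag twice; the second is Nothing.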

I suppose that it could still be made more concise (for example, by not having to repeat string "\\end{code}\n" inside code).
