Why does the Parsec Choice operator seem to depend on the order of the parsers?

I am trying to parse a very simple language consisting of only decimal or binary numbers. For example, here are a few valid inputs:

#b1 #d1 #b0101 #d1234 

I have a problem using the Parsec select statement: <|> . According to the textbook: Write yourself a chart in 48 hours :

The [select statement] attempts to use the first parser, and then, if it does not work, tries to execute the second. If either succeeds, it also returns the value returned by this parser.

But, in my experience, I see that the order of the parsers has raised questions. Here is my program:

 import System.Environment import Text.ParserCombinators.Parsec main :: IO () main = do (x:_) <- getArgs putStrLn ( "Hello, " ++ readExp x) bin :: Parser String bin = do string "#b" x <- many( oneOf "01") return x dec :: Parser String dec = do string "#d" x <- many( oneOf "0123456789") return x -- Why does order matter here? parseExp = (bin <|> dec) readExp :: String -> String readExp input = case parse parseExp "test" input of Left error -> "Error: " ++ show error Right val -> "Found val" ++ show val 

This is how I run the program:

Dependency Installation

 $ cabal sandbox init $ cabal install parsec 

Compilation

 $ cabal exec ghc Main 

Run

 $ ./Main "#d1" Hello, Error: "test" (line 1, column 1): unexpected "d" expecting "#b" $ ./Main "#b1" Hello, Found val"1" 

If I reordered the parsers as follows:

 parseExp = (dec <|> bin) 

then only binary numbers are detected, and the program cannot identify decimal numbers.

When performing the tests that I performed, I see that this problem only occurs when one of the parsers started parsing the input, for example. if a hash symbol # , the bean parser is activated, which ends when there is no next expected character b , not d . There seems to be some kind of retreat that should happen, which I don't know about.

Appreciate the help!

+5
source share
2 answers

Parsec has two kinds of "crashes": there are crashes that consume input, and crashes that don't. To avoid backtracking (and therefore, keeping the input data longer than necessary / usually unfriendly to the garbage collector), (<|>) captures the first parser as soon as it consumes the input; so if his first argument consumes input and fails, his second parser will never get a chance of success. You can explicitly request reverse tracking behavior with try , this way:

 Text.Parsec> parse (string "ab" <|> string "ac") "" "ac" Left (line 1, column 1): unexpected "c" expecting "ab" Text.Parsec> parse (try (string "ab") <|> string "ac") "" "ac" Right "ac" 

Unfortunately, try has some pretty nasty performance penalties, which means that if you want a parser, you have to reorganize your grammar a bit. I would do this using the above parsing like this:

 Text.Parsec> parse (char 'a' >> (("ab" <$ char 'b') <|> ("ac" <$ char 'c'))) "" "ac" Right "ac" 

In your case, you will need to appropriately separate the "#" sign.

+6
source

Read carefully:

[Selection statement] tries to execute the first parser, and then, if it does not work, tries the second. If either succeeds, the value returned by this parser is also returned.

This means that the first parser is executed first, and if it succeeds, the second parser will NOT be checked at all! This means that the first parser has a higher priority, so <|> is generally not commutative.

A simple counterexample can be created using some sequentially executed parsers, for example

 dummy :: Parser Bool dummy = return True <|> return False 

The above is equivalent to return True : since the first is successful, the second is immaterial.


In addition, parsec is designed to be committed at an early stage in the first branch when it successfully used some input. This design sacrifices ideal non-determinism for efficiency. Otherwise, parsec often needs to store in memory the entire file that is being processed, because if a parsing error occurs, you must try a new alternative.

For instance,

 p = (char 'a' >> char 'b' >> ...) <|> (char 'a' >> char 'c' >> ...) 

won't work as you would expect. As soon as 'a' successfully used by the first branch, parsec fixes it. This means that if 'c' follows, then the first branch will fail, but it's too late for the second to be judged. The entire parser just fails.

To solve this problem, you can drop the common prefix, for example

 p = char 'a' >> ( (char 'b' >> ...) <|> (char 'c' >> ...) ) 

or resort to try ,

 p = (try (char 'a' >> char 'b') >> ...) <|> (char 'a' >> char 'c' >> ...) 

try basically tells parsec not to commit the branch until the entire parser in try succeeds. If this is an abuse, it may cause parsec to store most of the file in memory, but the user at least has some control over it. Theoretically, ideal non-determinism can be restored by always adding try to the entire left branch <|> . This, however, is not recommended because it forces the programmer to write an inefficient parser, both because of a memory problem and because it is easy to write a grammar that requires an exponential number of return paths for successful analysis.

+3
source

Source: https://habr.com/ru/post/1233431/


All Articles