How to make an Attoparsec parser succeed without consuming input (like parsec's lookAhead)

Question


I wrote a quick attoparsec parser to walk an aspx file and drop all the style attributes. It works fine except for one piece: I can't figure out how to make it succeed on matching > without consuming it.

Here is what I have:

 anyTill = manyTill anyChar
 anyBetween start end = start *> anyTill end

 styleWithQuotes = anyBetween (stringCI "style=\"") (stringCI "\"")
 styleWithoutQuotes = anyBetween (stringCI "style=") (stringCI " " <|> ">")
 everythingButStyles = manyTill anyChar (styleWithQuotes <|> styleWithoutQuotes) <|> many1 anyChar

I understand this is partly because of how I use manyTill in everythingButStyles; that is how I actively drop all the styled stuff on the ground. But in styleWithoutQuotes I need it to match ">" as an end without consuming it. In parsec I would just have used lookAhead ">", but I can't do that in attoparsec.

+4

parsing haskell attoparsec

Jimmy hoffa Nov 02 '12 at 20:10


2 answers

 anyBetween start end = start *> anyTill end

Your anyBetween parser eats its last character because anyTill does. It is designed to parse up to an end marker, on the assumption that you don't want to keep the closing delimiter in the input to parse again.

Notice that your end parsers are all single-character parsers, so we can change the function to take advantage of that:

 anyBetween'' start ends = start *> many (satisfy (not.flip elem ends))

but many is not as efficient as Attoparsec's takeWhile, which you should use as much as possible, so if you've done

 import qualified Data.Attoparsec.Text as A

then

 anyBetween' start ends = start *> A.takeWhile (not.flip elem ends)

should do the trick, and we can rewrite

 styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>']

If you want it to eat the ' ' but not the '>', you can explicitly consume spaces afterwards:

 styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>'] <* A.takeWhile isSpace
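To check that the takeWhile-based version really leaves the '>' in the input, here is a minimal self-contained sketch. It assumes the attoparsec and text packages. The valueThenRest helper is hypothetical and exists only to expose the leftover input, and asciiCI stands in for stringCI, which newer attoparsec releases mark as deprecated.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, asciiCI, parseOnly, takeText)
import qualified Data.Attoparsec.Text as A
import Data.Char (isSpace)
import Data.Text (Text)

-- Same shape as anyBetween' above: run the start parser, then take
-- characters until one of the end characters is next (it is not consumed).
anyBetween' :: Parser a -> [Char] -> Parser Text
anyBetween' start ends = start *> A.takeWhile (not . flip elem ends)

-- Take the unquoted value, then eat any whitespace that follows it.
styleWithoutQuotes :: Parser Text
styleWithoutQuotes =
  anyBetween' (asciiCI "style=") [' ', '>'] <* A.takeWhile isSpace

-- Hypothetical helper: the attribute value plus whatever input is left over.
valueThenRest :: Parser (Text, Text)
valueThenRest = (,) <$> styleWithoutQuotes <*> takeText

main :: IO ()
main = do
  print (parseOnly valueThenRest "style=bold>text")
  print (parseOnly valueThenRest "STYLE=bold id=x>")
```

The first input should give the value "bold" with ">text" left over, and the second should give "bold" with "id=x>" left over: the space is eaten, the '>' is not.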

Switch to more `takeWhile`

Perhaps styleWithQuotes could do with a rewrite to use takeWhile as well, so let's make two helpers along the lines of anyBetween. They parse from a starting parser up to an ending character, and there are inclusive and exclusive versions:

 fromUptoExcl startP endChars = startP *> takeTill (flip elem endChars)
 fromUptoIncl startP endChars = startP *> takeTill (flip elem endChars) <* anyChar

But I think, from what you said, you want styleWithoutQuotes to be a hybrid; it eats ' ' but not '>':

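 -- skipWhile eats any eatChars that follow; it also succeeds when none do, so a '>' right after the value stays in the input
 fromUptoEat startP endChars eatChars = startP *> takeTill (flip elem endChars) <* skipWhile (flip elem eatChars)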

(All of these assume a small number of characters in your end character lists, otherwise elem isn't efficient; there are Set-based alternatives if you're checking against a big list such as the alphabet.)
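As a rough illustration of the Set option, the sketch below swaps the elem test for Set.member from the containers package. The names fromUptoExclSet and letters are hypothetical and not part of the code above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, parseOnly, string, takeTill)
import qualified Data.Set as Set
import Data.Text (Text)

-- Hypothetical variant of fromUptoExcl: the end characters live in a Set,
-- so each membership test is logarithmic rather than linear in their number.
fromUptoExclSet :: Parser a -> Set.Set Char -> Parser Text
fromUptoExclSet startP endChars = startP *> takeTill (`Set.member` endChars)

-- A larger end set where a list-based elem would start to hurt.
letters :: Set.Set Char
letters = Set.fromList (['a' .. 'z'] ++ ['A' .. 'Z'])

main :: IO ()
main = print (parseOnly (fromUptoExclSet (string "#") letters) "#123abc")
```

Here the parser should return "123" and stop in front of the first letter.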

Now to rewrite:

 styleWithQuotes' = fromUptoIncl (stringCI "style=\"") "\""
 styleWithoutQuotes' = fromUptoEat (stringCI "style=") " >" " "

Generic Parser

everythingButStyles uses <|> in a way that means that if it doesn't find "style", it backtracks and then takes everything. This is an example of the sort of thing that can be slow. The problem is that we fail late, at the end of the input string, which is a bad time to decide whether we should fail. Let's go all out and try to

Fail straight away if we're going to fail.
Maximize the use of the faster parsers from Data.Attoparsec.Text.Internal

Idea: take text until we hit an s, then skip the style if there is one.

 notStyleNotEvenS = takeTill (flip elem "sS")
 skipAnyStyle = (styleWithQuotes' <|> styleWithoutQuotes') *> notStyleNotEvenS
            <|> cons <$> anyChar <*> notStyleNotEvenS

The anyChar is usually an s or S, but there's no point checking that again.

 -- many yields a list of Text chunks, so join them with mconcat before appending
 noStyles = append <$> notStyleNotEvenS <*> (mconcat <$> many skipAnyStyle)
 parseNoStyles = parseOnly noStyles
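Putting the pieces of this answer together, the following is a self-contained sketch of the overall parser, not a definitive implementation. It makes three assumptions that are not in the snippets above: asciiCI is used instead of stringCI (deprecated in newer attoparsec releases), fromUptoEat uses skipWhile so that a value followed directly by '>' still matches, and the chunks produced by many are joined with mconcat so that the types line up.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many, (<|>))
import Data.Attoparsec.Text
  (Parser, anyChar, asciiCI, parseOnly, skipWhile, takeTill)
import Data.Text (Text, append, cons)

-- Start parser, then everything up to and including the first end character.
fromUptoIncl :: Parser a -> [Char] -> Parser Text
fromUptoIncl startP endChars =
  startP *> takeTill (`elem` endChars) <* anyChar

-- Start parser, then everything up to an end character, then skip any run
-- of eatChars. Assumption: skipWhile, so a '>' directly after the value
-- is left in the input and the parser still succeeds.
fromUptoEat :: Parser a -> [Char] -> [Char] -> Parser Text
fromUptoEat startP endChars eatChars =
  startP *> takeTill (`elem` endChars) <* skipWhile (`elem` eatChars)

styleWithQuotes', styleWithoutQuotes' :: Parser Text
styleWithQuotes'    = fromUptoIncl (asciiCI "style=\"") "\""
styleWithoutQuotes' = fromUptoEat (asciiCI "style=") " >" " "

-- Everything up to the next 's' or 'S', the only place a style can start.
notStyleNotEvenS :: Parser Text
notStyleNotEvenS = takeTill (`elem` ("sS" :: String))

-- Either drop a style attribute, or keep one character, then carry on.
skipAnyStyle :: Parser Text
skipAnyStyle =
      (styleWithQuotes' <|> styleWithoutQuotes') *> notStyleNotEvenS
  <|> cons <$> anyChar <*> notStyleNotEvenS

noStyles :: Parser Text
noStyles = append <$> notStyleNotEvenS <*> (mconcat <$> many skipAnyStyle)

parseNoStyles :: Text -> Either String Text
parseNoStyles = parseOnly noStyles

main :: IO ()
main = mapM_ (print . parseNoStyles)
  [ "<p style=\"color:red\">Hi</p>"
  , "<a style=bold href=x>"
  ]
```

With these definitions the first input should come back as "<p >Hi</p>" and the second as "<a href=x>". The quoted variant leaves the space that preceded the attribute, because nothing after the closing quote is eaten.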

+5

AndrewC Nov 02 '12 at 23:57


Daniel Fischer · Accepted Answer · 2012-11-02T23:08:42+0000

In the meantime, lookAhead has been added to attoparsec, so now you can use lookAhead (char '>') or lookAhead (string ">") to achieve the goal.

What follows is the workaround from before that addition.

You can build your non-consuming parser using peekWord8, which just looks at the next byte (if any). Since ByteString has a Monoid instance, Parser ByteString is a MonadPlus, and you can use
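As a minimal sketch of how lookAhead could slot into the question's styleWithoutQuotes, assuming an attoparsec version recent enough to export lookAhead from Data.Attoparsec.Combinator: the names endOfValue and valueThenRest are hypothetical, and asciiCI stands in for stringCI.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Combinator (lookAhead, manyTill)
import Data.Attoparsec.Text
  (Parser, anyChar, asciiCI, parseOnly, string, takeText)
import Data.Text (Text)

-- End marker: consume a space, or merely peek at a '>' without consuming it.
endOfValue :: Parser Text
endOfValue = string " " <|> lookAhead (string ">")

styleWithoutQuotes :: Parser String
styleWithoutQuotes = asciiCI "style=" *> manyTill anyChar endOfValue

-- Hypothetical helper: the attribute value plus whatever input is left over.
valueThenRest :: Parser (String, Text)
valueThenRest = (,) <$> styleWithoutQuotes <*> takeText

main :: IO ()
main = print (parseOnly valueThenRest "style=bold>x")
```

The value should come back as "bold" with ">x" still in the input, which is the behaviour parsec's lookAhead would give.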

 lookGreater = do
     mbw <- peekWord8
     case mbw of
       Just 62 -> return ">"
       _       -> mzero

(62 is the code point of '>') to either find a '>' without consuming it, or fail.
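For completeness, here is a small runnable sketch of that workaround in context, assuming the attoparsec and bytestring packages. The valueThenRest helper is hypothetical and only shows that the '>' is still there after lookGreater has run.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (mzero)
import Data.Attoparsec.ByteString
  (Parser, parseOnly, peekWord8, takeByteString)
import qualified Data.Attoparsec.ByteString as A
import Data.ByteString (ByteString)

-- Succeed only if the next byte is '>' (62); never consume it.
lookGreater :: Parser ByteString
lookGreater = do
  mbw <- peekWord8
  case mbw of
    Just 62 -> return ">"
    _       -> mzero

-- Hypothetical helper: bytes up to '>', a check that '>' follows, then the
-- remaining input, which still starts with that '>'.
valueThenRest :: Parser (ByteString, ByteString)
valueThenRest =
  (,) <$> (A.takeWhile (/= 62) <* lookGreater) <*> takeByteString

main :: IO ()
main = do
  print (parseOnly valueThenRest "bold>x")
  print (parseOnly valueThenRest "bold")
```

The first input should yield "bold" with ">x" left over. The second should fail, because peekWord8 sees the end of the input rather than a '>'.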
