How to make an attoparsec parser succeed without consuming input (like parsec's lookAhead)

I wrote a quick attoparsec parser to go through an aspx file and drop all the style attributes, and it works fine except for one piece, where I can't figure out how to make it succeed on matching > without consuming it.

Here is what I have:

 anyTill = manyTill anyChar
 anyBetween start end = start *> anyTill end
 styleWithQuotes = anyBetween (stringCI "style=\"") (stringCI "\"")
 styleWithoutQuotes = anyBetween (stringCI "style=") (stringCI " " <|> ">")
 everythingButStyles = manyTill anyChar (styleWithQuotes <|> styleWithoutQuotes) <|> many1 anyChar

I understand that this is partly because of how I use manyTill in everythingButStyles; that is how I actively drop all the style content on the floor. But in styleWithoutQuotes I need it to match ">" as the end without consuming it. In parsec I would just have used lookAhead ">", but I can't do that in attoparsec.

2 answers

In the meantime, lookAhead has been added to attoparsec, so you can now just use lookAhead (char '>') or lookAhead (string ">") to achieve the goal.
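
As a rough sketch of how that could slot into the question's parser (not the asker's exact code): this assumes Data.Attoparsec.Text with OverloadedStrings, imports lookAhead explicitly from Data.Attoparsec.Combinator, and swaps in asciiCI for stringCI on the assumption that ASCII-only case folding is enough here. valueAndRest is a hypothetical helper that exists only to show what is left unconsumed.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Combinator (lookAhead)
import Data.Attoparsec.Text
  (Parser, anyChar, asciiCI, char, manyTill, parseOnly, takeText)
import Data.Text (Text)

-- Unquoted style attribute: the value ends at a space (which is consumed)
-- or at '>' (which lookAhead matches but leaves in the input).
styleWithoutQuotes :: Parser String
styleWithoutQuotes =
  asciiCI "style=" *> manyTill anyChar (char ' ' <|> lookAhead (char '>'))

-- Hypothetical helper: the attribute value plus whatever input remains.
valueAndRest :: Text -> Either String (String, Text)
valueAndRest = parseOnly ((,) <$> styleWithoutQuotes <*> takeText)
```

With this sketch, valueAndRest "style=red>x" gives back the value "red" with ">x" still unconsumed, while valueAndRest "style=red x" eats the space and leaves only "x".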

What follows is the workaround from the time before lookAhead was introduced.


You can build your non-consuming parser using peekWord8, which just looks at the next byte (if any). Since ByteString has a Monoid instance, Parser ByteString is a MonadPlus, and you can use

 lookGreater = do
   mbw <- peekWord8
   case mbw of
     Just 62 -> return ">"
     _       -> mzero

(62 is the code point of '>') to either find a '>' without consuming it or fail.
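
To see that nothing is consumed, here is a small self-contained sketch around the same lookGreater, assuming Data.Attoparsec.ByteString and OverloadedStrings. valueThenRest is a hypothetical helper, not part of the original answer: it reads up to '>', runs lookGreater, and returns the rest of the input.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (mzero)
import Data.Attoparsec.ByteString
  (Parser, parseOnly, peekWord8, takeByteString, takeTill)
import Data.ByteString (ByteString)

-- Succeeds only if the next byte is '>' (62), without consuming it.
lookGreater :: Parser ByteString
lookGreater = do
  mbw <- peekWord8
  case mbw of
    Just 62 -> return ">"
    _       -> mzero

-- Hypothetical helper: take bytes up to '>', check that '>' is next,
-- then return everything that is still unconsumed.
valueThenRest :: ByteString -> Either String (ByteString, ByteString)
valueThenRest = parseOnly $ do
  value <- takeTill (== 62)
  _ <- lookGreater
  rest <- takeByteString
  return (value, rest)
```

For the input "red>x" the '>' is still at the front of the remaining input (">x"); for "red", with no '>' at all, lookGreater fails and parseOnly returns a Left.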

+4
source
 anyBetween start end = start *> anyTill end 

Your anyBetween parser eats its last character because anyTill does: it is designed to parse up to an end marker, on the assumption that you don't want to keep the closing delimiter in the input to be parsed again.

Notice that your end parsers are all single-character parsers, so we can change the function to make use of this:

 anyBetween'' start ends = start *> many (satisfy (not.flip elem ends)) 

but many isn't as efficient as attoparsec's takeWhile, which you should use as much as possible, so if you've done

 import qualified Data.Attoparsec.Text as A 

then

 anyBetween' start ends = start *> A.takeWhile (not.flip elem ends) 

should do the trick, and we can rewrite

 styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>'] 

If you want it to eat the ' ' but not the '>', you can explicitly eat spaces afterwards:

 styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>'] <* A.takeWhile isSpace 

Switching to more takeWhile

Perhaps styleWithQuotes could also do with a rewrite to use takeWhile, so let's make two helpers along the lines of anyBetween. They take from a starting parser up to an ending character, and come in inclusive and exclusive versions:

 fromUptoExcl startP endChars = startP *> takeTill (flip elem endChars)
 fromUptoIncl startP endChars = startP *> takeTill (flip elem endChars) <* anyChar

But I think, from what you said, you want styleWithoutQuotes to be a hybrid; it eats ' ' but not '>':

 -- skipWhile rather than satisfy: satisfy would make the whole parser fail
 -- when the value ends at a character (such as '>') that is not in eatChars
 fromUptoEat startP endChars eatChars =
   startP *> takeTill (flip elem endChars) <* skipWhile (flip elem eatChars)

(These all assume a small number of characters in your end-character lists, otherwise elem is inefficient; there are Set-based alternatives if you are checking against a large list such as the alphabet.)

Now to rewrite:

 styleWithQuotes' = fromUptoIncl (stringCI "style=\"") "\""
 styleWithoutQuotes' = fromUptoEat (stringCI "style=") " >" " "

The overall parser

everythingButStyles uses <|> in a way that means if it doesn't find "style", it backtracks and then takes everything. This is an example of the sort of thing that can be slow. The problem is that we fail late, at the end of the input string, which is a bad time to decide whether we should fail. Let's try to avoid that.

Idea: take input until we reach an s, then skip the style if there is one.

 notStyleNotEvenS = takeTill (flip elem "sS")
 skipAnyStyle = (styleWithQuotes' <|> styleWithoutQuotes') *> notStyleNotEvenS
            <|> cons <$> anyChar <*> notStyleNotEvenS

The anyChar is usually an s or S, but there's no point checking that again.

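 -- many yields a list of Text chunks, so join them with Data.Text.concat before appending
 noStyles = append <$> notStyleNotEvenS <*> (Data.Text.concat <$> many skipAnyStyle)
 parseNoStyles = parseOnly noStyles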
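
Putting the pieces of this answer together, a self-contained version might look like the sketch below. It rests on several assumptions that are mine rather than the answer's: Data.Attoparsec.Text with OverloadedStrings, asciiCI swapped in for stringCI, Data.Text imported qualified as T, skipWhile used for the "eat" step so that a value ending in '>' still parses, and T.concat used to join the chunks produced by many.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many, (<|>))
import Data.Attoparsec.Text
  (Parser, anyChar, asciiCI, parseOnly, skipWhile, takeTill)
import Data.Text (Text)
import qualified Data.Text as T

-- Start marker, then everything up to an end character, which is consumed too.
fromUptoIncl :: Parser a -> String -> Parser Text
fromUptoIncl startP endChars =
  startP *> takeTill (`elem` endChars) <* anyChar

-- Start marker, then everything up to an end character; afterwards swallow
-- only the characters listed in eatChars (possibly none).
fromUptoEat :: Parser a -> String -> String -> Parser Text
fromUptoEat startP endChars eatChars =
  startP *> takeTill (`elem` endChars) <* skipWhile (`elem` eatChars)

styleWithQuotes', styleWithoutQuotes' :: Parser Text
styleWithQuotes'    = fromUptoIncl (asciiCI "style=\"") "\""
styleWithoutQuotes' = fromUptoEat (asciiCI "style=") " >" " "

-- Everything up to the next 's' or 'S', which might start a style attribute.
notStyleNotEvenS :: Parser Text
notStyleNotEvenS = takeTill (`elem` ("sS" :: String))

-- Either drop a style attribute, or keep one character and carry on.
skipAnyStyle :: Parser Text
skipAnyStyle =
      (styleWithQuotes' <|> styleWithoutQuotes') *> notStyleNotEvenS
  <|> T.cons <$> anyChar <*> notStyleNotEvenS

noStyles :: Parser Text
noStyles = T.append <$> notStyleNotEvenS <*> (T.concat <$> many skipAnyStyle)

parseNoStyles :: Text -> Either String Text
parseNoStyles = parseOnly noStyles
```

With these definitions, parseNoStyles "<b style=bold>x</b>" yields Right "<b >x</b>": the attribute is gone and the '>' survives. The whitespace before a removed attribute is kept, so a quoted style in the middle of a tag leaves a double space behind.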

Source: https://habr.com/ru/post/1443734/

