How to find and replace Unicode characters in Haskell?

Question

I have a Unicode file containing a wiki article in MediaWiki markup. I want to strip its markup. In some cases, I want to extract text from markup tags, such as link names from hyperlinks (for example, a simplified wikiextractor).

My approach is to run a set of regular expressions over the file to remove the markup. For the link example, I need to replace [[link]] with link. This works well with a regex, as long as the text does not contain non-ASCII characters such as ö.

An example of what I tried:

ghci> :m +Text.Regex
ghci> subRegex (mkRegex "\\[\\[([() a-zA-Z]*)\\]\\]") "Se mer om [[Stockholm]]" "\\1"
"Se mer om Stockholm"
ghci> subRegex (mkRegex "\\[\\[([() a-zA-Z]*)\\]\\]") "Se mer om [[Göteborg]]" "\\1"
"Se mer om [[G\246teborg]]"

Why is this not working? How do I make the regex engine understand that ö is a perfectly normal letter (at least in Swedish)?

Edit: The problem does not seem to be in the pattern, but in the engine. If I allow every character except q in the link text, I would expect ö to be allowed. But it is not:

ghci> subRegex (mkRegex "\\[\\[([^q]*)\\]\\]") "[[Goteborg]]" "\\1"
"Goteborg"
ghci> subRegex (mkRegex "\\[\\[([^q]*)\\]\\]") "[[Göteborg]]" "\\1"
"[[G\246teborg]]"
ghci> subRegex (mkRegex "ö") "ö" "q"
"q"
ghci> subRegex (mkRegex "[ö]") "ö" "q"
"\246"

The problem seems to arise specifically when using character classes. A literal ö on its own matches fine.
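
To be clear about the output: the \246 is only how GHCi displays ö. GHCi prints a String with show, which escapes non-ASCII characters as decimal code points (ö is U+00F6, i.e. 246), so a result like "[[G\246teborg]]" means the input came back unchanged. A minimal sketch to check this using only base; the names shownForm and codePoints are made up for illustration:

```haskell
import Data.Char (ord)

-- What GHCi displays for a String result: its `show` form, in which
-- non-ASCII characters appear as decimal escapes such as \246.
-- (shownForm is a hypothetical helper name.)
shownForm :: String -> String
shownForm = show

-- The Unicode code point of every character in the string.
-- (codePoints is a hypothetical helper name.)
codePoints :: String -> [Int]
codePoints = map ord
```

Printing a result with putStrLn instead of relying on show should display the character itself, provided the terminal encoding supports it.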

regex unicode haskell

Ludvigh Jul 12 '17 at 21:13

1 answer

Ludvigh · Answer 1 · 2017-07-13T10:04:52+0000

For now, I have decided to go with Text.Regex.PCRE.Heavy, as suggested in this SO answer written by the author. This solves my problem.

So the solution becomes:

GHCi, version 7.10.3: http://www.haskell.org/ghc/  :? for help
Prelude> :m Text.Regex.PCRE.Heavy
Prelude Text.Regex.PCRE.Heavy> :set -XFlexibleContexts
Prelude Text.Regex.PCRE.Heavy> :set -XQuasiQuotes
Prelude Text.Regex.PCRE.Heavy> gsub [re|\[\[([^\]]*)\]\]|] (\(firstMatch:_) -> firstMatch :: String) "[[Göteborg]]" :: String
"G\246teborg"

Unfortunately, I still don't know why the POSIX backend cannot handle this, but the PCRE backend can.
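
If a regex backend is not essential, the link case can also be handled without any regex library. The following is a sketch of that alternative, not the PCRE approach above: a hand-written scanner over String that uses only base, so each Char is a full Unicode code point and ö needs no special treatment. The names stripLinks and splitAtClose are hypothetical, and unlike the pattern above the scanner takes everything up to the first ]] as the link text.

```haskell
import Data.List (isPrefixOf)

-- | Replace every "[[text]]" with "text".
-- An opening "[[" without a matching "]]" is left untouched.
-- (stripLinks is a hypothetical name, not a library function.)
stripLinks :: String -> String
stripLinks [] = []
stripLinks s@(c:cs)
  | "[[" `isPrefixOf` s
  , Just (inner, rest) <- splitAtClose (drop 2 s) = inner ++ stripLinks rest
  | otherwise = c : stripLinks cs

-- | Split at the first "]]": the text before it and the text after it,
-- or Nothing when there is no closing delimiter.
splitAtClose :: String -> Maybe (String, String)
splitAtClose = go []
  where
    go _ [] = Nothing
    go acc xs@(y:ys)
      | "]]" `isPrefixOf` xs = Just (reverse acc, drop 2 xs)
      | otherwise            = go (y : acc) ys
```

In GHCi, stripLinks "Se mer om [[Göteborg]]" should come back as "Se mer om G\246teborg", which is the show form of the desired string.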
