I have a Unicode text file containing a wiki article in MediaWiki markup. I want to strip the markup. In some cases I want to extract text from inside the markup, such as the link name from a hyperlink (think of a simplified wikiextractor).
My approach is to run a set of regular expressions over the file to remove the markup. For the link case, I need to replace [[link]] with link. This works well with regex, as long as the text does not contain non-ASCII characters such as ö.
An example of what I tried:
ghci> :m +Text.Regex
ghci> subRegex (mkRegex "\\[\\[([() a-zA-Z]*)\\]\\]") "Se mer om [[Stockholm]]" "\\1"
"Se mer om Stockholm"
ghci> subRegex (mkRegex "\\[\\[([() a-zA-Z]*)\\]\\]") "Se mer om [[Göteborg]]" "\\1"
"Se mer om [[G\246teborg]]"
Why is this not working? How do I make the regex engine understand that ö is a perfectly normal letter (at least in Swedish)?
Edit:
The problem does not seem to be in the pattern, but in the engine. If I allow every character except q in the link text, I would expect ö to be allowed as well. But it is not:
ghci> subRegex (mkRegex "\\[\\[([^q]*)\\]\\]") "[[Goteborg]]" "\\1"
"Goteborg"
ghci> subRegex (mkRegex "\\[\\[([^q]*)\\]\\]") "[[Göteborg]]" "\\1"
"[[G\246teborg]]"
ghci> subRegex (mkRegex "ö") "ö" "q"
"q"
ghci> subRegex (mkRegex "[ö]") "ö" "q"
"\246"
The problem seems to arise specifically when character classes are used. A literal ö in the pattern matches ö without trouble.
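For comparison, the link case can also be handled without the regex engine at all. The following is only a minimal sketch: `stripLinks` and `breakOn` are hypothetical helper names, not part of any library, and unlike the pattern above the sketch accepts any characters between the brackets rather than only letters, spaces and parentheses. It works on plain `String` values, so ö is treated like any other `Char`.

```haskell
import Data.List (isPrefixOf)

-- Replace every [[name]] with name.
-- An opening "[[" without a matching "]]" is left untouched.
stripLinks :: String -> String
stripLinks [] = []
stripLinks s@(c:cs)
  | "[[" `isPrefixOf` s
  , (name, rest) <- breakOn "]]" (drop 2 s)
  , "]]" `isPrefixOf` rest
  = name ++ stripLinks (drop 2 rest)
  | otherwise = c : stripLinks cs

-- Split a string at the first occurrence of the needle.
-- The needle, if found, stays at the start of the second part.
breakOn :: String -> String -> (String, String)
breakOn needle = go
  where
    go [] = ([], [])
    go t@(x:xs)
      | needle `isPrefixOf` t = ([], t)
      | otherwise = let (a, b) = go xs in (x : a, b)

main :: IO ()
main = putStrLn (stripLinks "Se mer om [[Göteborg]]")
```

This does not cover piped links such as [[target|label]] or other markup, so it is no replacement for the set of regular expressions, only a way to check the expected result for the simple case.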