How to match all accented forms of a particular character?

Question

I would like to write a regular expression that matches all accented forms of a particular character in text encoded with some Unicode encoding, without explicitly listing all such forms in a character class.

So, for example, if I wanted to match any accented version of a, [aàáâãäå] is not enough, since it only covers the accented forms of a that exist in ISO-8859-1, and it would be nice to match other accented forms that aren't there. What would be acceptable is something like \p{Base_Character: a}, were such a thing defined in Unicode. Is there something that does this?

Edit: I cannot ASCIIfy the string first, because the string is in a database to which I do not have direct access. In fact, I do not have access at the code level at all. The only input I can give is a regular expression.

+4

regex unicode pcre

uckelman Jan 23 '12 at 18:33

2 answers

mvrak · Answer 1 · 2012-01-23T18:36:27+0000

No, there are no libraries that do anything other than list the appropriate code points for the accented versions. Even in UTF-8, I do not see any distinguishable pattern among the codes. Honestly, compiling the list of the other accented versions yourself won't take too long.
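The list suggested here does not have to be typed by hand. The following is a minimal sketch, assuming you can run Java somewhere to generate the regular expression before pasting it in as input. It walks the Basic Multilingual Plane and keeps every character whose canonical decomposition (NFD) is the base letter followed only by combining marks. The class name `AccentedForms` and the methods `onlyMarks` and `characterClassFor` are hypothetical names chosen for illustration, and the result depends on the Unicode version shipped with your JDK.

```java
import java.text.Normalizer;

class AccentedForms {
    // True if every char of s from index `from` onward is a Unicode combining mark.
    static boolean onlyMarks(String s, int from) {
        for (int i = from; i < s.length(); i++) {
            int type = Character.getType(s.charAt(i));
            if (type != Character.NON_SPACING_MARK
                    && type != Character.COMBINING_SPACING_MARK
                    && type != Character.ENCLOSING_MARK) {
                return false;
            }
        }
        return true;
    }

    // Builds a character class such as "[a...]" that lists each BMP character whose
    // NFD form is `base` followed only by combining marks.
    // Assumption: `base` is a plain ASCII letter, so it needs no escaping.
    static String characterClassFor(char base) {
        StringBuilder sb = new StringBuilder("[").append(base);
        for (int cp = 0x80; cp <= 0xFFFF; cp++) {
            char c = (char) cp;
            if (Character.isSurrogate(c)) {
                continue; // a lone surrogate is not a character
            }
            String nfd = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            if (nfd.length() > 1 && nfd.charAt(0) == base && onlyMarks(nfd, 1)) {
                sb.append(c);
            }
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        String cls = characterClassFor('a');
        System.out.println(cls);
        System.out.println("Hernán".matches(".*" + cls + "n"));
    }
}
```

A class built this way only covers precomposed characters. If the stored text may be in decomposed form (a base letter followed by separate combining marks), a pattern such as `a\p{M}*` may be needed in addition, provided the regex engine supports Unicode properties.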

leonbloy · Answer 2 · 2012-01-23T19:04:15+0000

I do not think you can do this. A workaround that may help, depending on your language, platform, and needs, is to ASCII-fy your string before matching. For example, in Java:

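  import java.text.Normalizer;

  String s1 = "Hernán";
  // Decompose to NFD, then drop every non-ASCII character, including the combining accents
  String s2 = Normalizer.normalize(s1, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""); // s2: "Hernan"
  System.out.println(s2);
  System.out.println(s2.matches(".*a.*"));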
