How can I normalize / unaccent text in Java? I am currently using java.text.Normalizer:
Normalizer.normalize(str, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
But this is far from ideal. For example, it leaves the Norwegian characters æ and ø intact. Does anyone know of an alternative? I'm looking for something that converts characters from all languages to just the a-z range. I understand that there are different ways to do this (for example, should æ be encoded as "a", "e" or even "ae"?), and I'm open to any solution. I would prefer not to write something myself, because I think it is unlikely that I could do it well for all languages. Performance is NOT critical.
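For reference, here is a minimal sketch that reproduces the problem. The class and method names (UnaccentDemo, unaccent) are only for illustration. The inputs are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.text.Normalizer;

public class UnaccentDemo {
    // Current approach: decompose to NFD, then drop the combining marks.
    static String unaccent(String str) {
        return Normalizer.normalize(str, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) decomposes into "e" plus a combining mark,
        // so the mark is stripped and a plain "e" remains.
        System.out.println(unaccent("\u00e9").equals("e"));           // true
        // U+00E6 (ae ligature) and U+00F8 (o with stroke) have no
        // decomposition, so they pass through unchanged.
        System.out.println(unaccent("\u00e6").equals("\u00e6"));      // true
        System.out.println(unaccent("\u00f8").equals("\u00f8"));      // true
    }
}
```

As far as I can tell, this happens because these letters are single code points without a base-letter-plus-mark decomposition, so NFD has nothing to split off.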
Use case: I want to convert a user-entered name to a plain a-z name. The converted name will be displayed to the user, so I want it to be as close as possible to what the user wrote in their own language.
EDIT:
Well, thanks for downvoting the post and not addressing my question, yay! :) Perhaps I should have left out the use case. But please let me clarify. I need to convert the name in order to store it internally. I have no control over which letters are allowed. The name will be visible to the user, for example, in the URL, just like your username on this forum is normalized and displayed to you in the URL if you click on your name. This forum turns the name "Băşan" into "baan" and the name "Øyvind" into "yvind". I believe that this can be done better. I am looking for ideas and, preferably, a library function to do this for me. I know that I can't get it perfectly right, and I know that "o" and "ø" are different, etc., but if my name is "Øyvind" and I register on an online forum, I would prefer that my username were oyvind, not yvind. Hope this makes sense! Thanks!
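To make the target concrete, here is a rough sketch of the behavior I am after. It assumes a small hand-written replacement table (the EXTRA map, the SlugSketch class and the toAscii method are all hypothetical names of mine) that only covers æ and ø. That table is exactly the part I would rather get from a library than maintain myself.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class SlugSketch {
    // Hypothetical, hand-picked table for letters that NFD cannot decompose.
    private static final Map<String, String> EXTRA = new LinkedHashMap<>();
    static {
        EXTRA.put("\u00e6", "ae"); // ae ligature
        EXTRA.put("\u00f8", "o");  // o with stroke
    }

    static String toAscii(String name) {
        // Lower-case first so the table only needs lower-case entries.
        String s = name.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : EXTRA.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        // Then the usual NFD decomposition plus removal of combining marks.
        s = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Finally drop anything that is still outside a-z.
        return s.replaceAll("[^a-z]", "");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u00d8yvind"));     // prints "oyvind"
        System.out.println(toAscii("B\u0103\u015fan")); // prints "basan"
    }
}
```

Anything not in the table and not decomposable is still silently dropped by the last step, so this does not scale to "all languages" on its own.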
(And no, we don't allow the user to choose their own username. I'm really looking for an alternative to java.text.Normalizer. Thanks!)