Normalizing / unaccenting text in Java

Question

How can I normalize / unaccent text in Java? I am currently using java.text.Normalizer:

Normalizer.normalize(str, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "")

But this is far from ideal. For example, it leaves the Norwegian characters æ and ø intact. Does anyone know of an alternative? I'm looking for something that converts characters in all languages to just the a-z range. I understand that there are different ways to do this (for example, should "æ" be encoded as "a", "e" or even "ae"?), and I'm open to any solution. I would prefer not to write something myself, because I think it is unlikely that I could do it well for all languages. Performance is NOT critical.

Use case: I want to convert a user-entered name to a plain a-z name. The converted name will be displayed to the user, so I want it to be as close as possible to what the user wrote in their own language.

EDIT:

Well, thanks for voting the post down and not addressing my question, yay! :) Perhaps I should have left out the use case. But please let me clarify. I need to convert the name in order to store it internally. I have no control over which letters are allowed. The name will be visible to the user, for example, in the URL, just like your username on this forum is normalized and displayed to you in the URL if you click on your name. This forum turns the name "Băşan" into "baan" and the name "Øyvind" into "yvind". I believe that this can be done better. I am looking for ideas and, preferably, a library function to do this for me. I know that I can't get it completely right, and I know that "o" and "ø" are different, etc., but if my name is "Øyvind" and I register on an online forum, I would prefer my username to be "oyvind", not "yvind". Hope this makes sense! Thanks!

(And no, we don’t allow the user to choose their own username. I'm really looking for an alternative to java.text.Normalizer. Thanks!)

+4

java text character normalize

John Nov 07 '11 at 23:02

1 answer

Kane · Answer 1 · 2011-11-08T02:46:00+0000

This assumes that you have considered ALL the consequences of what you are doing and ALL the ways it might go wrong, including what you will do when you get Chinese characters and other things that have no equivalent in the Latin alphabet ...

There is no library that I know of that does what you want. If you have a list of equivalents (as you say, “æ” to “ae” or whatever), you can store them in a file (or, if you do this a lot, in a sorted array in memory, for performance reasons) and then do a lookup and replace for each character. If you have room to store an array with one entry per Unicode character, using the Unicode value of each character as a direct index into that array is the most efficient lookup.

i.e., \u1234 => lookupArray[0x1234] => 'q'

or something else.

So you will have a loop that looks like this:

 StringBuilder buf = new StringBuilder();
 for (int i = 0; i < string.length(); i++) {
     // a char widens to int, so it can index the lookup array directly
     buf.append(lookupArray[string.charAt(i)]);
 }

I wrote this off the top of my head, so there are probably some wrong method calls or something like that.
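
To make the lookup idea a bit more concrete, here is a minimal sketch that uses a map instead of a full array, so that one character can expand to several letters (æ to "ae"). The names LookupFold and fold and the table contents are made up for illustration, and the table is nowhere near complete. Characters that are not in the table are passed through unchanged.

```java
import java.util.HashMap;
import java.util.Map;

class LookupFold {
    // Hypothetical, deliberately incomplete table; extend it for the languages you care about.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00E6', "ae"); TABLE.put('\u00C6', "Ae"); // ae ligature
        TABLE.put('\u00F8', "o");  TABLE.put('\u00D8', "O");  // o with stroke
        TABLE.put('\u00E5', "a");  TABLE.put('\u00C5', "A");  // a with ring
        TABLE.put('\u00DF', "ss");                            // sharp s
        TABLE.put('\u0142', "l");  TABLE.put('\u0141', "L");  // l with stroke
    }

    // Replaces every char found in TABLE; everything else is copied as-is.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = TABLE.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

A map is slower than direct array indexing, but it only needs entries for the characters you actually handle, and the question says performance is not critical.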

You will need to do something to handle decomposed characters (a base letter followed by combining marks), possibly with a lookahead buffer.
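
One way to avoid most of the lookahead work, sketched here under some assumptions: let java.text.Normalizer (the class from the question) decompose accented letters first, then walk the result by code point, so that surrogate pairs count as one character, and only consult a small table for letters that have no decomposition, such as ø. SlugFold, toSlug and special are hypothetical names, the table is deliberately incomplete, and dropping everything that is not a-z after folding is my reading of the a-z requirement in the question.

```java
import java.text.Normalizer;
import java.util.Locale;

class SlugFold {
    // Hypothetical table for letters that NFD does not decompose. Incomplete on purpose.
    static String special(int codePoint) {
        switch (codePoint) {
            case 0x00E6: return "ae"; // ae ligature
            case 0x00F8: return "o";  // o with stroke
            case 0x00DF: return "ss"; // sharp s
            case 0x0142: return "l";  // l with stroke
            default:     return null;
        }
    }

    static String toSlug(String input) {
        // Lowercase first, then split accented letters into base letter + combining marks.
        String decomposed = Normalizer.normalize(
                input.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp); // advance 2 chars for a surrogate pair
            if (cp >= 'a' && cp <= 'z') {
                out.appendCodePoint(cp);
            } else {
                String replacement = special(cp);
                if (replacement != null) {
                    out.append(replacement);
                }
                // Anything else (combining marks, digits, CJK, emoji, ...) is dropped.
            }
        }
        return out.toString();
    }
}
```

With this sketch, "Øyvind" comes out as "oyvind" and "Băşan" as "basan". Names written entirely in a non-Latin script come out empty, so you would still need a fallback for those.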

Good luck - I'm sure this is fraught with pitfalls.
