Compare Chinese Unicode strings when multiple code points are the same characters?

Question

I am writing Java code that deals with Chinese characters, and I got some unexpected results: strings that should have been equal were not. Here is one of the offending characters, which means six (pinyin: liù): 六. This character can be represented by either of two code points:

U+F9D1, in the block CJK Compatibility Ideographs
U+516D, in the block CJK Unified Ideographs

Wikipedia has a page about these character ranges, and a short section on compatibility ideographs mentions some duplicates, but this particular character is omitted from the list.

So, I am wondering:

Is there a list of duplicate Unicode characters somewhere so that I can convert Strings before trying to compare them?
Is this normal when working with CJK characters, or did I do something else wrong?

unicode normalization unicode-normalization cjk

Rob n Mar 20 '12 at 21:39

1 answer

tchrist · Accepted Answer · 2012-03-20T22:31:43+0000

Just normalize them. U+F9D1 becomes U+516D under any of the four normalization forms:

    $ export PERL_UNICODE=S
    $ perl -le 'print "\x{F9D1}\x{516D}"' | uniquote -v
    \N{CJK COMPATIBILITY IDEOGRAPH-F9D1}\N{CJK UNIFIED IDEOGRAPH-516D}
    $ perl -le 'print "\x{F9D1}\x{516D}"' | nfd | uniquote -v
    \N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
    $ perl -le 'print "\x{F9D1}\x{516D}"' | nfc | uniquote -v
    \N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
    $ perl -le 'print "\x{F9D1}\x{516D}"' | nfkd | uniquote -v
    \N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
    $ perl -le 'print "\x{F9D1}\x{516D}"' | nfkc | uniquote -v
    \N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
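Since the question is about Java, the same idea can probably be sketched with the standard library's java.text.Normalizer. This is only a minimal sketch, not part of the original answer: the class name NormalizedCompare and the helper equalsNormalized are made up for illustration, and NFC is an arbitrary choice here, since any of the four forms maps U+F9D1 to U+516D.

```java
import java.text.Normalizer;

// Hypothetical illustration class; the name is not from the original post.
final class NormalizedCompare {

    // Hypothetical helper: normalizes both strings to NFC, then compares them.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String compat = "\uF9D1";   // CJK COMPATIBILITY IDEOGRAPH-F9D1
        String unified = "\u516D";  // CJK UNIFIED IDEOGRAPH-516D

        // A plain equals() compares code units, so the two strings differ.
        System.out.println(compat.equals(unified));

        // After normalization the two strings compare equal.
        System.out.println(equalsNormalized(compat, unified));

        // Show the code point each normalization form produces for U+F9D1.
        for (Normalizer.Form form : Normalizer.Form.values()) {
            String normalized = Normalizer.normalize(compat, form);
            System.out.printf("%s -> U+%04X%n", form, normalized.codePointAt(0));
        }
    }
}
```

If many strings are compared repeatedly, it is likely cheaper to normalize each one once when it enters the program and store the normalized form, rather than normalizing on every comparison.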

These and other important Unicode tools are available here.
