How do I compare Chinese Unicode strings when multiple code points represent the same character?

I am writing Java code that deals with Chinese characters, and I got some unexpected results: strings that should have been equal were not. Here is one of the offending characters, which means six (pinyin: liù): 六. This character can be represented by either of two code points:

  • U+F9D1 in the CJK Compatibility Ideographs block
  • U+516D in the CJK Unified Ideographs block

Wikipedia has a page about these character ranges, and a short section on compatibility ideographs mentions some duplicates, but this particular character is omitted from the list.

So, I am wondering:

  • Is there a list of duplicate Unicode characters somewhere so that I can convert Strings before trying to compare them?
  • Is this normal when working with CJK characters, or did I do something else wrong?
1 answer

Just normalize them. U+F9D1 becomes U+516D under any of the four Unicode normalization forms (NFD, NFC, NFKD, NFKC):

    $ export PERL_UNICODE=S
    $ perl -le 'print "\x{F9D1}\x{516D}"' | uniquote -v
    \N{CJK COMPATIBILITY IDEOGRAPH-F9D1}\N{CJK UNIFIED IDEOGRAPH-516D}
    $ perl -le 'print "\x{F9D1}\x{516D}"' | nfd | uniquote -v
    \N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
    $ perl -le 'print "\x{F9D1}\x{516D}"' | nfc | uniquote -v
    \N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
    $ perl -le 'print "\x{F9D1}\x{516D}"' | nfkd | uniquote -v
    \N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
    $ perl -le 'print "\x{F9D1}\x{516D}"' | nfkc | uniquote -v
    \N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}

The uniquote, nfd, nfc, nfkd, and nfkc commands are Unicode command-line tools; they and other essential Unicode tools are available here.
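
Since the question is about Java, the same idea can be sketched with the standard java.text.Normalizer class. This is only a minimal illustration, not a complete solution: the class name NormalizeCompare and the helper equalsNormalized are hypothetical names chosen for this example, and NFC is picked arbitrarily, since any of the four forms maps U+F9D1 to U+516D.

```java
import java.text.Normalizer;

public class NormalizeCompare {

    // Hypothetical helper: compares two strings after normalizing both to NFC.
    // Any of the four forms (NFD, NFC, NFKD, NFKC) would unify U+F9D1 and U+516D.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String compat = "\uF9D1";   // CJK COMPATIBILITY IDEOGRAPH-F9D1
        String unified = "\u516D";  // CJK UNIFIED IDEOGRAPH-516D

        // A plain comparison sees two different code points.
        System.out.println(compat.equals(unified));            // false

        // After normalization both strings contain U+516D.
        System.out.println(equalsNormalized(compat, unified)); // true
    }
}
```

If many strings are compared repeatedly, it is probably cheaper to normalize each one once when it enters the program rather than on every comparison.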


Source: https://habr.com/ru/post/1402541/

