Comparing Unicode in Perl and Java

Question

What is the best way to write a function that sorts strings identically in Perl and Java? Here is an example comparison function in Perl:

    sub compare_strs {
        my ( $str1, $str2 ) = @_;
        # Treat vars as strings by quoting.
        # Possibly incorrect/irrelevant approach.
        return ( "$str1" cmp "$str2" );
    }

The concern here is:

The strings may contain Chinese or Japanese characters. The Perl code above cannot be relied on to give the expected result. How can I ensure that the Perl and Java implementations compare strings in the same way?

java perl unicode

syker Jul 26 '13 at 20:05

1 answer

Ted hopp · Accepted Answer · 2013-07-26T20:42:05+0000

For Perl, do not use the cmp operator. Instead, use the Unicode::Collate module:

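    use Unicode::Collate;

    # Build the collator once and reuse it; with no arguments it uses the
    # module's default collation table.
    my $Collator = Unicode::Collate->new();

    sub compare_strs {
        my ( $str1, $str2 ) = @_;
        # Treat vars as strings by quoting.
        # Possibly incorrect/irrelevant approach.
        return $Collator->cmp( "$str1", "$str2" );
    }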

If you are concerned about normalization (for example, the order of combining marks), you can also use the Unicode::Normalize module.

In Java, use the Collator class, as described in the Comparing Strings tutorial. For normalization, see the Normalizing Text tutorial. The java.text.Normalizer class was introduced in Java 1.6 (java.text.Collator has been available since much earlier); if you need normalization on earlier versions of Java, you will need to use something like the ICU libraries.
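As a rough Java counterpart to the Perl function, here is a minimal sketch that normalizes both strings and then compares them with a Collator. The class name CompareStrs, the choice of Locale.ROOT, and the choice of NFC are assumptions made for illustration, not something the answer specifies; pick whatever locale and normalization form match the Perl side.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class CompareStrs {
    // Assumption: a locale-neutral collator. Use a specific Locale if the
    // Perl side is configured for one.
    private static final Collator COLLATOR = Collator.getInstance(Locale.ROOT);

    // Normalize to NFC (an assumed choice) so that canonically equivalent
    // sequences are handed to the collator in the same form.
    public static int compareStrs(String str1, String str2) {
        String n1 = Normalizer.normalize(str1, Normalizer.Form.NFC);
        String n2 = Normalizer.normalize(str2, Normalizer.Form.NFC);
        return COLLATOR.compare(n1, n2);
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(Arrays.asList("日本語", "中文", "abc", "e\u0301", "\u00e9"));
        // Sort with the collator-based comparison instead of String.compareTo.
        words.sort(CompareStrs::compareStrs);
        System.out.println(words);
    }
}
```

The collator is created once and reused, mirroring the single $Collator object on the Perl side. Whether the two implementations order every string identically still depends on both using matching collation settings, so it is worth checking them against a shared set of sample strings.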

Using the appropriate tools, as described above, should ensure that both environments follow the Unicode Collation Algorithm (and are therefore compatible with each other).
