How can I programmatically determine which Unicode version Java supports?

Question

Since Java code can run on any Java virtual machine, I would like to know how to programmatically determine which version of Unicode is supported.

+6

java java-7 unicode jvm

Roman kagan Aug 4 '11 at 12:15

5 answers

Alternatively, you can check a code point's general category with a regex. Here are a few selected code points:

Unicode 6.0.0:

 Ꞡ U+A7A0 GC=Lu SC=Latin LATIN CAPITAL LETTER G WITH OBLIQUE STROKE
 ₹ U+20B9 GC=Sc SC=Common INDIAN RUPEE SIGN
 ₜ U+209C GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER T

Unicode 5.2:

 Ɒ U+2C70 GC=Lu SC=Latin LATIN CAPITAL LETTER TURNED ALPHA
 ⅐ U+2150 GC=No SC=Common VULGAR FRACTION ONE SEVENTH
 ⸱ U+2E31 GC=Po SC=Common WORD SEPARATOR MIDDLE DOT

Unicode 5.1:

 ꝺ U+A77A GC=Ll SC=Latin LATIN SMALL LETTER INSULAR D
 Ᵹ U+A77D GC=Lu SC=Latin LATIN CAPITAL LETTER INSULAR G
 ⚼ U+26BC GC=So SC=Common SESQUIQUADRATE

Unicode 5.0:

 Ⱶ U+2C75 GC=Lu SC=Latin LATIN CAPITAL LETTER HALF H
 ɂ U+0242 GC=Ll SC=Latin LATIN SMALL LETTER GLOTTAL STOP
 ⬔ U+2B14 GC=So SC=Common SQUARE WITH UPPER RIGHT DIAGONAL HALF BLACK

I have included the general category and script property, although you can only check the script in JDK7, the first release of Java that supports this.

I found these code points by running commands like these from the command line:

 % unichars -gs '\p{Age=5.1}'
 % unichars -gs '\p{Lu}' '\p{Age=5.0}'

That uses the unichars program. It will only find properties supported in the Unicode Character Database for whichever UCD version your version of Perl supports.

I also like my output sorted, so I usually run

  % unichars -gs '\p{Alphabetic}' '\p{Age=6.0}' | ucsort | less -r

where ucsort is a program that sorts text according to the Unicode Collation Algorithm.

However, in Perl, unlike in Java, this is easy to find out. For example, if you run this from the command line (yes, there is a programmer API, too), you will find:

 $ corelist -a Unicode
   v5.6.2     3.0.1
   v5.8.0     3.2.0
   v5.8.1     4.0.0
   v5.8.8     4.1.0
   v5.10.0    5.0.0
   v5.10.1    5.1.0
   v5.12.0    5.2.0
   v5.14.0    6.0.0

This shows that Perl version 5.14.0 was the first to support Unicode 6.0.0. For Java, I believe there is no API that gives you this information directly, so you have to either hard-code a table mapping Java versions to Unicode versions, or use the empirical method of testing code points for properties. By empirically, I mean the equivalent of this kind of thing:

 % ruby -le 'print "\u2C75" =~ /\p{Lu}/ ? "pass 5.2" : "fail 5.2"'
 pass 5.2
 % ruby -le 'print "\uA7A0" =~ /\p{Lu}/ ? "pass 6.0" : "fail 6.0"'
 fail 6.0
 % ruby -v
 ruby 1.9.2p0 (2010-08-18 revision 29036) [i386-darwin9.8.0]
 % perl -le 'print "\x{2C75}" =~ /\p{Lu}/ ? "pass 5.2" : "fail 5.2"'
 pass 5.2
 % perl -le 'print "\x{A7A0}" =~ /\p{Lu}/ ? "pass 6.0" : "fail 6.0"'
 pass 6.0
 % perl -v
 This is perl 5, version 14, subversion 0 (v5.14.0) built for darwin-2level
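A rough Java counterpart of this empirical test might look like the sketch below. It is only an illustration: the class and method names are hypothetical, the marker code points are uppercase letters introduced in Unicode 5.0 through 6.0, and the result is merely a lower bound, since a newer JVM recognizes all of them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: probes the running JVM for the newest Unicode version
// (among the few listed here) whose marker code point it already recognizes.
class UnicodeProbe {

    // Marker code points, newest first. Each one is an uppercase letter
    // (GC=Lu) that was introduced in the Unicode version it maps to.
    private static final Map<Integer, String> MARKERS = new LinkedHashMap<Integer, String>();
    static {
        MARKERS.put(0xA7A0, "6.0"); // LATIN CAPITAL LETTER G WITH OBLIQUE STROKE
        MARKERS.put(0x2C70, "5.2"); // LATIN CAPITAL LETTER TURNED ALPHA
        MARKERS.put(0xA77D, "5.1"); // LATIN CAPITAL LETTER INSULAR G
        MARKERS.put(0x2C75, "5.0"); // LATIN CAPITAL LETTER HALF H
    }

    // Returns a lower bound on the supported Unicode version,
    // or null if none of the markers is recognized as GC=Lu.
    static String atLeast() {
        for (Map.Entry<Integer, String> e : MARKERS.entrySet()) {
            int type = Character.getType(e.getKey().intValue());
            if (type == Character.UPPERCASE_LETTER) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("Unicode version is at least: " + atLeast());
    }
}
```

An unassigned code point reports the UNASSIGNED type, so a JVM built on an older Unicode version falls through to the next marker in the list.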

To find out the age of a particular code point, run uniprops -a on it like this:

 % uniprops -a 10424
 U+10424 ‹𐐤› \N{DESERET CAPITAL LETTER EN}
  \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
  All Any Alnum Alpha Alphabetic Assigned InDeseret Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Deseret Dsrt Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
  Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Deseret Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None Script=Deseret East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Dsrt Script=Dsrt Sentence_Break=UP Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE Word_Break=LE _X_Begin

All of my Unicode tools are available in the Unicode::Tussle bundle, including unichars, uninames, uniquote, ucsort, and more.

Java 1.7 Improvements

JDK7 goes a long way toward making several Unicode tasks easier. I talk about this a bit at the end of my OSCON Unicode Support Shootout talk. I had thought about putting together a table of which languages support which versions of Unicode in which versions of those languages, but ended up scrapping that and telling people to just get the latest version of each language. For example, I know that Unicode 6.0.0 is supported by Java 1.7, Perl 5.14, and Python 2.7 or 3.2.

JDK7 contains updates to the Character, String, and Pattern classes in support of Unicode 6.0.0. This includes support for Unicode script properties and several enhancements to Pattern so that it can meet the Level 1 support requirements of UTS #18, Unicode Regular Expressions. These include:

The isUpperCase and isLowerCase methods now correctly correspond to the Unicode Uppercase and Lowercase properties. Previously they were applied only to letters, which is wrong because it misses the Other_Uppercase and Other_Lowercase code points respectively. For example, these are some lowercase code points that are not GC=Ll (lowercase letters), selected samples only:

 % unichars -gs '\p{lowercase}' '\P{LL}'
 ◌ͅ U+0345 GC=Mn SC=Inherited COMBINING GREEK YPOGEGRAMMENI
 ͺ U+037A GC=Lm SC=Greek GREEK YPOGEGRAMMENI
 ˢ U+02E2 GC=Lm SC=Latin MODIFIER LETTER SMALL S
 ˣ U+02E3 GC=Lm SC=Latin MODIFIER LETTER SMALL X
 ᴬ U+1D2C GC=Lm SC=Latin MODIFIER LETTER CAPITAL A
 ᴮ U+1D2E GC=Lm SC=Latin MODIFIER LETTER CAPITAL B
 ᵂ U+1D42 GC=Lm SC=Latin MODIFIER LETTER CAPITAL W
 ᵃ U+1D43 GC=Lm SC=Latin MODIFIER LETTER SMALL A
 ᵇ U+1D47 GC=Lm SC=Latin MODIFIER LETTER SMALL B
 ₐ U+2090 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER A
 ₑ U+2091 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER E
 ⅰ U+2170 GC=Nl SC=Latin SMALL ROMAN NUMERAL ONE
 ⅱ U+2171 GC=Nl SC=Latin SMALL ROMAN NUMERAL TWO
 ⅲ U+2172 GC=Nl SC=Latin SMALL ROMAN NUMERAL THREE
 ⓐ U+24D0 GC=So SC=Common CIRCLED LATIN SMALL LETTER A
 ⓑ U+24D1 GC=So SC=Common CIRCLED LATIN SMALL LETTER B
 ⓒ U+24D2 GC=So SC=Common CIRCLED LATIN SMALL LETTER C

The alphabetic tests are now correct in that they use Other_Alphabetic. They did this wrong prior to 1.7, which is a problem.

The \x{HHHHH} pattern escape, so that you can meet RL1.1; this lets you rewrite [𝒜-𝒵] (which fails because of UTF-16 surrogate pairs) as [\x{1D49C}-\x{1D4B5}]. JDK7 is the first Java release that fully and correctly supports non-BMP characters in this regard. Surprising, but true.

More properties for RL1.2, of which the script property is by far the most important. This lets you write \p{script=Greek}, for example, which can be abbreviated as \p{sc=Greek} or \p{IsGreek}.

The new UNICODE_CHARACTER_CLASS pattern compilation flag and the corresponding "(?U)" embedded flag, to meet the RL1.2a compatibility properties.
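To make these additions concrete, here is a small sketch that exercises them on a Java 7 or later runtime. The class and method names are hypothetical and the sample strings are chosen only for illustration; the `Pattern` and `Character` calls themselves are standard JDK API.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the JDK7 Unicode additions; requires Java 7 or later.
class Jdk7UnicodeDemo {

    // The brace form of the hex escape lets a character class span
    // non-BMP code points (RL1.1): MATHEMATICAL SCRIPT CAPITAL A through Z.
    static boolean isMathScriptCapital(String s) {
        return Pattern.matches("[\\x{1D49C}-\\x{1D4B5}]", s);
    }

    // Script property support (RL1.2).
    static boolean isGreek(String s) {
        return Pattern.matches("\\p{script=Greek}+", s);
    }

    // With UNICODE_CHARACTER_CLASS, the word-character class follows the
    // Unicode definition instead of matching only ASCII word characters.
    static boolean isUnicodeWord(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        String scriptA = new String(Character.toChars(0x1D49C));
        System.out.println(isMathScriptCapital(scriptA));
        System.out.println(isGreek("\u03B1\u03B2\u03B3"));
        System.out.println(isUnicodeWord("caf\u00E9"));
        // Same check using the embedded flag instead of the compile flag.
        System.out.println(Pattern.matches("(?U)\\w+", "caf\u00E9"));
        // SMALL ROMAN NUMERAL ONE is Other_Lowercase, not GC=Ll.
        System.out.println(Character.isLowerCase(0x2170));
    }
}
```

Without the flag, the last letter of the accented sample word is not a word character to `\w`, which is the difference the RL1.2a compatibility properties are meant to address.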

Of course, I can see why you would want to make sure you are running a Java with Unicode 6.0.0 support, since it comes with all of these other benefits.

+5

tchrist Aug 4 '11 at 14:13

The Unicode version is defined in the Java Language Specification §3.1. Unicode 4.0 has been supported since J2SE 5.0.

+3

pmnt Aug 4 '11 at 12:35

I do not think this is available through a public API. But it does not change very often, so you can get the specification version:

 System.getProperties().getProperty("java.specification.version")

and based on this, determine the version of Unicode.

 java 1.0 -> Unicode 1.1
 java 1.1 -> Unicode 2.0
 java 1.2 -> Unicode 2.0
 java 1.3 -> Unicode 2.0
 java 1.4 -> Unicode 3.0
 java 1.5 -> Unicode 4.0
 java 1.6 -> Unicode 4.0
 java 1.7 -> Unicode 6.0
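One way to turn that table into code is a plain map lookup, sketched below. The class and method names are hypothetical, and the map holds only the entries listed above, so any specification version outside the table yields null rather than a guess.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup built from the table above.
class UnicodeVersionTable {

    private static final Map<String, String> BY_SPEC_VERSION = new HashMap<String, String>();
    static {
        BY_SPEC_VERSION.put("1.0", "1.1");
        BY_SPEC_VERSION.put("1.1", "2.0");
        BY_SPEC_VERSION.put("1.2", "2.0");
        BY_SPEC_VERSION.put("1.3", "2.0");
        BY_SPEC_VERSION.put("1.4", "3.0");
        BY_SPEC_VERSION.put("1.5", "4.0");
        BY_SPEC_VERSION.put("1.6", "4.0");
        BY_SPEC_VERSION.put("1.7", "6.0");
    }

    // Returns the Unicode version for a Java specification version,
    // or null if the version is not in the table.
    static String forSpecVersion(String specVersion) {
        return BY_SPEC_VERSION.get(specVersion);
    }

    // Looks up the version reported by the running JVM.
    static String forRunningJvm() {
        return forSpecVersion(System.getProperty("java.specification.version"));
    }

    public static void main(String[] args) {
        System.out.println("Unicode version: " + forRunningJvm());
    }
}
```

The obvious drawback is that the table has to be extended by hand for every new Java release.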

To verify this, see the Javadoc for the Character class.

+2

Michał Šrajer Aug 4 '11 at 12:45

Since the supported Unicode version is determined by the Java version, you can use that information and derive the Unicode version from what System.getProperty("java.version") returns.

I assume that you want to support only certain versions of Unicode, or at least a minimum version. I am not a Unicode expert, but since the versions seem to be backward compatible, you could require a Unicode version of at least 4.0, which means that the supported Java version must be at least 5.0.
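A minimum-version check along those lines could be sketched as follows. The names are hypothetical, and the sketch reads `java.specification.version` rather than `java.version` because the former has a simpler format: it is assumed to be either of the form "1.x" (up to Java 8) or a bare major number such as "11" (Java 9 and later).

```java
// Hypothetical minimum-version check based on the specification version.
class MinimumJavaCheck {

    // Converts "1.5" to 5 and "11" to 11.
    static int majorVersion(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        return Integer.parseInt(v);
    }

    // Java 5.0 and later support at least Unicode 4.0.
    static boolean supportsAtLeastUnicode4(String specVersion) {
        return majorVersion(specVersion) >= 5;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("Unicode 4.0 or later: " + supportsAtLeastUnicode4(spec));
    }
}
```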

+1

Thomas Aug 4 '11 at 12:42

Vineet reynolds · Accepted Answer · 2011-08-04T12:42:57+0000

This is not trivial if you are looking for a class that makes this information available to you.

Typically, the Unicode version supported by Java changes from one major specification version to the next, and this information is documented in the Character class of the Java API documentation (which is derived from the Java Language Specification). However, you cannot rely on the Java Language Specification alone, since each major version of Java need not have its own version of the Java Language Specification.

Therefore, you must translate between the Java version supported by the JVM and the supported Unicode version:

String specVersion = System.getProperty("java.specification.version");
if (specVersion.equals("1.7"))
    return "6.0";
else if (specVersion.equals("1.6"))
    return "4.0";
else if (specVersion.equals("1.5"))
    return "4.0";
else if (specVersion.equals("1.4"))
    return "3.0";
... and so on

The details of the supported versions can be obtained from the Java Language Specification. Quoting JSR 901, which is the Java Language Specification for Java 7:

The Java SE platform tracks the Unicode specification as it evolves. The precise version of Unicode used by a given release is specified in the documentation of the class Character.
Versions of the Java programming language prior to 1.1 used Unicode version 1.1.5. Upgrades to newer versions of the Unicode Standard occurred in JDK 1.1 (to Unicode 2.0), JDK 1.1.7 (to Unicode 2.1), Java SE 1.4 (to Unicode 3.0), and Java SE 5.0 (to Unicode 4.0).
