Best way to find out whether text in a Java String contains non-ASCII (UTF-8-encoded) characters or not

Is there any other way to find out whether a Java String contains characters outside ASCII, such as Arabic words?

I tried this code, but does it actually do the job?

 char c = 'أ'; // an Arabic character
 int num = (int) c;
 if (num > 128) { /* then non-ASCII characters exist */ }
3 answers

(Assuming UTF-8 == non-ASCII)

What you can do is encode the string to ASCII, then decode it back, and compare the result with the original. If they are not equal, there are non-ASCII characters.
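A minimal sketch of that round-trip check (the method name is my own):

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    // Characters that cannot be mapped to ASCII are replaced (with '?')
    // during encoding, so the decoded copy differs from the original
    // whenever the string contains non-ASCII characters.
    static boolean isPureAscii(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII).equals(s);
    }
}
```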

However, your own sample will also work (almost: it should be >= 128 ), because the following shows that all char values < 128 are indeed ASCII:

For backward compatibility, the first 128 ASCII characters and the 256 ISO-8859-1 (Latin-1) characters are assigned Unicode/UCS code points that are the same as their codes in the earlier standards.

The first plane (code points U+0000 – U+FFFF) contains the most commonly used characters and is called the Basic Multilingual Plane, or BMP. Both UTF-16 and UCS-2 encode valid code points in this range as single 16-bit code units that are numerically equal to the corresponding code points.

("UTF-16" and "ASCII", Wikipedia)

And Java char values are UTF-16 code units.
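Given that, a corrected version of the asker's per-char check could be sketched like this (the method name is mine):

```java
public class NonAsciiCheck {
    // Any char with value >= 128 lies outside the ASCII range;
    // per the quotes above, chars < 128 are guaranteed to be ASCII.
    static boolean containsNonAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 128) {
                return true;
            }
        }
        return false;
    }
}
```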


However, judging by this question as a whole, you may be better off reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).


Java always encodes a String internally as UTF-16, regardless of its contents: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html

You can convert it to any supported encoding, including ASCII and UTF-8, but you may lose characters that cannot be represented in the selected encoding.

Depending on why you are checking, you can convert the string to ASCII, read it back into a Java String, and see whether the two match. If they do, ASCII is enough to store your string. This is also the most obvious check for later readers of your source code.

You can also compare the Unicode code point of each character with 128: if they are all <= 127, the string is ASCII-compatible, i.e. it contains no Arabic (or any other non-ASCII) characters. To get the Unicode code point of a character in your string, use str.codePointAt(index) .
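On Java 8+, that code-point check can be written in one line with streams (a sketch; the method name is mine):

```java
public class AsciiCompat {
    // True when every code point in the string is <= 127,
    // i.e. the whole string fits into ASCII.
    static boolean isAsciiCompatible(String s) {
        return s.codePoints().allMatch(cp -> cp <= 127);
    }
}
```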

If you explicitly want to find Arabic text, you must explicitly check for Arabic characters. Otherwise you may get false positives for French, German, or many other languages that use accented characters. Fortunately, the Unicode Consortium groups the characters of a script into contiguous blocks, so the check essentially comes down to cp >= beginningOfUnicodeBlock && cp <= endOfUnicodeBlock .

Edit (thanks to tchrist): there are java.lang.Character.UnicodeBlock and java.lang.Character.UnicodeScript . The latter was added in Java 7. Both of them can be used to classify Unicode code points.

 int cp = str.codePointAt(index);
 if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
     // Arabic character found
 }
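Applied to a whole string, the same idea could be sketched as follows (requires Java 7+ for UnicodeScript and Java 8+ for codePoints(); the method name is mine):

```java
public class ArabicCheck {
    // True if any code point in the string belongs to the Arabic script.
    static boolean containsArabic(String s) {
        return s.codePoints().anyMatch(
                cp -> Character.UnicodeScript.of(cp)
                        == Character.UnicodeScript.ARABIC);
    }
}
```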

I do not believe there is a definitive way to find out with 100% accuracy. UTF-8 and UTF-16 text may start with an optional byte order mark (BOM), which you can look for. There is no guarantee that it will be present, but many tools include it, especially for UTF-16, where byte order actually matters.

Apache Commons IO includes a convenient BOMInputStream class for reading BOM-prefixed data, which is easy to use:

 BOMInputStream bomIn = new BOMInputStream(in);
 if (bomIn.hasBOM()) {
     // has a UTF-8 BOM
 }
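If pulling in Commons IO is not an option, the same check can be sketched in plain Java by comparing the first bytes against the UTF-8 BOM (0xEF 0xBB 0xBF); the class and method names here are my own:

```java
import java.io.IOException;
import java.io.PushbackInputStream;

public class BomCheck {
    // Returns true if the stream starts with a UTF-8 BOM. The bytes that
    // were read are pushed back either way, so the caller can still read
    // the full stream content afterwards.
    static boolean hasUtf8Bom(PushbackInputStream in) throws IOException {
        byte[] head = new byte[3];
        int n = in.read(head, 0, 3);
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (n > 0) {
            in.unread(head, 0, n); // restore the bytes for the caller
        }
        return bom;
    }
}
```

Note the PushbackInputStream must be constructed with a pushback buffer of at least 3 bytes, e.g. `new PushbackInputStream(raw, 3)`.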

Source: https://habr.com/ru/post/1402924/
