How to read Unicode G-Clef (U+1D11E) from a file?

G-Clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means it requires more than 16 bits. Almost all Java read functions return only a char or an int holding a single UTF-16 code unit, i.e. only 16 bits of character data. Which function reads full Unicode characters, including the SMP, SIP, TIP, SSP, and PUA?
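For illustration (this snippet is my addition, not part of the original question): U+1D11E does not fit into a single char; in Java it is represented as the surrogate pair \uD834\uDD1E, as a quick check shows:

 // Illustration only: U+1D11E needs two UTF-16 code units (a surrogate pair).
 int gClef = 0x1D11E;
 System.out.println(Character.charCount(gClef));                   // prints 2
 char[] units = Character.toChars(gClef);
 System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // prints D834 DD1E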

Update

I asked how to read a single Unicode character (or code point) from an input stream. I don't have an integer array and don't want to read a whole string.

You can build on Character.toCodePoint(), but this function requires chars. Reading a char directly, on the other hand, is not possible, because read() returns an int. My best attempt so far is this, but it still contains unsafe casts:

 public int read_code_point(Reader input) throws java.io.IOException {
     int ch16 = input.read();
     if (Character.isHighSurrogate((char) ch16))
         return Character.toCodePoint((char) ch16, (char) input.read());
     else
         return (int) ch16;
 }

How can I do this better?

Update 2

Another version that returns a String but still uses casts:

 public String readchar(Reader input) throws java.io.IOException {
     int i16 = input.read();               // UTF-16 code unit as int
     if (i16 == -1)
         return null;
     char c16 = (char) i16;                // UTF-16 code unit
     if (Character.isHighSurrogate(c16)) {
         int low_i16 = input.read();       // low surrogate UTF-16 code unit as int
         if (low_i16 == -1)
             throw new java.io.IOException("Can not read low surrogate");
         char low_c16 = (char) low_i16;
         int codepoint = Character.toCodePoint(c16, low_c16);
         return new String(Character.toChars(codepoint));
     } else {
         return Character.toString(c16);
     }
 }

The remaining question is: are these casts safe, or how can I avoid them?

2 answers

My best attempt so far is this, but it still contains unsafe casts

The only unsafe thing about your code is that ch16 could be -1 if input has reached EOF. If you check for that condition first, you can guarantee that the other (char) casts are safe, because Reader.read() is specified to return either -1 or a value in the char range (0 to 0xFFFF).

 public int read_code_point(Reader input) throws java.io.IOException {
     int ch16 = input.read();
     if (ch16 < 0 || !Character.isHighSurrogate((char) ch16)) {
         return ch16;
     } else {
         int loSurr = input.read();
         if (loSurr < 0 || !Character.isLowSurrogate((char) loSurr))
             return ch16;   // or possibly throw an exception
         else
             return Character.toCodePoint((char) ch16, (char) loSurr);
     }
 }

This is still not perfect: ideally you should handle the edge case where the first char read is a high surrogate but the second is not a matching low surrogate. In that case you probably want to return the first char as-is and rewind the reader by one char, so that the next read gives you that second character. But this only works if input.markSupported() == true. If you can guarantee that, then what about:

 public int read_code_point(Reader input) throws java.io.IOException {
     int firstChar = input.read();
     if (firstChar < 0 || !Character.isHighSurrogate((char) firstChar)) {
         return firstChar;
     } else {
         input.mark(1);
         int secondChar = input.read();
         if (secondChar < 0) {
             // reached EOF
             return firstChar;
         } else if (!Character.isLowSurrogate((char) secondChar)) {
             // unpaired surrogates, un-read the second char
             input.reset();
             return firstChar;
         } else {
             return Character.toCodePoint((char) firstChar, (char) secondChar);
         }
     }
 }

Or you can wrap the original reader in a PushbackReader and use unread(secondChar).
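A minimal sketch of that PushbackReader variant (the method name and structure here are my own, assuming the default pushback buffer of one char is enough):

 public int read_code_point(java.io.PushbackReader input) throws java.io.IOException {
     int firstChar = input.read();
     if (firstChar < 0 || !Character.isHighSurrogate((char) firstChar)) {
         return firstChar;
     }
     int secondChar = input.read();
     if (secondChar < 0) {
         return firstChar;                 // EOF after an unpaired high surrogate
     } else if (!Character.isLowSurrogate((char) secondChar)) {
         input.unread(secondChar);         // push the non-matching char back
         return firstChar;
     } else {
         return Character.toCodePoint((char) firstChar, (char) secondChar);
     }
 }

The caller would wrap the underlying reader once, e.g. new java.io.PushbackReader(reader), and keep using that wrapper for all reads.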


Full Unicode can be represented both in UTF-8 (as byte sequences) and in UTF-16 (as 16-bit units, i.e. Java chars). From a String, the full Unicode code points can be extracted like this:

 int[] codePoints = { 0x1d11e };
 String s = new String(codePoints, 0, codePoints.length);
 for (int i = 0; i < s.length(); ) {
     int cp = s.codePointAt(i);
     i += Character.charCount(cp);
 }

For a file that is mostly Latin letters, UTF-8 is a good choice.
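As a side note (my suggestion, not part of the original answer), in the snippet below the charset could also be passed as java.nio.charset.StandardCharsets.UTF_8 instead of the string "UTF-8", which avoids the checked UnsupportedEncodingException:

 Reader r = new InputStreamReader(new FileInputStream(file),
         java.nio.charset.StandardCharsets.UTF_8);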

The following reads a full-Unicode file (in UTF-8) line by line:

 try (BufferedReader in = new BufferedReader(
         new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
     for (;;) {
         String line = in.readLine();
         if (line == null) {
             break;
         }
         ... do something with a Unicode line ...
     }
 } catch (FileNotFoundException e) {
     System.err.println("No file: " + file.getPath());
 } catch (IOException e) {
     ...
 }
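If you then need individual code points rather than whole lines, each line can be walked with String.codePoints() (this loop is my addition, just to show the idea):

 line.codePoints().forEach(cp ->
     System.out.printf("U+%04X%n", cp));   // a G-Clef shows up as U+1D11E, not as two chars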

A function that builds a Java String from one (or several) Unicode code points:

 String s1 = unicodeToString(0x1d11e);
 String s2 = unicodeToString(0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x1d11e);

 public static String unicodeToString(int... codePoints) {
     return new String(codePoints, 0, codePoints.length);
 }

Source: https://habr.com/ru/post/948303/

