Java String.getBytes("UTF8") equivalent in JavaScript

I have read the article "Bytes to string and back".

The functions given there work correctly, in that pack(unpack("string")) yields "string". But I would like to get the same result that "string".getBytes("UTF8") gives in Java.

The question is: how can I write a JavaScript function that does the same thing as Java's getBytes("UTF8")?

For Latin strings, unpack(str) from the article above gives the same result as getBytes("UTF8"), except that it adds a 0 at the odd positions. With non-Latin strings, however, it seems to work in a completely different way. Is there a way to work with string data in JavaScript the way Java does?

4 answers

You can use this function (gist):

    function toUTF8Array(str) {
        var utf8 = [];
        for (var i = 0; i < str.length; i++) {
            var charcode = str.charCodeAt(i);
            if (charcode < 0x80) {
                utf8.push(charcode);
            } else if (charcode < 0x800) {
                utf8.push(0xc0 | (charcode >> 6),
                          0x80 | (charcode & 0x3f));
            } else if (charcode < 0xd800 || charcode >= 0xe000) {
                utf8.push(0xe0 | (charcode >> 12),
                          0x80 | ((charcode >> 6) & 0x3f),
                          0x80 | (charcode & 0x3f));
            } else {
                // Let's keep things simple and only handle chars up to U+FFFF:
                // each surrogate code unit becomes U+FFFD, the "replacement character".
                utf8.push(0xef, 0xbf, 0xbd);
            }
        }
        return utf8;
    }

Usage example:

    >>> toUTF8Array("中€")
    [228, 184, 173, 226, 130, 172]

If you want negative numbers for values greater than 127, as in Java's byte-to-int conversion, you need to adjust the constants and use

  utf8.push(0xffffffc0 | (charcode >> 6), 0xffffff80 | (charcode & 0x3f)); 

and

  utf8.push(0xffffffe0 | (charcode >> 12), 0xffffff80 | ((charcode>>6) & 0x3f), 0xffffff80 | (charcode & 0x3f)); 
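
As an alternative to editing the constants, the unsigned values can be converted in a separate pass. A minimal sketch, assuming a hypothetical helper named toSignedBytes (not part of the original answer) that takes an array such as the one returned by toUTF8Array:

```javascript
// Hypothetical helper: map unsigned byte values (0..255) to the signed
// range (-128..127) that Java uses for its byte type.
function toSignedBytes(bytes) {
  return bytes.map(function (b) {
    // Shift the byte into the top 8 bits of a 32-bit integer, then
    // arithmetic-shift it back so that the sign bit is extended.
    return (b << 24) >> 24;
  });
}

// UTF-8 bytes of "中€", as listed in the usage example of this answer.
var signed = toSignedBytes([228, 184, 173, 226, 130, 172]);
console.log(signed); // [-28, -72, -83, -30, -126, -84]
```

Values up to 127 pass through unchanged, so ASCII text is unaffected by the conversion.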

You do not need to write a complete UTF-8 encoder; there is a much simpler JS idiom for converting a Unicode string to a byte string representing UTF-8 code units:

 unescape(encodeURIComponent(str)) 

(This works because the odd encoding used by escape/unescape treats each %xx hexadecimal sequence as the ISO-8859-1 character with that code, whereas URI-component escaping treats the sequences as UTF-8 bytes. Similarly, decodeURIComponent(escape(bytes)) goes in the other direction.)
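
To illustrate both directions, here is a small sketch. The names utf8Encode and utf8Decode are made up for this example, and since escape and unescape are legacy functions, treat it as a demonstration of the idiom and not as a recommendation.

```javascript
// Hypothetical wrappers around the idiom described above.
function utf8Encode(str) {
  // Every character of the result has a code in 0..255: one UTF-8 byte.
  return unescape(encodeURIComponent(str));
}

function utf8Decode(byteString) {
  // Reverse direction: a string of byte-valued characters back to Unicode.
  return decodeURIComponent(escape(byteString));
}

var encoded = utf8Encode("€");
console.log(encoded.length);              // 3 (one character per UTF-8 byte)
console.log(encoded.charCodeAt(0));       // 226
console.log(utf8Decode(encoded) === "€"); // true
```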

So, if you want to get an array, it will be:

    function toUTF8Array(str) {
        var utf8 = unescape(encodeURIComponent(str));
        var arr = new Array(utf8.length);
        for (var i = 0; i < utf8.length; i++) {
            arr[i] = utf8.charCodeAt(i);
        }
        return arr;
    }

TextEncoder is part of the Encoding Living Standard and, according to the Encoding API entry on the Chromium Dashboard, it has shipped in Firefox and in Chrome 38. There is also a text-encoding polyfill available for other browsers.

The following JavaScript code example returns a Uint8Array with the expected values.

 (new TextEncoder()).encode("string") // [115, 116, 114, 105, 110, 103] 

A more interesting example, which better shows the UTF-8 encoding, replaces the "in" in "string" with "îñ":

 (new TextEncoder()).encode("strîñg") // [115, 116, 114, 195, 174, 195, 177, 103]
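
The reverse direction, which the pack function in the question covers, can be handled by TextDecoder from the same standard. A minimal sketch, assuming an environment where TextEncoder and TextDecoder are available as globals (current browsers and Node.js):

```javascript
// Encode a string to UTF-8 bytes and decode the bytes back to a string.
var encoder = new TextEncoder();        // TextEncoder always produces UTF-8
var decoder = new TextDecoder("utf-8");

var bytes = encoder.encode("中€");
console.log(Array.from(bytes));         // [228, 184, 173, 226, 130, 172]
console.log(decoder.decode(bytes));     // 中€
```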

The following functions also handle code points above U+FFFF.

Since JavaScript strings are UTF-16, a character outside the BMP is represented by two code units in the string, and charCodeAt returns the individual surrogate codes. The fixedCharCodeAt function below handles this.

    function encodeTextToUtf8(text) {
        var bin = [];
        for (var i = 0; i < text.length; i++) {
            var v = fixedCharCodeAt(text, i);
            if (v === false) continue; // low surrogate, already handled
            encodeCharCodeToUtf8(v, bin);
        }
        return bin;
    }

    function encodeCharCodeToUtf8(codePt, bin) {
        if (codePt <= 0x7F) {
            bin.push(codePt);
        } else if (codePt <= 0x7FF) {
            bin.push(192 | (codePt >> 6),
                     128 | (codePt & 63));
        } else if (codePt <= 0xFFFF) {
            bin.push(224 | (codePt >> 12),
                     128 | ((codePt >> 6) & 63),
                     128 | (codePt & 63));
        } else if (codePt <= 0x1FFFFF) {
            bin.push(240 | (codePt >> 18),
                     128 | ((codePt >> 12) & 63),
                     128 | ((codePt >> 6) & 63),
                     128 | (codePt & 63));
        }
    }

    function fixedCharCodeAt(str, idx) {
        // ex. fixedCharCodeAt('\uD800\uDC00', 0); // 65536
        // ex. fixedCharCodeAt('\uD800\uDC00', 1); // false
        idx = idx || 0;
        var code = str.charCodeAt(idx);
        var hi, low;
        if (0xD800 <= code && code <= 0xDBFF) {
            // High surrogate (could change last hex to 0xDB7F to treat
            // high private surrogates as single characters)
            hi = code;
            low = str.charCodeAt(idx + 1);
            if (isNaN(low)) {
                throw new Error('Invalid surrogate pair at position ' + idx);
            }
            return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
        }
        if (0xDC00 <= code && code <= 0xDFFF) {
            // Low surrogate: return false so that loops can skip this
            // iteration, since the high surrogate was already handled
            // in the previous iteration.
            return false;
        }
        return code;
    }
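
In engines that support ES2015, the surrogate handling can instead be delegated to the built-in codePointAt and to string iteration, which walks a string by code point. The following is a sketch under that assumption, not the method of the answer above: encodeCodePoints is a hypothetical name, and replacing unpaired surrogates with U+FFFD is a choice made for this example.

```javascript
// Hypothetical alternative: UTF-8 encode a string by iterating code points.
function encodeCodePoints(text) {
  var out = [];
  for (var ch of text) {              // iterates by code point, not by UTF-16 unit
    var cp = ch.codePointAt(0);
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      cp = 0xFFFD;                    // unpaired surrogate: substitute U+FFFD
    }
    if (cp <= 0x7F) {
      out.push(cp);
    } else if (cp <= 0x7FF) {
      out.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
      out.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    } else {
      out.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    }
  }
  return out;
}

console.log(encodeCodePoints("\uD83D\uDE00")); // U+1F600 -> [240, 159, 152, 128]
console.log(encodeCodePoints("a\uD800"));      // [97, 239, 191, 189]
```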

Source: https://habr.com/ru/post/1435356/

