Extended from this comment.
As mentioned in @pvg's comments, the string resulting from readAsBinaryString would be correct, but it ends up corrupted for two reasons:
A. The result is interpreted as ISO-8859-1 rather than ISO-8859-7. You can use this function to fix it:
    function convertFrom1to7(text) {
      // charset is the set of chars in the ISO-8859-7 encoding from 0xA0 and up, encoded with this format:
      //  - If the character is in the same position as in ISO-8859-1/Unicode, use a "!".
      //  - If the character is a Greek char with 720 subtracted from its char code, use a ".".
      //  - Otherwise, use \uXXXX format.
      var charset = "!\u2018\u2019!\u20AC\u20AF!!!!.!!!!\u2015!!!!...!...!.!....................!............................................!";
      var newtext = "", newchar = "";
      for (var i = 0; i < text.length; i++) {
        var char = text[i];
        newchar = char;
        if (char.charCodeAt(0) >= 160) {
          newchar = charset[char.charCodeAt(0) - 160];
          if (newchar === "!") newchar = char;
          if (newchar === ".") newchar = String.fromCharCode(char.charCodeAt(0) + 720);
        }
        newtext += newchar;
      }
      return newtext;
    }
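To see why the string comes out wrong in the first place, here is a minimal sketch of the corruption and of the "+720" shift used for Greek letters. The helper names `toBinaryString` and `shiftGreek` are made up for this illustration, and the byte array stands in for a file that was saved as ISO-8859-7:

```javascript
// Bytes of the Greek word "αβγ" as they would be stored in an ISO-8859-7 file.
var greekBytes = [0xE1, 0xE2, 0xE3];

// Hypothetical helper that mimics a binary string: one character per byte,
// with the character code equal to the byte value (i.e. an ISO-8859-1 reading).
function toBinaryString(bytes) {
  var result = "";
  for (var i = 0; i < bytes.length; i++) {
    result += String.fromCharCode(bytes[i]);
  }
  return result;
}

// Hypothetical helper: shifts every character by 720. This is only correct
// for bytes that hold Greek letters, which is why the full function needs a table.
function shiftGreek(text) {
  var result = "";
  for (var i = 0; i < text.length; i++) {
    result += String.fromCharCode(text.charCodeAt(i) + 720);
  }
  return result;
}

var misread = toBinaryString(greekBytes);
var repaired = shiftGreek(misread);

console.log(misread);  // áâã
console.log(repaired); // αβγ
```

The shift works because byte 0xE1 (225) plus 720 gives 945, which is U+03B1, the Greek small letter alpha.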
B. The Chinese character is not part of the ISO-8859-7 encoding (it is a single-byte encoding, so it supports at most 256 unique characters, as the table shows). If you want to include arbitrary Unicode characters in a program, you will probably need to do one of these two things:
- Count the program's bytes in a Unicode encoding instead, i.e. UTF-8 or UTF-16. This can be done quite easily with the library you linked. However, if you want this to happen automatically, you will need a function that checks whether the contents of the textarea form a valid ISO-8859-7 file, for example:
    function isValidISO_8859_7(text) {
      // Matches any single character that exists in ISO-8859-7.
      // U+03A2 is left out because byte 0xD2 is unassigned in the encoding.
      var charset = /[\u0000-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03A1\u03A3-\u03CE]/;
      var valid = true;
      for (var i = 0; i < text.length; i++) {
        valid = valid && charset.test(text[i]);
      }
      return valid;
    }
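For the fallback case, where the text is not valid ISO-8859-7 and has to be counted in UTF-8 or UTF-16, a minimal sketch of the byte counting could look like this. It does not use the library mentioned above; `utf8ByteLength` and `utf16ByteLength` are hypothetical names, and the UTF-16 count assumes no byte order mark:

```javascript
// Hypothetical helper: number of bytes the string would take in UTF-8.
function utf8ByteLength(text) {
  var bytes = 0;
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1; // ASCII
    } else if (code < 0x800) {
      bytes += 2; // e.g. Greek letters
    } else if (code >= 0xD800 && code <= 0xDBFF && i + 1 < text.length &&
               text.charCodeAt(i + 1) >= 0xDC00 && text.charCodeAt(i + 1) <= 0xDFFF) {
      bytes += 4; // surrogate pair: one code point above U+FFFF
      i++;        // skip the low surrogate
    } else {
      bytes += 3; // rest of the BMP, e.g. Chinese characters
    }
  }
  return bytes;
}

// Hypothetical helper: number of bytes in UTF-16 (2 per code unit, no BOM).
function utf16ByteLength(text) {
  return text.length * 2;
}

console.log(utf8ByteLength("\u03B1\u03B2\u03B3")); // 6
console.log(utf8ByteLength("\u4E2D"));             // 3
console.log(utf16ByteLength("\u4E2D"));            // 2
```

Note that Greek text costs 2 bytes per letter in UTF-8 but only 1 in ISO-8859-7, so it is worth falling back to Unicode only when the validity check fails.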
- Create your own custom version of ISO-8859-7 that uses a specific byte (or more than one) to signal that the next 2 or 3 bytes belong to a single Unicode char. It can be as simple or as complex as you like, from one byte marking a 2-byte char and another marking a 3-byte char, up to using everything between 80 and 9F to describe the bytes that follow. Here is a basic example that uses 80 for a 2-byte char and 81 for a 3-byte char (assuming the text has been read as ISO-8859-1):
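    function reUnicode(text) {
      var newtext = "";
      for (var i = 0; i < text.length; i++) {
        if (text.charCodeAt(i) === 0x80) {
          // 0x80: the next 2 bytes form one UTF-16 code unit.
          newtext += String.fromCharCode((text.charCodeAt(++i) << 8) + text.charCodeAt(++i));
        } else if (text.charCodeAt(i) === 0x81) {
          // 0x81: the next 3 bytes form a code point above U+FFFF, emitted as a surrogate pair.
          var charcode = (text.charCodeAt(++i) << 16) + (text.charCodeAt(++i) << 8) + text.charCodeAt(++i) - 65536;
          newtext += String.fromCharCode(0xD800 + (charcode >> 10), 0xDC00 + (charcode & 1023));
        } else {
          // Assumed ending, since the listing is cut off at this point: copy every other char unchanged.
          newtext += text[i];
        }
      }
      return newtext;
    }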
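The writing side of such a scheme could be sketched as follows. This is only an illustration under the same assumptions (80 introduces 2 bytes, 81 introduces 3 bytes holding the full code point); `escapeCodePoint` is a hypothetical name, not part of any library:

```javascript
// Hypothetical encoder for one character that ISO-8859-7 cannot represent.
// Returns the byte values to write into the file.
function escapeCodePoint(codePoint) {
  if (codePoint <= 0xFFFF) {
    // Marker 0x80 followed by the high and low byte of the code unit.
    return [0x80, codePoint >> 8, codePoint & 0xFF];
  }
  // Marker 0x81 followed by the three bytes of the code point.
  return [0x81, codePoint >> 16, (codePoint >> 8) & 0xFF, codePoint & 0xFF];
}

console.log(escapeCodePoint(0x4E2D));  // [ 128, 78, 45 ]
console.log(escapeCodePoint(0x1F600)); // [ 129, 1, 246, 0 ]
```

With this layout a Chinese character costs 3 bytes (marker plus 2), while everything ISO-8859-7 already covers stays at 1 byte.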
I can go into more detail on either method if you wish.