Counting the byte size of a file encoded in ISO 8859-7 in JavaScript

Background

I am writing an esoteric language called Jolf. It is used on the excellent Code Golf SE site. In case you don't already know, many challenges there are scored in bytes. People have made many languages that use either their own encoding or an existing one.

In my language's interpreter, I have a byte counter. As you would expect, it counts the number of bytes in the code. So far I have used a UTF-8 en/decoder (utf8.js). I am now switching to the ISO 8859-7 encoding, which has Greek characters. Actually loading the text doesn't really work, though. I need to count the actual bytes contained in an uploaded file. Also, is there a way to read the contents of a file in a specified encoding?

Question

Given a file encoded in ISO 8859-7, obtained from an <input> element on the page, is there a way to get the number of bytes contained in that file? And, given "plaintext" (i.e. text typed directly into a <textarea>), how can I count the bytes in it as if it were encoded in ISO 8859-7?

What I tried

The input element is called isogreek, and the file has already been selected in it. The file's contents are ΦX族: a Greek character and a Latin character (each of which should be one byte) and a Chinese character, which should be more than one byte (?).

    isogreek.files[0].size; // is 3; should be more.
    var reader = new FileReader();
    reader.readAsBinaryString(isogreek.files[0]); // corrupts the string to `ÖX?`
    reader.readAsText(isogreek.files[0]); //  X?
    reader.readAsText(isogreek.files[0], "ISO 8859-7"); //  X?
3 answers

Expanded from this comment.

As @pvg mentioned in the comments, the string returned by readAsBinaryString would be the right one, but it comes out corrupted for two reasons:

a. The result is decoded as ISO-8859-1. You can use a function like this to fix that:

    function convertFrom1to7(text) {
      // charset is the set of chars in the ISO-8859-7 encoding from 0xA0 and up, encoded with this format:
      // - If the character is in the same position as in ISO-8859-1/Unicode, use a "!".
      // - If the character is a Greek char with 720 subtracted from its char code, use a ".".
      // - Otherwise, use \uXXXX format.
      var charset = "!\u2018\u2019!\u20AC\u20AF!!!!.!!!!\u2015!!!!...!...!.!....................!............................................!";
      var newtext = "", newchar = "";
      for (var i = 0; i < text.length; i++) {
        var char = text[i];
        newchar = char;
        if (char.charCodeAt(0) >= 160) {
          newchar = charset[char.charCodeAt(0) - 160];
          if (newchar === "!") newchar = char;
          if (newchar === ".") newchar = String.fromCharCode(char.charCodeAt(0) + 720);
        }
        newtext += newchar;
      }
      return newtext;
    }

b. The Chinese character is not part of the ISO-8859-7 encoding (the encoding supports at most 256 unique characters, as its code table shows). If you want to include arbitrary Unicode characters in a program, you will probably need to do one of these two things:

  1. Count the bytes of the program in a Unicode encoding instead, i.e. UTF-8 or UTF-16. This can be done quite easily with the library you linked. However, if you want this to happen automatically, you will need a function that checks whether the contents of the textarea are a valid ISO-8859-7 file, for example:

      function isValidISO_8859_7(text) {
        // Matches any single character that ISO-8859-7 can represent.
        var charset = /[\u0000-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03CE]/;
        var valid = true;
        for (var i = 0; i < text.length; i++) {
          valid = valid && charset.test(text[i]);
        }
        return valid;
      }

  2. Create your own custom version of ISO-8859-7 that uses a specific byte (or more than one) to indicate that the next 2 or 3 bytes belong to a single Unicode char. It can be as simple or as complex as you like, from one byte marking a 2-byte char and another marking a 3-byte char, up to using every byte between 80 and 9F as a prefix for the bytes that follow. Here is a basic example that uses 80 as the 2-byte marker and 81 as the 3-byte marker (assuming the text has been read as ISO-8859-1):

      function reUnicode(text) {
        var newtext = "";
        for (var i = 0; i < text.length; i++) {
          if (text.charCodeAt(i) === 0x80) {
            // Marker 0x80: the next 2 bytes form one 16-bit char code.
            newtext += String.fromCharCode((text.charCodeAt(++i) << 8) + text.charCodeAt(++i));
          } else if (text.charCodeAt(i) === 0x81) {
            // Marker 0x81: the next 3 bytes form a code point above 0xFFFF.
            var charcode = (text.charCodeAt(++i) << 16) + (text.charCodeAt(++i) << 8) + text.charCodeAt(++i) - 65536;
            newtext += String.fromCharCode(0xD800 + (charcode >> 10), 0xDC00 + (charcode & 1023)); // Convert into a UTF-16 surrogate pair
          } else {
            newtext += convertFrom1to7(text[i]);
          }
        }
        return newtext;
      }
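For the byte counter itself, the marker scheme in option 2 can be costed without actually encoding anything. The following is only a rough sketch: the name countEscapedBytes is hypothetical, and it assumes that U+0080 and U+0081 must themselves be escaped because their byte values are taken by the markers.

```javascript
// Characters representable as a single ISO-8859-7 byte (same class as in isValidISO_8859_7).
var ISO_8859_7_CHAR = /[\u0000-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03CE]/;

// Hypothetical helper: byte count of `text` under the 0x80/0x81 marker scheme.
function countEscapedBytes(text) {
  var count = 0;
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    var next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    if (code >= 0xD800 && code <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
      count += 4; // 0x81 marker + 3 bytes for a code point above 0xFFFF
      i++;        // skip the low surrogate
    } else if (code !== 0x80 && code !== 0x81 && ISO_8859_7_CHAR.test(text.charAt(i))) {
      count += 1; // plain ISO-8859-7 byte
    } else {
      count += 3; // 0x80 marker + 2 bytes (assumption: this also covers U+0080 and U+0081)
    }
  }
  return count;
}
```

With the example from the question, countEscapedBytes("ΦX族") gives 5: one byte each for Φ and X, plus a marker and two bytes for 族.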

I can go into more detail on either method if you wish.


The three characters you gave as an example are encoded in 6 bytes in UTF-8: ce a6 58 e6 97 8f (0x58 = X). Also: JavaScript works with UTF-16, which leads to some funny things like ("abc".length === "ΦX族".length) being true.

You most likely have to go the full length and check every single character for its byte length by its code value. In some cases you may also need to look at two characters at once (a UTF-32 code point that takes two UTF-16 units). In addition, you need to set up a specification and check the input against it (that is always necessary when you work with files from unknown sources).
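As an illustration of checking each character by its code value, here is a minimal sketch that counts UTF-8 bytes directly from a JavaScript string. The function name utf8ByteLength is made up for this example, and counting a lone surrogate as 3 bytes is an assumption (it matches encoders that substitute U+FFFD).

```javascript
// Hypothetical helper: number of bytes `text` would occupy when encoded as UTF-8.
function utf8ByteLength(text) {
  var bytes = 0;
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1; // ASCII
    } else if (code < 0x800) {
      bytes += 2; // e.g. Greek letters
    } else if (code >= 0xD800 && code <= 0xDBFF && i + 1 < text.length &&
               text.charCodeAt(i + 1) >= 0xDC00 && text.charCodeAt(i + 1) <= 0xDFFF) {
      bytes += 4; // surrogate pair: one code point above 0xFFFF
      i++;        // the low surrogate was consumed together with the high one
    } else {
      bytes += 3; // rest of the BMP (assumption: lone surrogates are counted here too)
    }
  }
  return bytes;
}
```

For the example string, utf8ByteLength("ΦX族") returns 6, which agrees with the six bytes listed above.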

EDIT: Added by request:

JavaScript strings are always UTF-16, a two-byte character representation. That was all well and good until it was suddenly (ha!) discovered that two bytes are not enough for all the alphabets of the world, so the Unicode range was extended to need up to four bytes: UTF-32.

Well, the Unicode consortium did this, but the ECMA committee did not.

This is not to say that hell broke loose, but in some cases it is pretty close, and one of them is your case because you want to mix single-byte encodings with multi-byte encodings, even different ones.

One byte fits nicely into two bytes, but three or more bytes do not, so the so-called surrogates were created. These surrogates are also the reason why manipulating a string in JavaScript is not so simple.
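A small sketch of what surrogates do to everyday string handling, assuming an engine with ES2015 support for Array.from; the helper names are invented for this illustration.

```javascript
// Hypothetical helper: count Unicode code points rather than UTF-16 code units.
function countCodePoints(text) {
  var count = 0;
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    // A high surrogate followed by a low surrogate is a single code point.
    if (code >= 0xD800 && code <= 0xDBFF && i + 1 < text.length) {
      var next = text.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) i++;
    }
    count++;
  }
  return count;
}

// Hypothetical helper: reverse a string without splitting surrogate pairs.
// Array.from iterates a string by code point, not by code unit.
function reverseByCodePoint(text) {
  return Array.from(text).reverse().join("");
}

var sample = "a\uD83D\uDE00b";                        // "a", one emoji, "b"
var naive = sample.split("").reverse().join("");     // swaps the two surrogates, corrupting the emoji
var careful = reverseByCodePoint(sample);            // keeps the pair in order
```

Here sample.length is 4 while countCodePoints(sample) is 3, and only the code-point-aware reversal keeps the emoji intact.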

As I said: a large can of worms.


As an aside, if you want ETHProductions' code to be browser-friendly, use this instead:

    function convertFrom1to7(text) {
      // charset is the set of chars in the ISO-8859-7 encoding from 0xA0 and up, encoded with this format:
      // - If the character is in the same position as in ISO-8859-1/Unicode, use a "!".
      // - If the character is a Greek char with 720 subtracted from its char code, use a ".".
      // - Otherwise, use \uXXXX format.
      var charset = "!\u2018\u2019!\u20AC\u20AF!!!!.!!!!\u2015!!!!...!...!.!....................!............................................!";
      var newtext = "", newchar = "";
      for (var i = 0; i < text.length; i++) {
        var char = text[i];
        newchar = char;
        if (char.charCodeAt(0) >= 160) {
          newchar = charset[char.charCodeAt(0) - 160];
          if (newchar === "!") newchar = char;
          if (newchar === ".") newchar = String.fromCharCode(char.charCodeAt(0) + 720);
        }
        newtext += newchar;
      }
      return newtext;
    }
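The same table can be run in the other direction to handle the textarea half of the question: turning typed text into ISO-8859-7 bytes so they can be counted. This is only a sketch. The names encodeISO88597, SAME and SPECIAL are invented here, and returning null for text that does not fit the encoding is an assumption about the desired behaviour.

```javascript
// Upper-half bytes that ISO-8859-7 shares with ISO-8859-1/Unicode.
var SAME = [0xA0, 0xA3, 0xA6, 0xA7, 0xA8, 0xA9, 0xAB, 0xAC, 0xAD,
            0xB0, 0xB1, 0xB2, 0xB3, 0xB7, 0xBB, 0xBD];
// Code points that map to a byte by neither identity nor the 720 offset.
var SPECIAL = { 0x2018: 0xA1, 0x2019: 0xA2, 0x20AC: 0xA4, 0x20AF: 0xA5, 0x2015: 0xAF };

// Hypothetical helper: returns an array of byte values, or null if some
// character has no ISO-8859-7 representation.
function encodeISO88597(text) {
  var bytes = [];
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    var isGreek = code === 0x037A ||
      (code >= 0x0384 && code <= 0x03CE &&
       code !== 0x0387 && code !== 0x038B && code !== 0x038D && code !== 0x03A2);
    if (code < 0xA0 || SAME.indexOf(code) !== -1) {
      bytes.push(code);          // same position as in ISO-8859-1
    } else if (SPECIAL.hasOwnProperty(code)) {
      bytes.push(SPECIAL[code]); // quotes, euro sign, drachma sign, horizontal bar
    } else if (isGreek) {
      bytes.push(code - 720);    // Greek block sits 720 above its byte value
    } else {
      return null;               // not representable in ISO-8859-7
    }
  }
  return bytes;
}
```

For example, encodeISO88597("ΦX") yields [0xD6, 0x58], so its length of 2 is the byte count, while encodeISO88597("ΦX族") returns null because the Chinese character has no ISO-8859-7 byte.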

Source: https://habr.com/ru/post/1240580/

