Can I depend on the behavior of charCodeAt() and fromCharCode() to stay the same?

I wrote a personal web application that uses charCodeAt() to convert text the user enters into the corresponding character codes (for example, βŠ‡ becomes 8839 for storage). The codes are sent to Perl, which passes them on to MySQL. For output, the application uses fromCharCode() to convert the numbers back to text.

I decided to do this because Perl's Unicode support is difficult to get working properly. This way, Perl and MySQL only ever see numbers, which simplifies things considerably.

My question is: can I depend on fromCharCode() to always convert a number like 8834 to the same character? I don't know which standard it follows, but suppose it uses UTF-8 today; if a future version were to switch to UTF-16, that would obviously break my program unless there is backward compatibility.

I know my grasp of these concepts is fuzzy, so please feel free to point out anything I appear to have misunderstood.

+6
6 answers

fromCharCode and charCodeAt work with Unicode code points, i.e. numbers between 0 and 65535 (0xFFFF), as long as all the characters are in the Basic Multilingual Plane (BMP). Unicode code point assignments are permanent, so you can trust them to stay the same forever.

Encodings such as UTF-8 and UTF-16 take a stream of code points (numbers) and produce a stream of bytes. JavaScript is somewhat unusual in that characters outside the BMP have to be built up from two charCodeAt values following the UTF-16 surrogate-pair rules. However, almost every character you will ever encounter (including Chinese, Japanese, etc.) is in the BMP, so your program will work even if you do not handle those cases.

One thing you can do is convert the numbers back to bytes (in big-endian int16 format) and interpret the resulting data as UTF-16. The behavior of fromCharCode and charCodeAt is fixed in current JavaScript implementations and will not change.

+9

I decided to do this because Perl's unicode support is very difficult to work properly.

That is ɴᴏᴛ true!

Perl has the most powerful Unicode support of any major programming language. Working with Unicode is much easier in Perl than in C, C++, Java, C♯, Python, Ruby, PHP, or JavaScript. This is not hyperbole, nor boosterism born of uneducated, blind loyalty; it is a considered assessment based on more than a decade of professional experience and study.

The problems naive users run into almost always stem from having fooled themselves about what Unicode is. The number-one brain bug is thinking that Unicode is just like ASCII, only bigger. That is absolutely and completely wrong. As I have written elsewhere:

It is fundamentally and critically untrue that UΙ΄Ιͺᴄᴏᴅᴇ is merely some enlarged character set compared with α΄€sα΄„ΙͺΙͺ. At most, that describes nothing more than Ιͺsᴏ-10646. UΙ΄Ιͺᴄᴏᴅᴇ involves much more than just the assignment of numbers to glyphs: collation and matching rules, three casing forms, non-letter casing, multi-code-point characters, canonical and compatibility normalization in both composed and decomposed forms, serialization forms, grapheme clusters, word and line breaking, scripts, numeric equivalences, widths, bidirectionality, mirroring, print widths, logical-order exceptions, glyph variants, contextual behavior, locales, regular expressions, several forms of combining characters, many types of decompositions, hundreds of critically useful properties, and much, much more!

Yes, that is a lot, but none of it has anything to do with Perl. It has to do with Unicode. That Perl gives you access to all of these things when you work with Unicode is not a bug but a feature. That those other languages do not give you full access to Unicode can in no way be construed as a point in their favor: rather, those are all defects of the highest possible severity, because if you cannot work with Unicode in the 21st century, then that language is primitive, broken, and fundamentally useless for the demanding requirements of modern text processing.

Perl is not. Things that are hard in other languages are easy in Perl; in most of them you cannot even begin to work around their design defects. You are simply stuck. If a language does not provide full Unicode support, it is not fit for this century; discard it.

Perl makes Unicode infinitely easier than languages that get in the way of using Unicode properly.

In this answer you will find, on the first page, Seven Simple Steps for dealing with Unicode in Perl, and at the bottom of that same answer some boilerplate code to help. Understand it, then use it. Do not accept brokenness. You have to learn Unicode before you can use Unicode.

And that is why there is no simple answer. Perl makes Unicode easy, provided you understand what Unicode really is. And if you are dealing with external sources, you still have to arrange for the proper encoding of that source.

Also read everything I have said there about π”Έπ•€π•€π•¦π•žπ•– π”Ήπ•£π• π•œπ•–π•Ÿπ•Ÿπ•–π•€π•€. That is something you really need to understand. One brokenness problem that falls out of Rule #49 is that JavaScript is broken because it does not handle all valid Unicode code points in exactly the same way irrespective of their plane. JavaScript is broken in almost every other way as well. It is unsuitable for Unicode work. Rule #34 alone will kill you, since you cannot get JavaScript to follow the Unicode standard for how things like \w are defined in Unicode regexes.

It is amazing how many languages are utterly useless for Unicode. But Perl is most definitely not one of them!

+5

In my opinion, it will not break.

Read Joel Spolsky's article on Unicode and character encodings. The relevant part is quoted below:

Every letter in every alphabet is assigned a number by the Unicode consortium, which is written like this: U+0639. This number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. The English letter A is U+0041.

It does not matter whether this magic number is encoded in UTF-8, UTF-16, or any other encoding. The code point stays the same.

+4

As stated in other answers, fromCharCode() and charCodeAt() deal with Unicode code points for any code point in the Basic Multilingual Plane (BMP). JavaScript strings use UCS-2/UTF-16 semantics, and any code point outside the BMP is represented as two JavaScript characters (a surrogate pair). None of this is going to change.

To handle any Unicode character from the JavaScript side, you can use the following function, which will return an array of numbers representing a sequence of Unicode code points for the specified string:

    var getStringCodePoints = (function() {
        // Combine a UTF-16 surrogate pair into a single code point.
        function surrogatePairToCodePoint(charCode1, charCode2) {
            return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
        }

        // Read the string character by character and build an array of code points.
        return function(str) {
            var codePoints = [],
                i = 0,
                charCode;
            while (i < str.length) {
                charCode = str.charCodeAt(i);
                if ((charCode & 0xF800) == 0xD800) {
                    codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
                } else {
                    codePoints.push(charCode);
                }
                ++i;
            }
            return codePoints;
        };
    })();

    var str = "πŒ†";
    var codePoints = getStringCodePoints(str);

    console.log(str.length);                 // 2
    console.log(codePoints.length);          // 1
    console.log(codePoints[0].toString(16)); // 1d306
+4

JavaScript strings are UTF-16; that is not going to change.

But do not forget that UTF-16 is a variable-length encoding.

+3

As of 2018, you can use String.prototype.codePointAt() and String.fromCodePoint().

These methods work even when the character is not in the Basic Multilingual Plane (BMP).

0

Source: https://habr.com/ru/post/889809/
