Why isn't everything we do in Unicode?

Given that Unicode is around 18 years old, why are there still applications that don't support it? Even my experience with some operating systems and Unicode has been painful, to say the least. As Joel Spolsky pointed out back in 2003, it's not that hard. So what's the deal? Why can't we get it together?

+48
unicode internationalization
Jun 11 '09 at 3:44
17 answers

Start with a few questions.

How often...

  • Do you need to write an application that deals with something other than ASCII?
  • Do you need to write a multilingual application?
  • Are you writing an application that has to be multilingual from its very first version?
  • Have you heard that Unicode is used to represent non-ASCII characters?
  • Have you read that Unicode is an encoding? (Is Unicode actually an encoding?)
  • Do you see people confusing UTF-8 encoded bytes and Unicode data? (There is a small sketch of that distinction just below this list.)
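
A minimal Python 3 sketch of that last distinction; the sample string is arbitrary and chosen only for illustration:

    # Unicode text is a sequence of code points; UTF-8 is one way to serialize it to bytes.
    text = "héllo"                       # str: 5 code points
    data = text.encode("utf-8")          # bytes: b'h\xc3\xa9llo', 6 bytes
    print(len(text), len(data))          # 5 6
    print(data.decode("utf-8") == text)  # True: decoding recovers the text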

Do you know the difference between a collation and an encoding?

Where did you first learn about Unicode?

  • At school? (Really?)
  • At work?
  • On a trendy blog?

Have you ever, early in your career, moved source files from a system using locale A to a system using locale B, edited a typo on system B, saved the files, b0rked every non-ASCII comment, and... ended up spending a lot of time trying to understand what happened? (Did your editor mix things up? The compiler? The system? ...)

Did you eventually decide that you would never again comment your code with non-ASCII characters?

Have a look at what is done elsewhere

Python

Did I mention that I love Python? No? Well, I love Python.

But until Python 3.0, its Unicode support was painful. There were all these novice programmers, who at that point barely knew how to write a loop, getting UnicodeDecodeError and UnicodeEncodeError out of nowhere when trying to deal with non-ASCII characters. They were basically traumatized early on by the Unicode monster, and I know a lot of very efficient/experienced Python coders who are still scared today by the idea of dealing with Unicode data.

With Python 3 there is now a clear separation between Unicode strings and byte strings, but... look how hard it is to port an application from Python 2.x to Python 3.x if you never worried about that separation before, or if you don't really understand what Unicode is.
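
A rough Python 3 illustration of that separation; the sample strings are made up:

    raw = "naïve".encode("utf-8")    # bytes as they sit on disk or on the wire
    try:
        raw.decode("ascii")          # decoding with the wrong codec
    except UnicodeDecodeError as exc:
        print("decode failed:", exc)
    try:
        "label: " + raw              # Python 3 refuses to silently mix str and bytes
    except TypeError as exc:
        print("cannot mix str and bytes:", exc)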

Databases, PHP

Do you know of a popular commercial website that stores its international text as Unicode?

You might be surprised to learn that the Wikipedia backend does not store its data as Unicode: all text is encoded in UTF-8 and stored as binary data in the database.

One of the key issues is how to sort text data if you store it as Unicode code points. Enter the Unicode collation algorithms, which define a sort order for Unicode strings. But proper collation support in databases is missing or still under active development. (There are probably plenty of performance issues too. IANADBA.) Also, there is no universally accepted collation standard: for some languages, people do not even agree on how words and letters should be ordered.
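
A small Python sketch of the code-point-order versus linguistic-order problem; the German locale name is an assumption and may not be installed on a given system:

    import locale

    words = ["Äpfel", "apple", "zebra"]
    print(sorted(words))  # code-point order: ['apple', 'zebra', 'Äpfel']
    try:
        locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")  # assumed locale name
        print(sorted(words, key=locale.strxfrm))            # linguistic order: Äpfel sorts with the A words
    except locale.Error:
        print("locale not available on this system")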

Have you heard of Unicode normalization? (Basically, you should convert your Unicode data to a canonical representation before storing or comparing it.) It is of course essential for database storage and for local comparisons. But PHP, for example, only gained support for normalization in 5.2.4, which came out in August 2007.
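
Normalization in one tiny Python example: "é" can be one code point or two, and the two forms compare unequal until normalized to a canonical form:

    import unicodedata

    composed = "\u00e9"      # 'é' as a single code point
    decomposed = "e\u0301"   # 'e' followed by a combining acute accent
    print(composed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True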

And in fact, PHP still does not fully support Unicode. We will have to wait for PHP 6 to get Unicode-aware functions across the board.

So why isn't everything we do in Unicode?

  • Some people don't need Unicode.
  • Some people don't care.
  • Some people do not realize that they will need Unicode support later.
  • Some people do not understand Unicode.
  • For some others, Unicode is a bit like accessibility for web apps: you start without it and add support for it later.
  • Many popular libraries/languages/applications lack proper, full Unicode support, not to mention the collation and normalization issues. And until every element of your development stack fully supports Unicode, you cannot write a clean Unicode program.

The Internet clearly helps spread the Unicode trend, and that is a good thing. Initiatives like Python 3, one after another, help educate people about the issue. But we will have to wait patiently a while longer before we see Unicode everywhere and new programmers instinctively reaching for Unicode rather than byte strings where it matters.

As an anecdote: FedEx does not seem to support international addresses, ...

+55
Jun 11 '09 at 6:46
  • Many developers do not believe that their applications will ever be used in Asia or other regions where Unicode is a requirement.
  • Converting existing applications to Unicode is expensive and usually driven by sales opportunities.
  • Many companies have products supported on legacy systems, and switching to Unicode means a completely new development platform.
  • You would be surprised how many developers do not understand the implications of Unicode in a multilingual environment. It is not just a matter of using wide strings.

The bottom line is cost.

+22
Jun 11 '09 at 3:54

Probably because people are used to ASCII, and a lot of programming is done by native English speakers.

IMO, it is a function of collective habit rather than conscious choice.

+14
Jun 11 '09 at 3:47

The wide availability of development tools for working with Unicode may be a more recent thing than you realize. Working with Unicode was, until just a few years ago, a painful task of converting between character formats and dealing with incomplete or buggy implementations. You say it is not so difficult, and as the tools improve that becomes more and more true, but there are a lot of ways to get the details wrong unless good languages and libraries hide them from you. Heck, just cutting and pasting Unicode characters could be a questionable proposition a few years ago. Developer education also took some time, and you still see people making a ton of really basic mistakes.

The Unicode standard weighs probably ten pounds. Even just an overview of it would have to discuss the subtle differences between characters, glyphs, code points, and so on. Now think about ASCII: it is 128 characters. I can explain the whole thing to someone who knows binary in about 5 minutes.
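
Even the basic vocabulary takes a moment to absorb. A toy Python illustration of the character/code point distinction, using only the standard library:

    import unicodedata

    ch = "é"                        # one character on screen
    print(hex(ord(ch)))             # 0xe9: its code point
    print(unicodedata.name(ch))     # LATIN SMALL LETTER E WITH ACUTE
    print(len(ch.encode("utf-8")))  # 2: its length in UTF-8 bytes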

I believe that almost all software should be written with full Unicode support these days, but it has been a long road to a truly universal character set with encodings suited to a variety of purposes, and it is not over yet.

+14
Jun 11 '09 at 4:12

Laziness, ignorance.

+9
Jun 11 '09 at 3:46

One huge factor is programming language support, and most languages default to a character set that fits in 8 bits (like ASCII) for their strings. Java's String class uses UTF-16, and there are others that support flavors of Unicode, but many languages opt for simplicity. Space is so trivially a concern these days that coders who cling to "space-efficient" strings deserve a slap. Most people simply are not working on embedded devices, and even devices like cell phones (the big computing wave of the near future) can easily handle 16-bit character sets.
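
To put rough numbers on the space argument, here is Python used purely as a calculator (byte counts exclude any BOM; the sample strings are arbitrary):

    latin = "hello world"
    cjk = "你好世界"
    print(len(latin.encode("utf-8")), len(latin.encode("utf-16-le")))  # 11 vs 22 bytes
    print(len(cjk.encode("utf-8")), len(cjk.encode("utf-16-le")))      # 12 vs 8 bytes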

Another factor is that many programs are written to run in English only, and the developers (1) never plan (or even know how) to localize their code for multiple languages, and (2) often do not even think about handling input in non-Roman scripts. English is the dominant natural language spoken by programmers (at least when communicating with each other), and that has largely carried over into the software we produce. However, the apathy and/or ignorance certainly cannot last forever... Given that the mobile market in Asia completely eclipses most of the rest of the world, programmers are going to have to deal with Unicode pretty soon, whether they like it or not.

For what it's worth, I don't think the complexity of the Unicode standard is that big a factor for application programmers; it matters more for those who have to implement language support. When you program in a language where the hard work has already been done, there is even less excuse not to use the tools at hand. C'est la vie, old habits die hard.

+9
Jun 11 '09 at 4:38

Until recently, all operating systems were built on the assumption that a character is one byte. Their APIs were built that way, the tools were built that way, the languages were built that way.

Yes, it would be much better if everything I wrote were already... err... UTF-8? UTF-16? UTF-7? UTF-32? Err... mmm... it seems that whatever you pick, you will annoy someone. And, in fact, that is true.

If you pick UTF-16, then all of your data, which is to say the bulk of what the entire Western world produces, stops being trivially readable, because you lose ASCII compatibility. On top of that, a byte is no longer a character, which seriously breaks the assumptions today's software is built on. Besides, some countries do not accept UTF-16. Now, if you pick ANY variable-length encoding, you break some basic premises of a lot of software, such as finding the nth character without walking the whole string, or reading a line starting from an arbitrary offset.

And then there is UTF-32... well, that is four bytes per character. What was the average hard drive or memory size ten years ago? UTF-32 was simply too big!
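
A quick sketch of those trade-offs, written in Python only to show the byte patterns (the string is arbitrary):

    s = "hello"
    print(s.encode("utf-8"))           # b'hello': byte-identical to ASCII
    print(s.encode("utf-16-le"))       # b'h\x00e\x00l\x00l\x00o\x00': ASCII compatibility is gone
    print(len(s.encode("utf-32-le")))  # 20: a fixed 4 bytes per code point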

So the only solution is to change everything, software, utilities, operating systems, languages and tools, all at the same time, to become i18n-aware. Well. Good luck with "at the same time".

And since we cannot do everything at the same time, we always have to keep interoperating with things that are not i18n-aware. Which causes a vicious circle.

This is easier for end-user applications than for middleware or basic software, and some new languages are being built that way from the start. But... we still use Fortran libraries written in the 60s. That legacy does not go away.

+6
Jun 11 '09 at 4:41

Because UTF-16 became popular before UTF-8, and UTF-16 is a pig to work with. IMHO.

+6
Jun 11 '09 at 4:47

Because for 99% of applications, Unicode support is not a checkbox on the customer's product comparison matrix.

Add to the equation:

  • It takes a conscious effort, with little or no apparent benefit.
  • Many programmers are afraid of this or do not understand it.
  • Management REALLY does not understand or care about it, at least not until a customer screams about it.
  • The test team does not test Unicode compliance.
  • "We do not localize the user interface, so non-English speakers will not use it anyway."
+4
Jul 20 '09 at 18:09

Tradition and attitude. ASCII and computers are sadly synonymous to many people.

However, it would be naive to think that the role of Unicode is only a matter of exotic languages from Eurasia and other parts of the world. A rich character repertoire has a lot to offer even for "plain" English text; just look inside any printed book.

+3
Jul 20 '09 at 17:56

I would say there are basically two reasons. The first is simply that the Unicode support in your tools is not up to it. C++ still does not have Unicode support and will not get it until the next standard revision, which will take maybe a year or two to finish and then another five or ten years to reach widespread use. Many other languages are not much better, and even when you finally do get Unicode support, it may still be more cumbersome to use than plain ASCII strings.

The second reason is partly what causes the first: Unicode is not rocket science, but it hands you a ton of problems you never had to deal with in ASCII. With ASCII you had a clear one byte == one glyph relationship, you could address the Nth character of a string with a simple str[N], you could just keep all the characters of the whole set in memory, and so on. With Unicode you can no longer do any of that: you have to deal with different encodings (UTF-8, UTF-16, ...), byte order marks, decoding errors, fonts that cover only a subset of the characters you would need for full Unicode support, more glyphs than you want to keep in memory at a given time, and so on.
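
A small Python illustration of the indexing point; the string is arbitrary:

    s = "café latte"
    b = s.encode("utf-8")
    print(s[5])    # 'l': the sixth character of the decoded text
    print(b[5:6])  # b' ': the sixth byte, shifted because 'é' took two bytes
    # finding the Nth character in raw UTF-8 means scanning from the start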

ASCII could be understood just by looking at an ASCII table, with no further documentation; with Unicode, that simply no longer holds.

+2
Sep 10 '09 at 21:26

Because of the inertia caused by C++. It had (and still has) terrible Unicode support, and that held developers back.

+2
Nov 12 '10 at 21:58

Additional overhead.

0
Jun 11 '09 at 3:46

I suspect it is because software has such strong roots in the West. UTF-8 is a nice, compact format if you happen to live in America. But it is not so hot if you live in Asia. ;)

0
Jun 11 '09 at 3:49

Unicode requires more work (and more thinking). You usually only get paid for what is required, so you go with the fastest, least complicated option.

Well, that is my point of view anyway. I guess if you expect code to use std::wstring hw(L"hello world"), you have to explain how it all works: to print a wstring you need wcout, as in std::wcout << hw << std::endl; (I think), (though std::endl seems fine...). It just looks like more work to me. Of course, if I were writing an international application I would have to invest in figuring it out, but until then I have not (as I suspect most developers have not).

I guess it comes back to money; time is money.

0
Jun 11 '09 at 3:55

It is simple. Since we only have ASCII characters on our keyboards, why would we ever encounter or care about any other characters? It is not so much an attitude as it is what happens when a programmer has never had to think about the problem, has never run into it, and may not even know what Unicode is.

edit: In other words, Unicode is something you have to think about, and thinking is not something most people are inclined to do, even programmers.

0
Jun 11 '09 at 4:06

Personally, I do not like how some Unicode formats break the ability to just do string[3] to get a character by index. Sure, that can be abstracted away, but imagine how much slower a big string-heavy project like GCC would be if it had to walk the string to find the nth character. The only option is caching the "useful" positions, and even then it is slow, and in some formats you are now spending a good 4 bytes per character. To me, that is just ridiculous.

0
Jun 11 '09 at 4:51


