Start with a few questions.
How often do you...
- need to write an application that handles something other than ASCII?
- need to write a multilingual app?
- Are you writing an application that should be multilingual from its first version?
- Have you heard that Unicode is used to represent non-ASCII characters?
- Have you read that Unicode is an encoding? Is Unicode actually an encoding?
- Have you seen how people confuse UTF-8 encoded bytes with Unicode data?
Do you know the difference between collation (sort order) and encoding?
Where did you first learn about Unicode?
- At school? (Really?)
- At work?
- On a trendy blog?
Have you ever, in your youth, moved source files from a system with locale A to a system with locale B, fixed a typo on system B, saved the files, b0rked every non-ASCII comment in the process, and ended up spending a lot of time trying to understand what happened? (Did your editor mix things up? The compiler? The system? ...?)
Did you eventually decide that you would never again comment your code with non-ASCII characters?
See what is done elsewhere
Python
Did I mention that I love Python? No? Well, I love Python.
But before Python 3.0, its Unicode support was hard to digest. There were all these novice programmers who had barely learned how to write a loop, getting UnicodeDecodeError and UnicodeEncodeError out of nowhere as soon as they tried to handle non-ASCII characters. Well, they were basically traumatized by living with the Unicode monster, and I know a lot of very efficient / experienced Python coders who are still scared today by the idea of dealing with Unicode data.
With Python 3 there is a clear separation between Unicode strings and byte strings, but ... look how difficult it is to port an application from Python 2.x to Python 3.x if you did not care about that separation before / if you do not really understand what Unicode is.
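To make that separation concrete, here is a minimal Python 3 sketch. The sample string and the variable names are only illustrative; the point is that text (str) and encoded data (bytes) are different types that have to be converted explicitly.

```python
# str holds Unicode code points; bytes holds one particular encoding of them.
text = "na\u00efve caf\u00e9"    # "naïve café" as a str
data = text.encode("utf-8")      # the same text as UTF-8 encoded bytes

print(type(text).__name__, len(text))   # length counted in code points
print(type(data).__name__, len(data))   # length counted in bytes

# Decoding with the codec that produced the bytes round-trips cleanly.
roundtrip = data.decode("utf-8")

# Decoding with the wrong codec is where UnicodeDecodeError comes from.
try:
    data.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Python 3 refuses to mix the two types implicitly.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(repr(decode_error), mixing_allowed)
```

In Python 2 the last operation would have attempted an implicit ASCII conversion, which is one likely source of the errors "from nowhere" described above.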
Databases, PHP
Do you know of a popular commercial website that stores its international text as Unicode?
You might be surprised to learn that the Wikipedia backend does not save its data using Unicode. All text is encoded in UTF-8 and stored as binary data in a database.
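The "UTF-8 bytes in a binary column" approach can be sketched like this. It uses an in-memory SQLite database purely as a stand-in; the table and column names are made up for the example and say nothing about Wikipedia's actual schema.

```python
import sqlite3

# Hypothetical table: the text column is declared as a BLOB, so the
# database only ever sees opaque bytes, never "characters".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (id INTEGER PRIMARY KEY, body BLOB)")

original = "Z\u00fcrich, \u6771\u4eac"   # "Zürich, 東京"

# The application encodes on the way in...
conn.execute("INSERT INTO page (body) VALUES (?)", (original.encode("utf-8"),))

# ...and decodes on the way out.
(stored,) = conn.execute("SELECT body FROM page").fetchone()
restored = stored.decode("utf-8")

print(type(stored).__name__, len(stored), restored == original)
conn.close()
```

The trade-off is that the database can store and return the data faithfully, but anything it does with it (comparison, ordering, length) works on bytes rather than on text.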
One of the key issues is how to sort text data if you store it as Unicode code points. There are Unicode collations that define the sort order of Unicode code points. But correct collation support in databases is missing / is under active development. (There are probably also a lot of performance issues. - IANADBA). Also, there is no universally accepted standard for collation: for some languages, people do not agree on how to sort letters / words.
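A small sketch of why sort order is a separate problem from storage. The word list is arbitrary, and `crude_key` is a deliberately naive helper invented for this example (strip accents, ignore case); it is not the Unicode Collation Algorithm and it ignores language-specific rules entirely.

```python
import unicodedata

words = ["zebra", "\u00e9clair", "Zebra", "apple", "eclair"]

# Default sort: plain code point order. Uppercase sorts before lowercase,
# and accented letters land after "z".
by_code_point = sorted(words)

def crude_key(word):
    """Naive stand-in for a collation key: drop accents, ignore case."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original word is kept as a tie-breaker so the order is deterministic.
    return (stripped.casefold(), word)

by_crude_key = sorted(words, key=crude_key)

print(by_code_point)
print(by_crude_key)
```

Even this toy key produces an order most readers would find more natural than code point order, but whether "é" should sort with "e", after "e", or after "z" is exactly the kind of thing that differs between languages.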
Have you heard of Unicode normalization? (Basically, you should convert Unicode data to a canonical representation before storing it.) Of course, this matters for database storage and for comparisons. But PHP, for example, has only provided normalization support since 5.2.4, which was released in August 2007.
And in fact, PHP does not yet fully support Unicode. We will have to wait for PHP6 to get Unicode-aware functions everywhere.
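To illustrate the normalization point, here is a short sketch in Python (used here only because its standard library ships the `unicodedata` module). The two strings render identically but are built from different code points.

```python
import unicodedata

composed = "caf\u00e9"      # "é" as one precomposed code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# Without normalization the two spellings are different strings.
equal_raw = composed == decomposed

# Normalizing both to the same form (NFC here) makes them comparable.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
equal_nfc = nfc_a == nfc_b

print(len(composed), len(decomposed), equal_raw, equal_nfc)
```

Normalizing before storing means a lookup for one spelling will also find data that was entered with the other.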
So why isn't everything we do in Unicode?
- Some people don't need Unicode.
- Some people don't care.
- Some people do not realize that they will need Unicode support later.
- Some people do not understand Unicode.
- For some others, Unicode is a bit like accessibility for webapps: you start without it and add support for it later.
- Many popular libraries / languages / applications do not have proper, full Unicode support, not to mention the collation and normalization problems. And as long as some element of your development stack does not fully support Unicode, you cannot write a clean Unicode program.
The Internet clearly helps spread the Unicode trend. And this is good. Successive initiatives like Python 3 help educate people about this issue. But we will have to wait patiently a bit longer before we see Unicode everywhere, and new programmers instinctively using Unicode instead of byte strings where it matters.
For a joke, because FedEx does not seem to support international addresses,