Is it better to use ISO-8859-1 rather than UTF-8 whenever possible?

To make scripts work globally, UTF-8 is very often used as the default encoding, e.g. in HTML or as the MySQL default. This is also done on Latin-script websites whose characters all fall within ISO-8859-1. Isn't it advantageous to use ISO-8859-1 when characters outside it are not needed? By advantageous, I mean critically beneficial.

My point is that only characters 0-127 take 1 byte in UTF-8, while characters 128-255 take 2 bytes, whereas ISO-8859-1 is a 1-byte system throughout. Does this play a crucial role in database storage?

4 answers

The 128 characters that UTF-8 encodes in 1 byte are the ones most commonly used when working with ISO-8859-1. Take a look here. If you use UTF-8, you will need 1 extra byte only when you use one of the characters 128-255 (not so common, I bet).

My opinion? Use UTF-8 if you can and if you have no problems handling it. Being saved on the day you need additional characters (or the day you need to translate your content) is really worth a few extra bytes here and there in the database...
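
A quick way to check these byte counts is to encode a few characters both ways. The following is a minimal Python sketch, not something from the original thread; the sample characters and the helper name byte_lengths are arbitrary choices for illustration.

```python
# Compare how many bytes a character needs in ISO-8859-1 and in UTF-8.
# The helper name and the sample characters are illustrative assumptions.
def byte_lengths(ch: str) -> tuple[int, int]:
    """Return (ISO-8859-1 length, UTF-8 length) in bytes for one character."""
    return len(ch.encode("iso-8859-1")), len(ch.encode("utf-8"))


for ch in ["A", "z", "é", "ñ", "ÿ"]:
    latin1_len, utf8_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X} {ch!r}: ISO-8859-1 = {latin1_len}, UTF-8 = {utf8_len}")
```

Characters below 128 should report 1 byte in both encodings, and the accented ones 1 byte in ISO-8859-1 versus 2 in UTF-8.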


If all you need now and forever is ISO-8859-1, you will save space by using it, although probably not much if most of the characters used are below 128. If you ever need something outside ISO-8859-1, you will find yourself in a world of hurt. Overall, the storage cost of UTF-8 is much lower than the cost of supporting several encodings.


Short answer: It does not matter.

Long answer: think about it. You have a messages table containing forum posts. You have a lot of posts (say 1 million). Suppose each post takes 10 extra bytes because of UTF-8. That is 10 million additional bytes, which is not even 10 MB (not counting indexes).

Even for such a "popular" forum, you will use no more than about 15 MB extra. That is nothing. You do not have to worry about the extra bytes, and UTF-8 brings benefits that matter far more than 10 MB.
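
The estimate can be reproduced in a few lines of Python. This is only a sketch of the arithmetic; the post count and the per-post overhead are the round figures assumed in this answer, not measurements.

```python
# Back-of-the-envelope estimate of the extra storage UTF-8 would need.
# Both inputs are the assumed figures from the answer, not measured values.
posts = 1_000_000
extra_bytes_per_post = 10

extra_bytes = posts * extra_bytes_per_post
extra_mib = extra_bytes / (1024 * 1024)

print(f"{extra_bytes:,} extra bytes = {extra_mib:.2f} MiB")
# prints: 10,000,000 extra bytes = 9.54 MiB
```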


Does size matter?

As you know, characters in the range U+0080 to U+00FF take up twice as much space in UTF-8 as in ISO-8859-1. But how often are these characters used?

In a typical Spanish text that I took from the front page of Wikipedia:

Artículo bueno

Las Vegas de la de la de Seville Simpson fue emitida originalmente por la cadena Fox entre el 17 de septiembre de 1995 y el 19 de mayo de 1996. Los productores ejecutivos de la séptima temporada fueron Bill Oakley and Josh Weinstein, quienes producirían 21 episodios de la temporada. David Mirkin Few El Show runner de los cuatro restantes, incluyendo dos vestigios que habían sido producidos para la temporada front. La séptima temporada estuvo nominada para dos Premios Primetime Emmy, incluyendo la categoría "Mejor programa animado (de duración menor a una hora)" y obtuvo un Premio Annie por "Mejor programa animado de televisión". for versioón en DVD fue lanzada a la venta en la Región 1 el 13 de diciembre de 2005, en la Región 2 el 30 de enero de 2006 y en la Región 4 el 29 de marzo del mismo año. La caja recopilatoria fue puesta a la venta en dos formatos diferentes: una caja con la forma de la cabeza de Marge y otra rectangular clásica, en la cual el dibujo muestra el estreno de una película.

In total, there are 17 non-ASCII characters out of 1,044 characters. That means an expansion of only about 1.6% when encoding in UTF-8. It is hardly worth worrying about, especially once the all-ASCII HTML markup is taken into account.

(However, the difference may be significant for a more heavily accented language such as Sango.)
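
The same measurement can be made for any text. The sketch below is Python, with a hypothetical helper name utf8_expansion and a short sample sentence standing in for the full excerpt, so its numbers will differ from the 17 and 1,044 quoted here.

```python
def utf8_expansion(text: str) -> tuple[int, int, float]:
    """Return (non-ASCII count, total characters, size growth in percent)
    for storing Latin-1 text as UTF-8 instead of ISO-8859-1.

    Assumes non-empty text made only of characters that exist in
    ISO-8859-1; anything else raises UnicodeEncodeError.
    """
    latin1_size = len(text.encode("iso-8859-1"))  # one byte per character
    utf8_size = len(text.encode("utf-8"))
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    growth = 100.0 * (utf8_size - latin1_size) / latin1_size
    return non_ascii, len(text), growth


# Short stand-in sample (an assumption for illustration, not the full excerpt).
sample = "La séptima temporada estuvo nominada para dos Premios Primetime Emmy"
count, total, growth = utf8_expansion(sample)
print(f"{count} non-ASCII of {total} characters, {growth:.1f}% larger in UTF-8")
```

Because every character in U+0080 to U+00FF costs exactly one extra byte, the growth is simply the share of non-ASCII characters: 17 / 1044 is roughly 1.6%.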

How would your idea work?

Are you going to encode all your data in windows-1252? That does not give you globalization; the globe does not stop at the Oder River. True ISO-8859-1 (which lacks the euro sign) is even worse; the globe does not stop at the English Channel.
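
Both limits are easy to demonstrate. The Python sketch below uses a hypothetical helper can_encode; "cp1252" is Python's codec name for windows-1252, and the Polish letter ł is just one example of a character from east of the Oder.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# é is in all three; the euro sign is missing from ISO-8859-1;
# the Polish letter ł is missing from both single-byte encodings.
for ch in ["é", "€", "ł"]:
    support = {enc: can_encode(ch, enc) for enc in ("iso-8859-1", "cp1252", "utf-8")}
    print(ch, support)
```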

Tag text with its encoding? This works for XML, HTML and SMTP. But you asked:

Does it play a crucial role in database storage?

How are you going to store mixed Latin-1 and UTF-8 strings in a database?

Have two columns, EncodedText BLOB and IsUtf8 BOOLEAN? How are you going to query that? You certainly cannot just look at EncodedText and ignore IsUtf8; that approach leads to mojibake.

You could write a view with CASE WHEN IsUtf8 THEN EncodedText ELSE Latin1ToUtf8(EncodedText) END and a matching INSTEAD OF INSERT trigger, but that may cost you more bytes than it saves.
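
To see what goes wrong when the flag is ignored, here is a small Python sketch of the same decision the CASE expression makes. The helper to_text is hypothetical and works on in-memory bytes rather than a real database row.

```python
def to_text(encoded: bytes, is_utf8: bool) -> str:
    """Decode a stored value according to its encoding flag.

    Hypothetical helper mirroring the CASE WHEN IsUtf8 ... expression.
    """
    return encoded.decode("utf-8" if is_utf8 else "iso-8859-1")


stored = "año".encode("utf-8")  # a row written with IsUtf8 = TRUE

print(to_text(stored, is_utf8=True))   # año  (flag respected)
print(to_text(stored, is_utf8=False))  # aÃ±o (flag ignored: mojibake)
```

Every reader of the table has to go through such a step, which is the overhead a single encoding avoids.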


Source: https://habr.com/ru/post/1385009/

