I would like to understand how a smart quote from Windows turns into "â€™"

Here's the workflow:

  • the user types in Word; Word changes a single apostrophe to a "smart quote"
  • the user pastes the text from Word into a form on a web page; the page containing the form is encoded in UTF-8
  • the data is stored in a latin1-encoded MySQL database
  • when the data is retrieved from the database by a PHP application (which assumes a UTF-8 encoded database) and displayed on a UTF-8 web page, the quote is displayed as "â€™"

I understand that there is a mismatch between the encoding of the input and output pages and the database. I will fix it.

Should a character survive a trip to and from the database?

And how does a single character (0x92, if I'm not mistaken) go through this process and come out the other end as three characters?

Can someone tell me what happens with bytes at every step of the process?

1 answer

Step 1:

Word converts ' to ’ (Unicode code point U+2019, RIGHT SINGLE QUOTATION MARK).

Step 2:

’ is encoded in UTF-8 as the three bytes E2 80 99.
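This can be checked with a short Python sketch (Python is used here only for illustration; the setup in the question is PHP and MySQL). It also shows where the 0x92 from the question comes from: that is the byte for the same character in Windows-1252, not in UTF-8.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the "smart" apostrophe Word inserts
smart_quote = "\u2019"

# A UTF-8 form submission carries three bytes for this one character
utf8_bytes = smart_quote.encode("utf-8")
print(utf8_bytes.hex(" "))    # e2 80 99

# In Windows-1252 the same character is the single byte 0x92
cp1252_bytes = smart_quote.encode("cp1252")
print(cp1252_bytes.hex(" "))  # 92
```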

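Step 3:

This seems to be where the problem arises. It looks like the UTF-8 string is saved, without conversion, into the latin1-encoded MySQL field, so each of the three bytes is treated as a separate character:

E2 80 99 read as latin1 is â€™ (MySQL's latin1 is effectively Windows-1252, where E2 is â, 80 is € and 99 is ™).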
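The misreading can be reproduced with a small Python sketch, assuming MySQL's latin1 behaves like Python's `cp1252` codec; the variable names are only illustrative.

```python
# The three UTF-8 bytes of U+2019 as they arrive at the database
stored_bytes = bytes([0xE2, 0x80, 0x99])

# Reading each byte as a separate Windows-1252 character yields three characters
misread = stored_bytes.decode("cp1252")
print(misread)                                # â€™
print([f"U+{ord(c):04X}" for c in misread])   # ['U+00E2', 'U+20AC', 'U+2122']
```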

Step 4:

Either here or in the previous step, the string that was wrongly treated as latin1 is converted to UTF-8.

â€™ in UTF-8 is C3 A2 E2 82 AC E2 84 A2.

This will appear on the UTF-8 encoded web page as â€™.
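As a rough end-to-end check, the whole trip can be replayed in Python, again assuming that MySQL's latin1 corresponds to Python's `cp1252` codec. The last lines show that undoing the wrong step recovers the original character, which is only a sketch of a repair and not a substitute for fixing the connection and column encodings.

```python
original = "\u2019"  # the smart quote typed in Word

# Steps 2-3: the UTF-8 bytes are stored, then misread as Windows-1252
misread = original.encode("utf-8").decode("cp1252")

# Step 4: the misread text is encoded as UTF-8 for the web page
page_bytes = misread.encode("utf-8")
print(page_bytes.hex(" "))   # c3 a2 e2 82 ac e2 84 a2

# A browser decoding the page as UTF-8 shows three characters
displayed = page_bytes.decode("utf-8")
print(displayed)             # â€™

# Reversing the wrong step recovers the original character
repaired = displayed.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```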


Source: https://habr.com/ru/post/1435043/

