Arabic character encoding: UTF-8 compared to Windows-1256

Quick background: I inherited a large SQL dump file containing a mix of English and Arabic text, and (I think) it was originally exported using "latin1". I changed all occurrences of "latin1" to "utf8" before importing the file. The Arabic text did not display correctly in phpMyAdmin (which I assume is normal), but when I loaded the text into a web page with the following ...

<meta http-equiv='Content-Type' content='text/html; charset=windows-1256'/> 

... everything looked good, and the Arabic text displayed perfectly.


Problem: my client is really picky and wants me to change it to ...

 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> 

... the UTF-8 equivalent of the "Windows-1256" declaration. I did not think this would be a problem, but when I changed the charset value to "UTF-8", all Arabic characters appeared as diamonds with question marks. Shouldn't UTF-8 display Arabic text correctly?


Here are some notes about my database configuration:

  • Database charset: "utf8"
  • Database connection collation: "utf8_general_ci"
  • All databases, tables, and applicable fields are collated as "utf8_general_ci"

I scoured Stack Overflow and other forums for anything related to my problem. I found similar problems, but none of the solutions seem to work for my specific situation. I hope someone can help!

+6
4 answers

If the document looks correct when it is declared as Windows-1256 encoded, then it most likely is Windows-1256 encoded. So apparently it was not exported using latin1, which would have been impossible anyway, since latin1 has no Arabic letters.

If this is just one file, the easiest way is to convert it from Windows-1256 to UTF-8 using, for example, Notepad++. (Open the file in it and set the encoding, via the "File Format" menu, to Arabic, Windows-1256. Then select "Convert to UTF-8" in the "File Format" menu and choose "File" → "Save".)

Windows-1256 and UTF-8 are completely different encodings, so the data gets corrupted if you declare Windows-1256 data as UTF-8 or vice versa. Only ASCII characters, such as English letters, have the same representation in both encodings.
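
To make the mismatch concrete, here is a small Python sketch. Python is used only for illustration, and the sample word and variable names are my own assumptions, not something from the question. It encodes the same Arabic word in both encodings, then decodes the Windows-1256 bytes as if they were UTF-8, which is roughly what a browser does when the meta tag says UTF-8 but the bytes are really Windows-1256.

```python
# Illustrative sketch: the same Arabic word as bytes in two encodings.
# The word is spelled with \N{...} escapes so this source stays pure ASCII.
word = (
    "\N{ARABIC LETTER ALEF}\N{ARABIC LETTER LAM}\N{ARABIC LETTER AIN}"
    "\N{ARABIC LETTER REH}\N{ARABIC LETTER BEH}\N{ARABIC LETTER YEH}"
    "\N{ARABIC LETTER TEH MARBUTA}"
)

cp1256_bytes = word.encode("cp1256")  # Windows-1256: one byte per letter
utf8_bytes = word.encode("utf-8")     # UTF-8: two bytes per Arabic letter

# Treat the Windows-1256 bytes as UTF-8. They are not valid UTF-8, so a
# lenient decoder substitutes U+FFFD, the replacement character.
mislabeled = cp1256_bytes.decode("utf-8", errors="replace")

# ASCII text is byte-for-byte identical in both encodings.
ascii_same = "Hello".encode("cp1256") == "Hello".encode("utf-8")

print(cp1256_bytes.hex(" "))
print(utf8_bytes.hex(" "))
print([hex(ord(ch)) for ch in mislabeled])
print(ascii_same)
```

U+FFFD is the character that browsers usually draw as a diamond with a question mark.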

+3

We cannot find an error in your code if you do not show us your code, so we are very limited in how we can help you.

You told the browser to interpret the document as UTF-8 instead of Windows-1256, but did you actually change the encoding of the document itself from Windows-1256 to UTF-8?

For instance,

    $ cat a.pl
    use strict;
    use warnings;
    use feature qw( say );
    use charnames ':full';

    my $enc = $ARGV[0] or die;
    binmode STDOUT, ":encoding($enc)";
    print <<"__EOI__";
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=$enc">
    <title>Foo!</title>
    </head>
    <body dir="rtl">
    \N{ARABIC LETTER ALEF}\N{ARABIC LETTER LAM}\N{ARABIC LETTER AIN}\N{ARABIC LETTER REH}\N{ARABIC LETTER BEH}\N{ARABIC LETTER YEH}\N{ARABIC LETTER TEH MARBUTA}
    </body>
    </html>
    __EOI__

    $ perl a.pl UTF-8 > utf8.html
    $ perl a.pl Windows-1256 > cp1256.html

+2

I think you need to go back to square one. It sounds like you have a database dump encoded in Windows-1256, and you want to work with it in UTF-8 from now on. It also sounds like you are using PHP, but your question has a lot of unnecessary tags and is missing the most important one, PHP.

First you need to convert the text dump to UTF-8, and you can do that with PHP. Most likely, your conversion script will have two steps: first read the Windows-1256 bytes and decode them into internal Unicode text strings, then encode those Unicode strings as UTF-8 bytes and write them to a new text file.
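
The two steps can be sketched as follows. This is a minimal illustration in Python rather than PHP, and the function name `transcode_file` and the file names in the usage line are hypothetical. The same decode-then-encode idea applies in any language.

```python
def transcode_file(src_path, dst_path,
                   src_encoding="cp1256", dst_encoding="utf-8"):
    """Decode src_path from src_encoding and rewrite it in dst_encoding."""
    # newline="" leaves the dump's original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        # Step 1: reading decodes Windows-1256 bytes into Unicode strings.
        # Step 2: writing encodes those strings as UTF-8 bytes.
        # Work in chunks so a large dump does not have to fit in memory.
        while True:
            chunk = src.read(1024 * 1024)
            if not chunk:
                break
            dst.write(chunk)
```

A call such as `transcode_file("dump_cp1256.sql", "dump_utf8.sql")` would write a UTF-8 copy and leave the original dump untouched, so the conversion can be repeated if the source encoding turns out to be something else.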

Once you have done this, re-import the database as before. This time the input really is encoded as UTF-8.

After that, it should be as simple as reading a database and rendering a web page with the correct UTF-8 encoding.

P.S. You could in fact transcode the data every time you display it, but that does not solve the problem of having a database filled with incorrectly encoded data.

+2

To display Arabic characters correctly, you need to convert your PHP file to UTF-8 without a BOM. This happened to me: Arabic characters were displayed as diamonds, and converting the file to UTF-8 without a BOM solved the problem.
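
A UTF-8 BOM is the three bytes EF BB BF at the very start of a file. As a rough sketch (in Python, purely for illustration; the helper name `strip_utf8_bom` is my own), this is one way to check for those bytes and drop them:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        return data[len(codecs.BOM_UTF8):]
    return data
```

The function works on raw bytes, so it only removes the marker and never re-encodes the rest of the file.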

0

Source: https://habr.com/ru/post/904705/

