Is there a way to detect and remove (or correct) bad characters resulting from failed encoding conversions?

I am writing a parser. I have taken care of all the encoding conversions so that the output is correct UTF-8, but sometimes the source material itself is wrong. Strings such as â€tm are the result of bad encoding.

I know this is a long shot, but does anyone know of a list of common strings produced by bad character conversions, or something similar, so that I don't have to build my own list?

Yes, I know I'm being lazy, but I read somewhere that that makes me a good programmer.

1 answer

tl;dr: See the last two paragraphs.


I hate/love encoding issues.

We're looking at a mangled copy of the Unicode character 'RIGHT SINGLE QUOTATION MARK' (U+2019). The UTF-8 byte sequence for this character is 0xE2 0x80 0x99. In Windows-1252, these bytes map to a-hat (â), the euro sign (€), and the trademark sign (™). The "tm" we see is a further transliteration of that trademark sign into ASCII t and ASCII m, 0x74 0x6D, which makes our final damaged byte sequence 0xE2 0x80 0x74 0x6D.
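The chain of damage described above can be reproduced in a few lines. This is only a sketch: it assumes an `iconv` build that knows the name `Windows-1252` (as the snippet further down also does), and the `str_replace` step merely imitates the "tm" transliteration rather than whatever tool actually caused it.

```php
<?php
// UTF-8 bytes of U+2019 RIGHT SINGLE QUOTATION MARK.
$quoteBytes = "\xE2\x80\x99";

// Misread those three bytes as Windows-1252 and re-encode the result as UTF-8.
$misread = iconv('Windows-1252', 'UTF-8', $quoteBytes);

// Imitate the later step in which the trademark sign degraded to ASCII "tm".
$degraded = str_replace("\u{2122}", 'tm', $misread);

echo $misread, "\n";  // a-hat, euro sign, trademark sign
echo $degraded, "\n"; // a-hat, euro sign, "tm"
```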

Most likely, the a-hat-euro-tm you actually see is itself already encoded as UTF-8. That is, the a-hat is a UTF-8 sequence and the euro sign is also a UTF-8 sequence, because someone copied from a Windows-1252 document that was already wrongly encoded and pasted into a UTF-8 document. You will find more bytes than just the four from the original damage.
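A hex dump makes the extra bytes visible. A small sketch, assuming the damaged text is exactly a-hat, euro sign, t, m stored as UTF-8:

```php
<?php
// "a-hat, euro sign, t, m" as it would sit inside a UTF-8 document.
$seen = "\u{00E2}\u{20AC}tm";

// a-hat is C3 A2, the euro sign is E2 82 AC, and "tm" is 74 6D.
echo bin2hex($seen), "\n"; // c3a2e282ac746d
echo strlen($seen), "\n";  // 7 (strlen counts bytes, not characters)
```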

One way to solve this problem is to first translate the UTF-8 encoding of these characters into Windows-1252, and then treat this Windows-1252 string as UTF-8 when writing it.

You can use iconv with the //TRANSLIT flag for this purpose:

 $less_bad = iconv('UTF-8', 'Windows-1252//TRANSLIT', $bad); 

This tells iconv to try turning any characters that cannot be represented in Windows-1252 into something similar. This translation is imperfect and will destroy any legitimate UTF-8 characters that are not present in Windows-1252.

Once you have the Windows-1252 string, store it and serve it as UTF-8. If all goes well, the corruption should disappear and you should have no further problems.

Yeah, right.

In this particular case, the final byte of the correct sequence, 0x99, was mangled into two bytes by the bad copy/paste. You won't get it back by jumping through character set conversion hoops.
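The round trip, and the way it fails for the "tm" case, can be sketched as follows. `repair_mojibake` is a hypothetical helper name, and this variant deliberately departs from the snippet above: it uses a strict conversion instead of `//TRANSLIT` and leaves the string alone whenever the round trip does not yield valid UTF-8, so legitimate characters are not destroyed.

```php
<?php
// Hypothetical helper: undo one layer of "UTF-8 bytes misread as Windows-1252".
function repair_mojibake(string $bad): string
{
    // Strict conversion (no //TRANSLIT): returns false if any character
    // has no Windows-1252 equivalent. The @ hides the resulting notice.
    $bytes = @iconv('UTF-8', 'Windows-1252', $bad);
    if ($bytes === false) {
        return $bad;
    }

    // Accept the result only if the recovered bytes are valid UTF-8.
    // preg_match() with the /u modifier fails on malformed UTF-8 subjects.
    if (preg_match('//u', $bytes) !== 1) {
        return $bad;
    }

    return $bytes;
}

// Intact damage (a-hat, euro sign, trademark sign) round-trips to U+2019.
echo repair_mojibake("\u{00E2}\u{20AC}\u{2122}"), "\n";

// With the trademark sign degraded to "tm", the bytes E2 80 74 6D are not
// valid UTF-8, so the input comes back unchanged.
echo repair_mojibake("\u{00E2}\u{20AC}tm"), "\n";
```

Returning the input untouched on failure is a design choice for this sketch; it trades coverage for the guarantee that correctly encoded text is never altered.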

While jumping through those hoops may work for some documents, you will surely find plenty of text that has been transcoded even more badly. The best approach is a search and replace at the byte level: look for the wrongly encoded sequences and replace them with either a plain ASCII alternative or correctly encoded UTF-8. There are many ways for the encoding to go wrong. For example, if the source of the corruption was in the ISO-8859 family, the final damaged sequence would be different, and in some places the final ™ may not have been turned into t and m at all.

A byte-level search and replace is guaranteed to act only on the wrongly encoded sequences, with no risk of breaking correctly encoded UTF-8 characters that cannot be represented in lesser character sets. It is both safer and faster.
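A byte-level replacement of this kind might look like the sketch below. The map and the name `fix_known_sequences` are hypothetical; the two entries are derived only from the example discussed here and are nowhere near a complete list.

```php
<?php
// Illustrative map only: keys are damaged byte sequences as found in a
// UTF-8 document, values are the intended UTF-8 text.
const KNOWN_DAMAGE = [
    "\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2" => "\u{2019}", // a-hat, euro, trademark sign
    "\xC3\xA2\xE2\x82\xACtm"           => "\u{2019}", // a-hat, euro, ASCII "tm"
];

function fix_known_sequences(string $text): string
{
    // strtr() with an array tries the longest key first and never
    // rescans text it has already replaced.
    return strtr($text, KNOWN_DAMAGE);
}

// Prints the sentence with a proper right single quotation mark.
echo fix_known_sequences("It\u{00E2}\u{20AC}tms fine"), "\n";
```

Because only exact byte sequences are matched, a legitimate euro sign or accented letter elsewhere in the text passes through unchanged.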


edit: I didn't actually realize that you were already planning to do this. ;) Unfortunately, I have never seen such a handy list. Perhaps you should build one and publish your work so that others can benefit. yourcharacterencodingsucks.com is available!


Source: https://habr.com/ru/post/1342217/