Best way to fix corrupted data caused by wrong character encoding

I have a dataset with garbled text fields caused by encoding errors during many import/export operations from one database to another. Most of the errors were caused by converting UTF-8 to ISO-8859-1. Oddly enough, the errors are inconsistent: the word "München" shows up in some places as "MÃ¼nchen" and elsewhere as "MÜnchen".

Is there a trick on the SQL server to fix this kind of mess? The first thing that comes to mind is to use COLLATE so that Ã¼ is interpreted as ü, but I don't know exactly how to do it. If this cannot be done at the database level, do you know of any tool that helps with mass correction? (Not a manual search-and-replace tool, but one that somehow detects the distorted text and corrects it.)
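For the simple case where UTF-8 text was decoded as ISO-8859-1 exactly once, the damage is mechanically reversible; a minimal PHP sketch, assuming the garbled value arrives as a UTF-8 string:

<?php
// Sketch: undo one UTF-8 -> ISO-8859-1 mis-decode.
// "MÃ¼nchen" is what "München" becomes when its UTF-8 bytes
// (0xC3 0xBC for "ü") are read as ISO-8859-1.
$garbled = "MÃ¼nchen";

// Encoding the mojibake back to ISO-8859-1 yields the original
// UTF-8 byte sequence for "München".
$fixed = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo $fixed, "\n"; // prints "München" on a UTF-8 terminal

This only covers a single, consistent mis-conversion; rows that were mangled twice, or with different encodings, are exactly why the heuristic approaches in the answers below exist.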

+4
4 answers

I have been in the same position. The MySQL server was configured for latin1, the old data was latin1, newer data was UTF-8 but stored in latin1 columns, then utf8 columns were added on top... Any given row can contain any number of encodings.

The big problem is that there is no single solution that fixes everything, because many legacy encodings use the same bytes for different characters. That means you have to resort to heuristics. My Utf8Voodoo class has a huge array covering the bytes from 127 to 255, i.e. the non-ASCII characters of the legacy single-byte encodings.

// ISO-8859-15 has the Euro sign, but ISO-8859-1 has also been used on the
// site. Sigh. Windows-1252 has the Euro sign at 0x80 (and other printable
// characters in 0x80-0x9F), but mb_detect_encoding never returns that
// encoding when ISO-8859-* is in the detect list, so we cannot use it.
// CP850 has accented letters and currency symbols in 0x80-0x9F. It occurs
// just a few times, but enough to make it pretty much impossible to
// automagically detect exactly which non-ISO encoding was used. Hence the
// need for "likely bytes" in addition to the "magic bytes" below.

/**
 * This array contains the magic bytes that determine possible encodings.
 * It works by elimination: the most specific byte patterns (the array's
 * keys) are listed first. When a match is found, the possible encodings
 * are that entry's value.
 */
public static $legacyEncodingsMagicBytes = array(
    '/[\x81\x8D\x8F\x90\x9D]/' => array('CP850'),
    '/[\x80\x82-\x8C\x8E\x91-\x9C\x9E\x9F]/' => array('Windows-1252', 'CP850'),
    '/./' => array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850'),
);

/**
 * This array contains the bytes that make it more likely for a string to
 * be a certain encoding. The keys are the pattern, the values are arrays
 * with (encoding => likeliness score modifier).
 */
public static $legacyEncodingsLikelyBytes = array(
    // Byte | ISO-1 | ISO-15 | W-1252 | CP850
    // 0x80 |   -   |   -    |   €    |   Ç
    '/\x80/' => array(
        'Windows-1252' => +10,
    ),

    // Byte | ISO-1 | ISO-15 | W-1252 | CP850
    // 0x93 |   -   |   -    |   “    |   ô
    // 0x94 |   -   |   -    |   ”    |   ö
    // 0x95 |   -   |   -    |   •    |   ò
    // 0x96 |   -   |   -    |   –    |   û
    // 0x97 |   -   |   -    |   —    |   ù
    // 0x99 |   -   |   -    |   ™    |   Ö
    '/[\x93-\x97\x99]/' => array(
        'Windows-1252' => +1,
    ),

    // Byte | ISO-1 | ISO-15 | W-1252 | CP850
    // 0x86 |   -   |   -    |   †    |   å
    // 0x87 |   -   |   -    |   ‡    |   ç
    // 0x89 |   -   |   -    |   ‰    |   ë
    // 0x8A |   -   |   -    |   Š    |   è
    // 0x8C |   -   |   -    |   Œ    |   î
    // 0x8E |   -   |   -    |   Ž    |   Ä
    // 0x9A |   -   |   -    |   š    |   Ü
    // 0x9C |   -   |   -    |   œ    |   £
    // 0x9E |   -   |   -    |   ž    |   ×
    '/[\x86\x87\x89\x8A\x8C\x8E\x9A\x9C\x9E]/' => array(
        'Windows-1252' => -1,
    ),

    // Byte | ISO-1 | ISO-15 | W-1252 | CP850
    // 0xA4 |   ¤   |   €    |   ¤    |   ñ
    '/\xA4/' => array(
        'ISO-8859-15' => +10,
    ),

    // Byte | ISO-1 | ISO-15 | W-1252 | CP850
    // 0xA6 |   ¦   |   Š    |   ¦    |   ª
    // 0xBD |   ½   |   œ    |   ½    |   ¢
    '/[\xA6\xBD]/' => array(
        'ISO-8859-15' => -1,
    ),

    // Byte | ISO-1 | ISO-15 | W-1252 | CP850
    // 0x82 |   -   |   -    |   ‚    |   é
    // 0xA7 |   §   |   §    |   §    |   º
    // 0xFD |   ý   |   ý    |   ý    |   ²
    '/[\x82\xA7\xCF\xFD]/' => array(
        'CP850' => +1
    ),

    // Byte | ISO-1 | ISO-15 | W-1252 | CP850
    // 0x91 |   -   |   -    |   ‘    |   æ
    // 0x92 |   -   |   -    |   ’    |   Æ
    // 0xB0 |   °   |   °    |   °    |   ░
    // 0xB1 |   ±   |   ±    |   ±    |   ▒
    // 0xB2 |   ²   |   ²    |   ²    |   ▓
    // 0xB3 |   ³   |   ³    |   ³    |   │
    // 0xB9 |   ¹   |   ¹    |   ¹    |   ╣
    // 0xBA |   º   |   º    |   º    |   ║
    // 0xBB |   »   |   »    |   »    |   ╗
    // 0xBC |   ¼   |   Œ    |   ¼    |   ╝
    // 0xC1 |   Á   |   Á    |   Á    |   ┴
    // 0xC2 |   Â   |   Â    |   Â    |   ┬
    // 0xC3 |   Ã   |   Ã    |   Ã    |   ├
    // 0xC4 |   Ä   |   Ä    |   Ä    |   ─
    // 0xC5 |   Å   |   Å    |   Å    |   ┼
    // 0xC8 |   È   |   È    |   È    |   ╚
    // 0xC9 |   É   |   É    |   É    |   ╔
    // 0xCA |   Ê   |   Ê    |   Ê    |   ╩
    // 0xCB |   Ë   |   Ë    |   Ë    |   ╦
    // 0xCC |   Ì   |   Ì    |   Ì    |   ╠
    // 0xCD |   Í   |   Í    |   Í    |   ═
    // 0xCE |   Î   |   Î    |   Î    |   ╬
    // 0xD9 |   Ù   |   Ù    |   Ù    |   ┘
    // 0xDA |   Ú   |   Ú    |   Ú    |   ┌
    // 0xDB |   Û   |   Û    |   Û    |   █
    // 0xDC |   Ü   |   Ü    |   Ü    |   ▄
    // 0xDF |   ß   |   ß    |   ß    |   ▀
    // 0xE7 |   ç   |   ç    |   ç    |   þ
    // 0xE8 |   è   |   è    |   è    |   Þ
    '/[\x91\x92\xB0-\xB3\xB9-\xBC\xC1-\xC5\xC8-\xCE\xD9-\xDC\xDF\xE7\xE8]/' => array(
        'CP850' => -1
    ),

    /* etc. */

Then you loop over the bytes (not the characters) in the strings and keep score. Let me know if you would like more information.
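To give an idea of how those two arrays could be consulted, here is a simplified sketch; the function name and the exact bookkeeping are my own illustration, not the real Utf8Voodoo code:

<?php
// Simplified sketch of the scoring pass described above.
function guessLegacyEncoding(string $string): string
{
    // 1. Narrow down the candidates using the "magic bytes": the first
    //    (most specific) pattern that matches decides which encodings
    //    are still possible.
    $candidates = array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850');
    foreach (Utf8Voodoo::$legacyEncodingsMagicBytes as $pattern => $encodings) {
        if (preg_match($pattern, $string)) {
            $candidates = $encodings;
            break;
        }
    }

    // 2. Score the remaining candidates with the "likely bytes"
    //    modifiers, counting byte (not character) matches.
    $scores = array_fill_keys($candidates, 0);
    foreach (Utf8Voodoo::$legacyEncodingsLikelyBytes as $pattern => $modifiers) {
        $hits = preg_match_all($pattern, $string);
        if (!$hits) {
            continue;
        }
        foreach ($modifiers as $encoding => $modifier) {
            if (isset($scores[$encoding])) {
                $scores[$encoding] += $hits * $modifier;
            }
        }
    }

    // 3. The highest-scoring candidate wins.
    arsort($scores);
    return (string) key($scores);
}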

+4

Download iconv; you can get binaries for Win32 as well as Unix/Linux. It is a command-line tool that takes a source file and, once you specify the input and output encodings, writes the converted result to STDOUT.

I use this very heavily to convert from UTF-16 (as output from SQL Server 2005 export files) to ASCII.
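A typical invocation for that case looks something like this (file names are placeholders; //TRANSLIT asks iconv to approximate characters that have no ASCII equivalent):

iconv -f UTF-16 -t ASCII//TRANSLIT export.txt > export-ascii.txt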

You can download it here: http://gnuwin32.sourceforge.net/packages/libiconv.htm

+1

Given the complexity of the data (multiple encodings within a single line/record), I think you will have to export/dump the data and then do the processing outside the database.

I think the best way is a sequence of hand-crafted replacements. Perhaps some spell-checking code could find the errors; then you add explicit correction code for them. Then try again until the spell check stops finding errors?

(Obviously, add the correct words to the spell checker's dictionary.)
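A sketch of what such a correction pass could look like in PHP, using the two broken forms from the question; the replacement table is illustrative only and would grow as the spell check turns up more broken words:

<?php
// Illustrative correction table: known-bad sequences => intended text.
// In practice this grows iteratively as the spell check finds new errors.
$corrections = array(
    'MÃ¼nchen' => 'München',
    'MÜnchen'  => 'München',
);

// strtr() with an array tries the longest matching keys first, which
// avoids partially rewriting an already-correct word.
function correctText(string $text, array $corrections): string
{
    return strtr($text, $corrections);
}

echo correctText('Hotels in MÃ¼nchen und MÜnchen', $corrections), "\n";
// Hotels in München und München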

0

Take a look at https://github.com/LuminosoInsight/python-ftfy. It does a good job of heuristically fixing mojibake, and it handles considerably uglier damage than what you would expect from looking at small samples of your data.

0
