How to remove characters that are not supported by the MySQL utf8 character set?

How can I remove characters from a string that are not supported by the MySQL utf8 character set? In other words, characters that take four bytes in UTF-8, such as "𝜀", which are only supported by the MySQL utf8mb4 character set.

For instance,

𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰ 

should become

 C = -2.4‰ ± 0.3‰; H = -57‰ 

I want to load a data file into a MySQL table with CHARSET=utf8.

1 answer

MySQL's utf8mb4 encoding is what the rest of the world calls UTF-8.

MySQL's utf8 is a subset of UTF-8 that only supports characters in the BMP (the Basic Multilingual Plane, meaning characters U+0000 to U+FFFF inclusive).
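To make the byte-length distinction concrete, here is a minimal sketch that counts the UTF-8 bytes of one BMP character and one non-BMP character. The helper name `utf8_length` is made up for this illustration; `Encode::encode` is the standard core function.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes one character occupies in UTF-8.
sub utf8_length {
    my ($char) = @_;
    return length(encode('UTF-8', $char));
}

# U+2030 PER MILLE SIGN lies inside the BMP; U+1D700 lies outside it.
for my $cp (0x2030, 0x1D700) {
    printf "U+%04X: %d bytes\n", $cp, utf8_length(chr($cp));
}
# Prints:
# U+2030: 3 bytes
# U+1D700: 4 bytes
```

Characters up to U+FFFF never need more than three bytes, which is why a three-byte character set cannot store anything above that range.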

So the following will match the unsupported characters in question:

 /[^\N{U+0000}-\N{U+FFFF}]/ 
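Before deciding how to handle the offending characters, it can help to see which ones actually occur in the data. The following is only a sketch: `non_bmp_chars` is a hypothetical helper built around the pattern above, and the sample string is the one from the question written with escapes.

```perl
use strict;
use warnings;
use 5.010;              # for the // operator
use charnames ':full';  # loads charnames::viacode

# Hypothetical helper: distinct characters in $text that lie outside the BMP.
sub non_bmp_chars {
    my ($text) = @_;
    my %seen;
    return grep { !$seen{$_}++ } $text =~ /([^\N{U+0000}-\N{U+FFFF}])/g;
}

my $text = "\N{U+1D700}C = -2.4\N{U+2030}; \N{U+1D700}H = -57\N{U+2030}";

# Report each offending character by code point and Unicode name.
for my $char (non_bmp_chars($text)) {
    printf "U+%04X %s\n", ord($char), charnames::viacode(ord $char) // '(unnamed)';
}
# Prints:
# U+1D700 MATHEMATICAL ITALIC SMALL EPSILON
```

The names reported this way are the same ones that can be used as keys in a translation table.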

Here are three different approaches you can use to sanitize the input:

1: Remove unsupported characters:

 s/[^\N{U+0000}-\N{U+FFFF}]//g; 

2: Replace unsupported characters with U+FFFD (the replacement character):

 s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g; 

3: Replace unsupported characters using a translation map:

 my %translations = (
     "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
     # ...
 );

 s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;

For instance,

 use utf8;                              # Source code is encoded using UTF-8.
 use open ':std', ':encoding(UTF-8)';   # Terminal and files use UTF-8.

 use strict;
 use warnings;
 use 5.010;                             # say, //
 use charnames ':full';                 # Not needed in 5.16+

 my %translations = (
     "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
     # ...
 );

 $_ = "𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰";
 say;

 s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
 say;

Output:

 𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
 εC = -2.4‰ ± 0.3‰; εH = -57‰
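Since the question is about loading a data file, the same substitution can be applied line by line while copying the file. This is a rough sketch under a few assumptions: the input is valid UTF-8, plain removal (the first approach) is the desired behaviour, and the function name `strip_non_bmp` and the command-line file arguments are invented for the example.

```perl
use strict;
use warnings;

# Copy lines from $in to $out, dropping every character above U+FFFF.
# Both handles are expected to carry an :encoding(UTF-8) layer.
sub strip_non_bmp {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        $line =~ s/[^\N{U+0000}-\N{U+FFFF}]//g;
        print {$out} $line;
    }
    return;
}

# Hypothetical usage: perl strip.pl input.txt output.txt
my ($src, $dst) = @ARGV;
if (defined $src && defined $dst) {
    open(my $in,  '<:encoding(UTF-8)', $src) or die "Cannot read $src: $!";
    open(my $out, '>:encoding(UTF-8)', $dst) or die "Cannot write $dst: $!";
    strip_non_bmp($in, $out);
    close($out) or die "Cannot close $dst: $!";
}
```

The cleaned output file can then be loaded into the utf8 table. Swapping the substitution for one of the other two approaches only changes the single `s///` line.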

Source: https://habr.com/ru/post/1013940/

