MySQL's utf8mb4 encoding is what the rest of the world calls UTF-8. MySQL's utf8 is a subset of UTF-8 that only supports characters in the BMP (that is, characters U+0000 to U+FFFF inclusive).
So, the following pattern matches the unsupported characters in question:
/[^\N{U+0000}-\N{U+FFFF}]/
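Before choosing how to sanitize, it can help to check whether a string contains any unsupported characters at all and to see which ones. The following is a minimal sketch, not part of the original answer: the helper names `has_non_bmp` and `non_bmp_code_points` are hypothetical. It assumes the input is a decoded character string; on raw UTF-8 bytes the pattern never matches, because every byte is below U+0100.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the decoded string contains at least
# one character outside the BMP (above U+FFFF), 0 otherwise.
sub has_non_bmp {
    my ($str) = @_;
    return $str =~ /[^\N{U+0000}-\N{U+FFFF}]/ ? 1 : 0;
}

# Hypothetical helper: returns the offending characters as "U+XXXX" strings,
# in the order they appear in the input.
sub non_bmp_code_points {
    my ($str) = @_;
    return map { sprintf("U+%04X", ord($_)) } $str =~ /([^\N{U+0000}-\N{U+FFFF}])/g;
}

# Usage: escapes are used so the example does not depend on source encoding.
my $input = "\x{1D700}C = -2.4\x{2030}";
if (has_non_bmp($input)) {
    print "Unsupported: ", join(", ", non_bmp_code_points($input)), "\n";
}
```

Reporting the code points this way makes it easier to decide which entries a translation map would need.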
Here are three different techniques you can use to sanitize your input:
1: Remove the unsupported characters:
s/[^\N{U+0000}-\N{U+FFFF}]//g;
2: Replace the unsupported characters with U+FFFD:
s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g;
3: Replace the unsupported characters using a translation map:
my %translations = (
    "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
    # ...
);
s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
For instance,
use utf8;                              # Source code is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';   # Terminal and files use UTF-8.

use strict;
use warnings;
use 5.010;              # say, //
use charnames ':full';  # Not needed in 5.16+

my %translations = (
    "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
    # ...
);

$_ = "𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰";
say;
s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
say;
Output:
𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
εC = -2.4‰ ± 0.3‰; εH = -57‰