I am extracting strings from an XML file, and although it should be pure UTF-8, it is not. My idea was to encode the strings as UTF-8 and compare the results:
use warnings;
use strict;
use Encode qw(decode encode);
use Data::Dumper;
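my $x = "m\x{e6}gtig";          # "mægtig" as latin1 bytes
my $y = "m\x{c3}\x{a6}gtig";    # "mægtig" as UTF-8 bytes
my $a = encode('UTF-8', $x);    # latin1 input, encoded once
my $b = encode('UTF-8', $y);    # UTF-8 input, encoded a second time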
print Dumper $x;
print Dumper $y;
print Dumper $a;
print Dumper $b;
if ($x eq $y) { print "1\n"; }
if ($x eq $a) { print "2\n"; }
if ($a eq $y) { print "3\n"; }
if ($a eq $b) { print "4\n"; }
if ($x eq $b) { print "5\n"; }
if ($y eq $b) { print "6\n"; }
outputs
$VAR1 = 'm gtig';
$VAR1 = 'mægtig';
$VAR1 = 'mægtig';
$VAR1 = 'mÃ¦gtig';
3
In theory, only the latin1 string should grow in length, but encoding the string that is already UTF-8 also makes it longer. Therefore, I cannot tell latin1 from UTF-8 this way.
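To make the length observation concrete, here is a small sketch using the same two strings. The helper name `encoded_length` is made up for illustration. Since `encode` treats every element of its input as a character, any value above 0x7F turns into two bytes, whether it was a latin1 character or one byte of an existing UTF-8 sequence.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Both inputs are byte strings: one latin1, one already UTF-8.
my $latin1 = "m\x{e6}gtig";          # 6 bytes
my $utf8   = "m\x{c3}\x{a6}gtig";    # 7 bytes

# Hypothetical helper: byte length after encoding the input as UTF-8.
# Every input value above 0x7F becomes two bytes, whatever it meant before.
sub encoded_length {
    my ($string) = @_;
    return length(encode('UTF-8', $string));
}

printf "%d -> %d\n", length($latin1), encoded_length($latin1);   # 6 -> 7
printf "%d -> %d\n", length($utf8),   encoded_length($utf8);     # 7 -> 9
```

Both strings grow, so comparing lengths before and after encoding says nothing about the original encoding.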
Question
I always want to end up with a UTF-8 string, but how can I determine whether the input is latin1 or UTF-8, so that I only convert the latin1 strings?
Being able to get a yes/no answer to whether a string is UTF-8 would be just as useful.
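For reference, one way such a yes/no check could look is a strict decode attempt with `Encode::FB_CROAK`, which makes `decode` die on malformed input. This is only a sketch: the helper names `is_valid_utf8` and `to_utf8` are hypothetical, and it assumes that anything which is not well-formed UTF-8 is latin1 (ISO-8859-1). It is a heuristic, because some latin1 byte sequences also happen to be well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8,
# 0 otherwise. Pure ASCII counts as valid UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Hypothetical helper: returns UTF-8 bytes, converting only when the input
# is not already well-formed UTF-8. Assumes such input is ISO-8859-1.
sub to_utf8 {
    my ($bytes) = @_;
    return $bytes if is_valid_utf8($bytes);
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

for my $input ("m\x{e6}gtig", "m\x{c3}\x{a6}gtig") {
    printf "%s: %s\n",
        is_valid_utf8($input) ? 'UTF-8' : 'not UTF-8',
        to_utf8($input);
}
```

With the two strings from the test above, the latin1 one is reported as not UTF-8 and gets converted, while the UTF-8 one is passed through unchanged, so both end up as the same bytes.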