Converting a multi-line string with mixed ISO-8859-1 and UTF-8 lines to UTF-8 in Perl

Consider the following task:

A multi-line string $junk contains some lines that are encoded in UTF-8 and some in ISO-8859-1. I do not know a priori which lines are in which encoding, so heuristics are required.

I want to turn $junk into pure UTF-8, correctly re-encoding the ISO-8859-1 lines. In addition, if errors occur during processing, I want to return a "best effort result" rather than throw an error.

My current attempt is as follows:

$junk = force_utf8($junk);

sub force_utf8 {
  my $input = shift;
  my $output = '';
  foreach my $line (split(/\n/, $input)) {
    if (utf8::valid($line)) {
      utf8::decode($line);
    }
    $output .= "$line\n";
  }
  return $output;
}

Obviously, the conversion will never be perfect, since we lack information about the source encoding of each line. But is this the “best effort result” we can get?

Is there any way to improve the heuristic / the force_utf8(...) sub?


There is no way to be certain. For example, Ã© is perfectly valid ISO-8859-1; the same two bytes are also é in UTF-8.
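To illustrate why the encoding of a line cannot always be determined, here is a small Perl sketch that decodes the same two bytes both ways. It assumes only the core Encode module; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode ();

# Two raw bytes, 0xC3 0xA9, with no encoding information attached.
my $bytes = "\xC3\xA9";

# Read as ISO-8859-1, each byte is one character: U+00C3 and U+00A9.
my $as_latin1 = Encode::decode('ISO-8859-1', $bytes);

# Read as UTF-8, the two bytes form a single character: U+00E9.
my $as_utf8 = Encode::decode('UTF-8', $bytes);

printf "ISO-8859-1: %d characters, UTF-8: %d character\n",
    length($as_latin1), length($as_utf8);
```

Both decodes succeed, so any heuristic has to pick one reading; treating valid UTF-8 as UTF-8 is the usual choice.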



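Lines that contain only 7-bit characters are the same in ISO-8859-1 and UTF-8. For lines with 8-bit characters, that is, bytes with the MSb set, check whether the line is valid UTF-8. If it is valid UTF-8, leave it as it is; otherwise assume it is ISO-8859-1. Then convert the ISO-8859-1 lines to UTF-8; $junk will then be pure UTF-8.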
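A minimal Perl sketch of this per-line heuristic, offered as one possible approach rather than a definitive implementation. It assumes $junk holds raw bytes (not an already-decoded character string) and that lines are separated by "\n"; it reuses the name force_utf8 from the question and relies only on the core Encode module.

```perl
use strict;
use warnings;
use Encode ();

# Takes a byte string with mixed UTF-8 / ISO-8859-1 lines, returns UTF-8 bytes.
sub force_utf8 {
    my ($input) = @_;
    my @out;
    # The -1 limit keeps trailing empty fields, so line breaks are preserved.
    for my $line (split /\n/, $input, -1) {
        # Strict decode: croaks on bytes that are not well-formed UTF-8.
        # LEAVE_SRC keeps $line intact for the fallback below.
        my $chars = eval {
            Encode::decode('UTF-8', $line,
                Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        # Fallback: ISO-8859-1 maps every byte to a character, so this cannot fail.
        $chars = Encode::decode('ISO-8859-1', $line) unless defined $chars;
        push @out, Encode::encode('UTF-8', $chars);
    }
    return join "\n", @out;
}
```

Because the fallback never fails, the sub always returns a result instead of throwing. An ISO-8859-1 line whose bytes happen to form valid UTF-8 will still be read as UTF-8, which is the unavoidable limit of the heuristic.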



In short, you can detect the encoding of each line with "file -bi" and convert the ISO-8859-1 lines with "iconv -f ISO-8859-1 -t UTF-8".

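For each line, file reports ISO-8859-1, UTF-8 or ASCII, so a while loop can read the input line by line and convert only the lines that need it.

This is not Perl, but it turns mixed UTF-8 and ISO-8859-1 input into UTF-8.

It works even when the encodings are unevenly mixed (for example, when only 1-2 lines are ISO-8859-1).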

Option No. 1: convert ISO-8859-1 to UTF-8

cat mixed_text.txt |
while IFS= read -r i; do
    # file -bi prints e.g. "text/plain; charset=iso-8859-1"; keep the part after "="
    type=$(echo "$i" | file -bi -)
    type=${type#*=}
    if [[ $type == 'iso-8859-1' ]]; then
        echo "$i" | iconv -f ISO-8859-1 -t UTF-8
    else
        echo "$i"
    fi
done > utf8_text.txt

Option No. 2: convert ISO-8859-1 to ASCII

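cat mixed_text.txt |
while IFS= read -r i; do
    # file -bi prints e.g. "text/plain; charset=iso-8859-1"; keep the part after "="
    type=$(echo "$i" | file -bi -)
    type=${type#*=}
    if [[ $type == 'iso-8859-1' ]]; then
        # //TRANSLIT replaces characters that have no ASCII equivalent with approximations
        echo "$i" | iconv -f ISO-8859-1 -t ASCII//TRANSLIT
    else
        echo "$i"
    fi
done > utf8_text.txt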

Source: https://habr.com/ru/post/1739343/

