Perl - read file using encoding method?

I'm not too good when it comes to encodings, and I want to figure out how to get the data back in the same encoding it started with...

I have a file with some characters, for example '»'. By the time I have edited the data and inserted it into the database, they have turned into &Acirc;&raquo;.

decode_entities() does nothing, and encode_entities() just encodes the characters again. So I wrote my own sub to fix this, but it seems that the data read from the file does not come out in the expected format.

    my $file = "c:/perlscripts/" . md5_hex($md5Con) . "-code.php";
    {
        local( $/ ); # undefine the record separator
        open FILE, "<", $file or die "Cannot open:$!\n";
        my $fileContents = unicodeConvert(<FILE>);
        ... ..

Are there any encoding options, for example:

    my $file = "c:/perlscripts/" . md5_hex($md5Con) . "-code.php";
    {
        local( $/ ); # undefine the record separator
        open FILE, "<", $file or die "Cannot open:$!\n", "UTF-8";
        my $fileContents = unicodeConvert(<FILE>);
        ... ..

And my sub:

    sub unicodeConvert($) {
        my $str = shift;
        my %entityRef = (
            "&" => "&amp;", '¢' => "&cent;", '¤' => "&curren;", '¦' => "&brvbar;",
            '¨' => "&uml;", 'ª' => "&ordf;", '¬' => "&not;", '®' => "&reg;",
            '°' => "&deg;", '²' => "&sup2;", '´' => "&acute;", '¶' => "&para;",
            '¸' => "&cedil;", 'º' => "&ordm;", '¼' => "&frac14;", '¾' => "&frac34;",
            'À' => "&Agrave;", 'Â' => "&Acirc;", 'Ä' => "&Auml;", 'Æ' => "&AElig;",
            'È' => "&Egrave;", 'Ê' => "&Ecirc;", 'Ì' => "&Igrave;", 'Î' => "&Icirc;",
            'Ð' => "&ETH;", 'Ò' => "&Ograve;", 'Ô' => "&Ocirc;", 'Ö' => "&Ouml;",
            'Ø' => "&Oslash;", 'Ú' => "&Uacute;", 'Ü' => "&Uuml;", 'Þ' => "&THORN;",
            'à' => "&agrave;", 'â' => "&acirc;", 'ä' => "&auml;", 'æ' => "&aelig;",
            'è' => "&egrave;", 'ê' => "&ecirc;", 'ì' => "&igrave;", 'î' => "&icirc;",
            'ð' => "&eth;", 'ò' => "&ograve;", 'ô' => "&ocirc;", 'ö' => "&ouml;",
            'ø' => "&oslash;", 'ú' => "&uacute;", 'ü' => "&uuml;", 'þ' => "&thorn;",
            '¡' => "&iexcl;", '£' => "&pound;", '¥' => "&yen;", '§' => "&sect;",
            '©' => "&copy;", '«' => "&laquo;", '¯' => "&macr;", '±' => "&plusmn;",
            '³' => "&sup3;", 'µ' => "&micro;", '·' => "&middot;", '¹' => "&sup1;",
            '»' => "&raquo;", '½' => "&frac12;", '¿' => "&iquest;", 'Á' => "&Aacute;",
            'Ã' => "&Atilde;", 'Å' => "&Aring;", 'Ç' => "&Ccedil;", 'É' => "&Eacute;",
            'Ë' => "&Euml;", 'Í' => "&Iacute;", 'Ï' => "&Iuml;", 'Ñ' => "&Ntilde;",
            'Ó' => "&Oacute;", 'Õ' => "&Otilde;", '×' => "&times;", 'Ù' => "&Ugrave;",
            'Û' => "&Ucirc;", 'Ý' => "&Yacute;", 'ß' => "&szlig;", 'á' => "&aacute;",
            'ã' => "&atilde;", 'å' => "&aring;", 'ç' => "&ccedil;", 'é' => "&eacute;",
            'ë' => "&euml;", 'í' => "&iacute;", 'ï' => "&iuml;", 'ñ' => "&ntilde;",
            'ó' => "&oacute;", 'õ' => "&otilde;", '÷' => "&divide;", 'ù' => "&ugrave;",
            'û' => "&ucirc;", 'ý' => "&yacute;", 'ÿ' => "&yuml;",
        );
        while( ( my $key, my $obj ) = each( %entityRef ) ) {
            if( $key ne '&' ) {
                $str =~ s/$key/$obj/gis
            } else {
                $str =~ s#&((?!(quot;)|(amp;)|(cent;)|(curren;)|(brvbar;)|(uml;)|(ordf;)|(not;)|(reg;)|(deg;)|(sup2;)|(acute;)|(para;)|(cedil;)|(ordm;)|(frac14;)|(frac34;)|(Agrave;)|(Acirc;)|(Auml;)|(AElig;)|(Egrave;)|(Ecirc;)|(Igrave;)|(Icirc;)|(ETH;)|(Ograve;)|(Ocirc;)|(Ouml;)|(Oslash;)|(Uacute;)|(Uuml;)|(THORN;)|(agrave;)|(acirc;)|(auml;)|(aelig;)|(egrave;)|(ecirc;)|(igrave;)|(icirc;)|(eth;)|(ograve;)|(ocirc;)|(ouml;)|(oslash;)|(uacute;)|(uuml;)|(thorn;)|(iexcl;)|(pound;)|(yen;)|(sect;)|(copy;)|(laquo;)|(macr;)|(plusmn;)|(sup3;)|(micro;)|(middot;)|(sup1;)|(raquo;)|(frac12;)|(iquest;)|(Aacute;)|(Atilde;)|(Aring;)|(Ccedil;)|(Eacute;)|(Euml;)|(Iacute;)|(Iuml;)|(Ntilde;)|(Oacute;)|(Otilde;)|(times;)|(Ugrave;)|(Ucirc;)|(Yacute;)|(szlig;)|(aacute;)|(atilde;)|(aring;)|(ccedil;)|(eacute;)|(euml;)|(iacute;)|(iuml;)|(ntilde;)|(oacute;)|(otilde;)|(divide;)|(ugrave;)|(ucirc;)|(yacute;)|(yuml;)|(nbsp;)))#$obj#gis;
            }
        }
        return $str;
    }
+4
2 answers

As noted in the comments on your question, I'm not sure exactly what you are asking.

So I'll assume that you are trying to convert Unicode characters to HTML entities. In that case, you are better off using one of the ready-made modules. If that does not work because of encoding problems (which are quite tricky in Perl), then the answer to your question:

Are there any encoding options, for example:

 open FILE, "<", $file or die "Cannot open:$!\n", "UTF-8"; 

... will probably solve it, and will probably make your own attempt work too, though a ready-made module is still the better choice ;-) (By the way, as you wrote it, the "UTF-8" is passed as an extra argument to die, which made it a little hard to tell what you were asking ;-)

Yes, there is a UTF-8 option if you have a recent perl (>= v5.8):

 open(my $fh,'<:encoding(UTF-8)', $file) or die "Error opening $file: $!"; 

(example adapted from perluniintro)

You can also use binmode to change the encoding of an already open filehandle (e.g. STDIN/STDOUT).
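As a minimal sketch of how that open line fits the slurp-the-whole-file pattern from the question: the helper name read_utf8_file and the temporary file are assumptions made for illustration, not part of the original code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file, decoding it from UTF-8.
sub read_utf8_file {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "Error opening $path: $!";
    local $/;                  # undefine the record separator: read everything
    my $contents = <$fh>;
    close($fh);
    return $contents;
}

# Demo input: a temporary file holding the two UTF-8 bytes of U+00BB (»).
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);              # write raw bytes, no encoding layer
print {$tmp_fh} "\xC2\xBB";
close($tmp_fh) or die "Error closing $tmp_path: $!";

my $text = read_utf8_file($tmp_path);
printf "%d character(s), code point U+%04X\n", length($text), ord($text);
# prints "1 character(s), code point U+00BB"
```

Without the :encoding(UTF-8) layer the same read would return two separate bytes (Â and »), which is what later shows up as &Acirc;&raquo;.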

 binmode(STDOUT, ":encoding(UTF-8)"); 

You can also set the default encoding with the open pragma.

For this case, though, I suggest trying binmode or changing your open line first, to see whether that fixes it.

If your perl is older than v5.8, things get more complicated, but it may still be solvable if you tell us the version.
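A rough sketch of both options, binmode on an open handle and the open pragma, assuming a temporary file as a stand-in for real output (the variable names here are illustrative only):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $char = "\x{BB}";    # the character », as a decoded one-character string

# 1. binmode on an already-open handle: encode to UTF-8 on output.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':encoding(UTF-8)');
print {$tmp_fh} $char;
close($tmp_fh) or die "Error closing $tmp_path: $!";

# Read the raw bytes back to see what actually landed on disk.
open(my $raw, '<:raw', $tmp_path) or die "Error opening $tmp_path: $!";
my $bytes = do { local $/; <$raw> };
close($raw);
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;

# 2. The open pragma: a default layer for open() calls in this lexical scope.
my $text;
{
    use open ':encoding(UTF-8)';
    open(my $in, '<', $tmp_path) or die "Error opening $tmp_path: $!";
    $text = do { local $/; <$in> };
    close($in);
}

print "bytes on disk: $hex\n";                            # C2 BB
print "characters read back: ", length($text), "\n";      # 1
```

The pragma only affects open calls that do not name a layer themselves, so an explicit '<:encoding(...)' in the mode still wins.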

A few other things I noticed, by the way:

  • It doesn't matter here, but it's better to use a lexically scoped filehandle (my $fh instead of FILE).
  • When you end the die message with a newline, it suppresses the line-number information that is normally appended to help you find the problem.
  • If you put the name of the file that could not be opened (or the SQL that failed, or whatever else) in the die message, it will be easier to debug.
  • Do not use sub prototypes in Perl (5) (sub unicodeConvert($)): do not put $ / @ / %, etc. there. A prototype doesn't just check the arguments; it can change how they are interpreted in confusing ways. Prototypes are only needed for creating new built-in-style operators.
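The points in the list above can be seen in a small sketch. The path and the sub names first_plain and first_proto are hypothetical, chosen only to make the behaviour visible.

```perl
use strict;
use warnings;

my $missing = '/nonexistent/hypothetical-file.txt';   # hypothetical path

# Trailing newline: die() does not append "at SCRIPT line N."
eval { open(my $fh, '<', $missing) or die "Cannot open:$!\n" };
my $terse = $@;

# No trailing newline, file name included: much easier to debug.
eval { open(my $fh, '<', $missing) or die "Error opening $missing: $!" };
my $helpful = $@;

print $terse, $helpful;

# A ($) prototype forces scalar context on the argument, so an array
# is passed as its element count instead of being flattened.
sub first_plain    { return $_[0] }
sub first_proto($) { return $_[0] }

my @items = (10, 20, 30);
print first_plain(@items), "\n";   # 10 (the first element)
print first_proto(@items), "\n";   # 3  (the number of elements)
```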
+4

I suspect there is a mismatch between the encoding of your terminal (which may be UTF-8) and that of your Perl script's source code (which you may be editing in an editor set to ISO-8859-1). If you are sure that your terminal and your source code use the same encoding, try adding use utf8; at the top of your script (see man perlunicode). If that does not help, try printing out the data that is actually stored in your database, for example by raising the DBI debug level (this may be unnecessary if you are not storing the data as UTF-8). In general, try to provide:

  • The code page of your terminal (locale), if you run your script from the terminal (or the locale of the system your server uses, if you run it under Apache, for example).
  • The encoding of your source code.
  • The character set of your MySQL connection (do you issue SET NAMES 'utf8'?).
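To see why the source encoding matters, here is a small sketch of what use utf8 changes. It assumes this script itself is saved as UTF-8; with a different file encoding the numbers would differ.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);
{
    use utf8;                        # string literals are decoded from UTF-8 source
    $with_pragma = length('»');
}
{
    no utf8;                         # string literals are taken byte by byte
    $without_pragma = length('»');
}

# Assuming a UTF-8 source file, this reports 1 and 2.
print "with use utf8: $with_pragma, without: $without_pragma\n";
```

A literal that Perl sees as two bytes is exactly the kind of string that turns '»' into two separate entities later on.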

Also, for HTML encoding, it may be easier for you to reuse HTML::Entities::decode() / HTML::Entities::encode() instead of implementing it yourself.
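A short sketch of that suggestion, which also reproduces the symptom from the question. It assumes HTML::Entities is installed (it ships with the HTML-Parser distribution on CPAN, not with core Perl); Encode is a core module.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

my $bytes = "\xC2\xBB";                  # » as raw UTF-8 bytes, i.e. read without decoding
my $chars = decode('UTF-8', $bytes);     # the same data decoded: one character, U+00BB

my $from_bytes = encode_entities($bytes);
my $from_chars = encode_entities($chars);

print "$from_bytes\n";    # &Acirc;&raquo;
print "$from_chars\n";    # &raquo;
```

So the module itself is fine once the input has been decoded; the doubled entities come from encoding undecoded bytes.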

+1

Source: https://habr.com/ru/post/1300602/

