How can I change my regex to read UTF-8?

Question

I got quite far with my script, only to find out that it has problems reading UTF-8 characters.

I have a contact in Sweden who created a VM on his machine with some UTF-8 characters in its name. When my script hit that VM it lost its mind, but it was able to read all the other VMs, whose names use a "normal" encoding.

Anyway, maybe my code will make more sense.

    #!/usr/bin/perl
    use strict;
    use warnings;
    #use utf8;
    use Net::OpenSSH;

    # Create a hash for storing the options needed by Net::OpenSSH
    my %ssh_options = (
        port     => '22',
        user     => 'root',
        password => 'password'
    );

    # Create a new Net::OpenSSH object
    my $ssh = Net::OpenSSH->new('192.168.2.101', %ssh_options);

    # Create an array and capture the ESX\ESXi output from the current server
    my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
    shift @getallvms;

    # Process data gathered from server
    foreach my $vm (@getallvms) {
        # Match ID, NAME
        $vm =~ m/^(?<id> \d+)\s+(?<name> .+?)\s+/xm;
        my $id = "$+{id}";
        my $name = "$+{name}";
        print "$id\n";
        print "$name\n";
        print "\n";
    }

I narrowed the problem down to my regular expression, because this is the raw server output before the regex is applied:

 416 TEST Box åäö!"''*#

And this is what I get after applying my regular expression:

 416 TEST

For some reason the regex doesn't match the whole name, and I just don't know why. The regex in the example above is my third attempt at making it work.

The FULL line that I am matching looks like this. I wrote my regex the way I did because I only need the first two blocks of information, not the entire line.

Code:

 432 TEST Box åäö!"''*# [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx slesGuest vmx-04

+4

regex perl utf-8

ianc1215 Feb 08 '11 at 13:37

4 answers

If you know that the string you are working with is UTF-8 and Net::OpenSSH does not (and therefore does not mark it as such), you can convert it to an internal representation that Perl can work with using one of:

    use Encode;

    utf8::decode( $in_place );         # decodes the variable in place
    $decoded = decode_utf8( $raw );    # returns a decoded copy

+4

Jb. Feb 08 '11 at 14:01

You have to make sure that Perl understands these names as UTF-8 encoded strings. So far, I don't think it does. See the full overview of UTF-8 in Perl.

You can test the strings that hold the names with Encode::is_utf8 and decode them with Encode::decode('UTF-8', $your_string).
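As a rough illustration of that check, the sketch below decodes a hand-written byte string (an assumption standing in for real captured output) and compares the two forms. Note that Encode::is_utf8 only reports Perl's internal flag; it does not validate that arbitrary bytes are well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Raw UTF-8 octets for "åäö", as they might arrive from a pipe or socket.
# Hypothetical sample data, written as escapes so this source stays ASCII.
my $raw = "\xC3\xA5\xC3\xA4\xC3\xB6";

# decode() returns a character string; $raw itself is left untouched
# because no CHECK argument is passed.
my $decoded = decode('UTF-8', $raw);

printf "raw:     flagged=%d length=%d\n", (is_utf8($raw)     ? 1 : 0), length $raw;
printf "decoded: flagged=%d length=%d\n", (is_utf8($decoded) ? 1 : 0), length $decoded;
```

The undecoded string counts six bytes, while the decoded one counts three characters, which is the view a regex needs.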

UTF-8 is pretty confusing in Perl, IMHO. You must be very patient.

To print UTF-8 strings properly, you should use something like this in the script:

    BEGIN {
        binmode(STDOUT, ':encoding(UTF-8)');
        binmode(STDERR, ':encoding(UTF-8)');  # Error messages
    }

Once Perl understands your UTF-8 names, you can also apply regexes to them correctly.

+3

wk Feb 08 '11 at 14:05

Recent versions of Net::OpenSSH have built-in support for encoding and decoding data in the capture methods:

 my @getallvms = $ssh->capture({stream_encoding => 'utf8'}, 'vim-cmd vmsvc/getallvms');

+3

salva Mar 29 '11 at 15:57

Greg Bacon · Accepted Answer · 2011-02-08T15:04:19+0000

The subpattern

    (?<name> .+?)\s+

in your regex means "match and remember one or more non-newline characters, but stop as soon as you find whitespace," so $name contains TEST because the pattern stops matching when it sees the space in front of Box.

The VI Toolkit wiki gives example output from the getallvms subcommand:

    # vmware-vim-cmd -H 10.10.10.10 -U root -P password /vmsvc/getallvms
    Vmid    Name      File                          Guest OS          Version   Annotation
    64      bartPE    [store] BartPE/BartPE.vmx     winXPProGuest     vmx-04
    96      trustix   [store] Trustix/Trustix.vmx   otherLinuxGuest   vmx-04

The letter case is slightly different from the example in your question, but it looks like we can search for [store] as a bumper for the match:

 /^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix

The non-greedy quantifier +? means match one or more, but hand control over to the rest of the pattern as soon as possible. Remember that [ has special meaning in regular expressions, so the pattern \[ matches a literal left bracket rather than introducing a character class. The /i switch makes the match case-insensitive, so it accepts both [store] and [Store].
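To see the effect of the bumper in isolation, here is a small sketch that runs both patterns against one line. The sample line is an assumption: it is modelled on the output quoted in the question, shortened and kept ASCII-only so that encoding plays no part.

```perl
use strict;
use warnings;

# Hypothetical sample line modelled on the getallvms output in the question.
my $line = '416 TEST Box one [Store] TEST Box/TEST Box.vmx slesGuest vmx-04';

# Without a bumper, the non-greedy .+? gives up at the first whitespace.
my ($short) = $line =~ /^ \d+ \s+ (.+?) \s+ /x;

# With \[store\] as the bumper, .+? has to keep stretching until the
# literal bracket follows; /i lets "store" match "Store".
my ($full) = $line =~ /^ \d+ \s+ (.+?) \s+ \[store\] /xi;

print "without bumper: $short\n";   # TEST
print "with bumper:    $full\n";    # TEST Box one
```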

I think of this technique as bookending, or stretching between anchors. If you want to extract a piece of text that is difficult to characterize, look for surrounding features that are easy to match, often as simple as ^ or $. Then use a stretchy pattern to capture everything in between, usually (.+) or (.+?). Read the "Quantifiers" section of the perlre documentation for an explanation of the many options.

This fixes the immediate problem, and you can also add polish in a few areas.

Do not use $1, $2, and friends unconditionally! Always check that the pattern matched before using capture variables. For instance:

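    if (/(foo|bar|baz)/) {
        print "got $1\n";
    }
    else {
        print "no match\n";
    }

An unguarded print $1 can produce surprising results that are difficult to debug, because a failed match leaves the capture variables holding whatever the last successful match put there.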
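A minimal sketch of how that goes wrong, using made-up inputs: the second match fails, yet $1 still carries the value from the first.

```perl
use strict;
use warnings;

my @report;

# First match succeeds and sets $1 to "bar".
my $first = 'bar';
$first =~ /(foo|bar|baz)/;
push @report, "after 'bar':  $1";

# Second match fails, yet $1 still holds "bar" from the earlier success.
my $second = 'quux';
$second =~ /(foo|bar|baz)/;
push @report, "after 'quux': $1";

# Guarded version: only touch $1 when the match succeeded.
if ($second =~ /(foo|bar|baz)/) {
    push @report, "guarded: got $1";
}
else {
    push @report, "guarded: no match";
}

print "$_\n" for @report;
```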

Judicious use of Perl's defaults can help highlight the computation and let the mechanism fade into the background. Dropping $vm in favor of $_ as the implicit loop variable and implicit match target makes for a more pleasing result.
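For illustration, a short sketch of a loop that leans on $_ throughout. The input lines are invented stand-ins for captured command output.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for captured command output.
my @lines = ("64 bartPE\n", "96 trustix\n", "garbage\n");

my @ids;
for (@lines) {                 # no loop variable named: each element is aliased to $_
    chomp;                     # chomp defaults to $_
    next unless /^(\d+)\s/;    # a bare match operator tests $_
    push @ids, $1;             # safe: we only get here after a successful match
}

print "@ids\n";
```

Because $_ is an alias, the chomp also strips the newlines from the elements of @lines themselves.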

Your comments merely translate the Perl into English. The most useful comments explain why, not what. Also keep in mind Rob Pike's advice on comments:

If your code needs a comment to be understood, it would be better to rewrite it so it's easier to understand.

In the assignments from %+, the quotation marks do nothing useful. The values are already strings, so remove the quotation marks:

    my $id = $+{id};
    my $name = $+{name};

Below is a modified version of your code that captures everything after the number but before [store] in $name. The utf8 pragma declares that your source code (not, as is commonly mistaken, your input) contains UTF-8. The test below uses a canned echo to simulate the output of vim-cmd for the Swedish VM.

As Tom suggested, I use the Encode module to decode the output that comes through the SSH connection and to encode it for the local host before printing it.

The perlunifaq documentation recommends decoding external data into Perl's internal format as soon as it arrives, and encoding any output immediately before writing it. I assume that the value returned from $ssh->capture(...) uses UTF-8 encoding, i.e. that the remote host sends UTF-8. We see the expected result because I am running a modern Linux distribution and ssh-ing back into it, but in the wild you may have to deal with some other encoding.

You may get away with skipping the calls to decode and encode because Perl's internal format happens to match the encoding you are using. In general, however, cutting that corner can get you into trouble.
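One way to see the difference is the sketch below. It assumes the input is a hand-written UTF-8 byte string rather than real SSH output, and applies the same regex before and after decoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 octets for "åäö!" as they would arrive over the wire.
# Hypothetical sample, written as escapes to keep this source ASCII.
my $octets = "\xC3\xA5\xC3\xA4\xC3\xB6!";

# Skipping decode: the regex sees six unrelated bytes, not three letters.
my $undecoded_ok = $octets =~ /^\w{3}!$/ ? 1 : 0;

# Decode on input, work with characters, encode again on output.
my $chars      = decode('UTF-8', $octets);
my $decoded_ok = $chars =~ /^\w{3}!$/ ? 1 : 0;
my $out        = encode('UTF-8', uc $chars);

print "matched without decode: $undecoded_ok\n";
print "matched after decode:   $decoded_ok\n";
print "uppercased output:      $out\n";
```

Only the decoded string is treated as three word characters followed by an exclamation mark, and only it uppercases correctly.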

Finally, the code!

    #! /usr/bin/env perl

    use strict;
    use utf8;
    use warnings;

    use Encode;
    use Net::OpenSSH;

    my %ssh_options = ();
    my $ssh = Net::OpenSSH->new('localhost', %ssh_options);

    # Create an array and capture the ESX\ESXi output from the current server
    #my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
    my @getallvms = $ssh->capture(<<EOEcho);
    echo -e 'JUNK\n416 TEST Box åäö!"'\\'\\''*# [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx slesGuest vmx-04'
    EOEcho
    shift @getallvms;

    for (@getallvms) {
      $_ = decode "utf8", $_, Encode::FB_CROAK;
      if (/^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix) {
        my $id = $+{id};
        my $name = $+{name};
        print encode("utf8", $id),   "\n",
              encode("utf8", $name), "\n",
              "\n";
      }
      else {
        print "no match\n";
      }
    }

Output:

    416
    TEST Box åäö!"''*#
