How to match Russian words in Unicode text using Perl?

Question

How to match Russian words in Unicode text using Perl?

I have a website that I want to scrape, say http://www.ru.wikipedia.org/wiki/perl . The site is in Russian, and I want to pull out all the Russian words. Matching with \w+ does not work, and matching with \p{L}+ extracts everything.

How can I do it?

+3

regex perl unicode

mike May 01, '09 at 2:25

4 answers

All of these answers are overcomplicated. Use this:

$text =~ /\p{Cyrillic}/

bam.
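Note that a pattern like this only tests whether the text contains at least one Cyrillic character. A minimal sketch of pulling out whole words with the same property, assuming the text has already been decoded to characters; the helper name cyrillic_words and the sample string are made up for illustration:

```perl
use strict;
use warnings;
use utf8;                               # this source file contains Cyrillic literals

# Hypothetical helper: return every maximal run of Cyrillic-script
# characters found in an already-decoded string.
sub cyrillic_words {
    my ($text) = @_;
    return $text =~ /(\p{Cyrillic}+)/g;
}

binmode STDOUT, ':encoding(UTF-8)';     # encode characters on output
my @words = cyrillic_words('Perl это язык программирования, created by Larry Wall');
print "$_\n" for @words;                # prints the three Russian words, one per line
```

Latin words and punctuation are skipped because they do not belong to the Cyrillic script, which is why this is narrower than \p{L}+.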

+3

Karel bílek Feb 05 '13 at 15:30

Try this:

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get("http://ru.wikipedia.org/wiki/Perl");

die $response->status_line unless $response->is_success;

my $content = $response->decoded_content;

my @russian = $content =~ /\s([\x{0400}-\x{052F}]+)\s/g;

print map { "$_\n" } @russian;

I believe the Cyrillic block starts at 0x0400 and the Cyrillic Supplement block ends at 0x052F, so this should get a lot of words.
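One caveat with the \s...\s delimiters: each match consumes the whitespace after the word, so that whitespace cannot also start the next match, and words at the very start or end of the string are missed. A small sketch of the difference, assuming whitespace-delimited words are what is wanted; the helper name range_words and the sample string are illustrative only:

```perl
use strict;
use warnings;
use utf8;                               # Cyrillic literals in the source

# Hypothetical helper: same 0x0400-0x052F range, but delimited with
# lookarounds so the surrounding whitespace is checked without being consumed.
sub range_words {
    my ($text) = @_;
    return $text =~ /(?<!\S)([\x{0400}-\x{052F}]+)(?!\S)/g;
}

my $sample = 'один два три четыре';

# Pattern from the answer: only the second word has whitespace available on both sides.
my @consuming  = $sample =~ /\s([\x{0400}-\x{052F}]+)\s/g;
my @lookaround = range_words($sample);

print scalar(@consuming), " vs ", scalar(@lookaround), "\n";   # prints "1 vs 4"
```

Like the original pattern, this still skips words that touch punctuation, such as a word in parentheses; dropping the delimiters entirely and matching just the character class avoids that.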

0

Chas. Owens May 01, '09 at 2:38

Just leaving this here. To match a specific Russian word:

use utf8;
...
utf8::decode($text);
$text =~ //;
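A fuller sketch of the same idea, assuming the input arrives as UTF-8 bytes. The word 'язык' and the helper name mentions_word are placeholders chosen for illustration, not part of the answer:

```perl
use strict;
use warnings;
use utf8;    # so the Cyrillic literal in the pattern is read as characters

# Hypothetical helper: true if the UTF-8 encoded input contains the
# placeholder word 'язык' as a whole word (not as part of a longer one).
sub mentions_word {
    my ($bytes) = @_;
    my $text = $bytes;
    utf8::decode($text);    # UTF-8 bytes -> characters, in place
    return $text =~ /(?<!\p{Cyrillic})язык(?!\p{Cyrillic})/ ? 1 : 0;
}

# Build UTF-8 bytes for the demo, as if read from a file or socket.
my $bytes = 'Perl это язык программирования';
utf8::encode($bytes);
print mentions_word($bytes) ? "found\n" : "not found\n";
```

Without the decoding step the pattern would be compared against raw bytes and would not match.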

-1

dezhik Jan 21 '15 at 10:41

Bron gondwana · Accepted Answer · 2009-05-01T03:08:19+0000

perl -MLWP::Simple -e 'getprint "http://ru.wikipedia.org/wiki/Perl"'
403 Forbidden <URL:http://ru.wikipedia.org/wiki/Perl>

Well, that doesn't help!

If you download a copy of the page first, this works:

use Encode;

local $/ = undef;
my $text = decode_utf8(<>);

my @words = ($text =~ /([\x{0400}-\x{04ff}]+)/gs);

foreach my $word (@words) {
  print encode_utf8($word) . "\n";
}
