How to match Russian words in Unicode text using Perl?

I have a website that I want to run a regex on, say http://www.ru.wikipedia.org/wiki/perl . The site is in Russian, and I want to pull out all the Russian words. Matching with \w+ does not work, and matching with \p{L}+ extracts everything.

How can I do it?

+3
4 answers
perl -MLWP::Simple -e 'getprint "http://ru.wikipedia.org/wiki/Perl"'
403 Forbidden <URL:http://ru.wikipedia.org/wiki/Perl>

Well, that doesn't help!

If you download a copy of the page first, this works:

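use strict;
use warnings;
use Encode;

# Slurp the whole input (file named on the command line, or STDIN).
local $/ = undef;
my $text = decode_utf8(<>);

# Collect every run of characters from the Cyrillic block (U+0400-U+04FF).
my @words = ($text =~ /([\x{0400}-\x{04ff}]+)/g);

foreach my $word (@words) {
  print encode_utf8($word) . "\n";
}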
+3

All of these answers are overcomplicated. Use this:

$text =~/\p{cyrillic}/

bam.
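
Note that /\p{cyrillic}/ on its own only tests whether the text contains at least one Cyrillic character. To pull out whole words, one option is to add a + quantifier and match with /g in list context. A minimal sketch, assuming the text has already been decoded to Perl characters; the sample string and the cyrillic_words helper are made up for illustration:

```perl
use strict;
use warnings;
use utf8;                                  # this source file contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';        # encode wide characters on output

# Hypothetical helper: return every run of Cyrillic-script characters.
sub cyrillic_words {
    my ($text) = @_;
    return $text =~ /(\p{Cyrillic}+)/g;    # list context + /g returns all captures
}

my $sample = 'Perl (Перл) is a язык программирования, v5.';
print "$_\n" for cyrillic_words($sample);
```

Latin words, digits and punctuation are skipped because they do not belong to the Cyrillic script.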

+3

Try this:

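#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

# Encode wide characters as UTF-8 when printing.
binmode STDOUT, ':encoding(UTF-8)';

my $ua = LWP::UserAgent->new;

my $response = $ua->get("http://ru.wikipedia.org/wiki/Perl");

die $response->status_line unless $response->is_success;

# decoded_content returns character data, decoded per the response charset.
my $content = $response->decoded_content;

# Whitespace-delimited runs of Cyrillic / Cyrillic Supplement characters.
# Lookarounds are used so the delimiter is not consumed, which would
# otherwise make the match skip every second word in a sequence.
my @russian = $content =~ /(?<!\S)([\x{0400}-\x{052F}]+)(?!\S)/g;

print map { "$_\n" } @russian;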

I believe the Cyrillic block starts at 0x0400 and the Cyrillic Supplement block ends at 0x052F, so this should catch a lot of words.
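
One caveat with a hard-coded range: Unicode also defines Cyrillic letters outside 0x0400-0x052F (for example in the Cyrillic Extended-B block), which the range misses but the \p{Cyrillic} script property covers. Ordinary Russian text should not need them. A small check, using U+A640 as an arbitrary example character of my own choosing and assuming a reasonably recent Perl:

```perl
use strict;
use warnings;

my $ch = "\x{A640}";    # CYRILLIC CAPITAL LETTER ZEMLYA (Cyrillic Extended-B)

# 1 if the character falls in the hard-coded range, else 0.
my $in_range  = $ch =~ /[\x{0400}-\x{052F}]/ ? 1 : 0;
# 1 if the character belongs to the Cyrillic script, else 0.
my $in_script = $ch =~ /\p{Cyrillic}/        ? 1 : 0;

print "range: $in_range, script: $in_script\n";
```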

0

Just leaving this here. To match a specific Russian word:

use utf8;
...
utf8::decode($text);
$text =~ //;
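
The Russian word inside the pattern did not survive on this page. As an illustration only, here is a sketch that assumes the word to match is Перл (my placeholder, not necessarily the word the answer used) and that the input arrives as raw UTF-8 bytes:

```perl
use strict;
use warnings;
use utf8;        # lets the Cyrillic literal in the pattern be read as characters

# Simulated raw UTF-8 bytes, as they would arrive from a file or socket.
my $text = "\xD0\x9F\xD0\xB5\xD1\x80\xD0\xBB 5";    # bytes for "Перл 5"

utf8::decode($text);                   # convert bytes to characters in place
my $found = $text =~ /Перл/ ? 1 : 0;   # placeholder word, chosen for illustration
print "found: $found\n";
```

Both sides have to be characters for the match to work: use utf8 takes care of the literal in the source, and utf8::decode takes care of the input.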
-1

Source: https://habr.com/ru/post/1707457/

