Why does [^ \ w] match some word characters, but not [^ \ p {Word}]?

Question

Why does [^ \ w] match some word characters, but not [^ \ p {Word}]?

I wrote a Perl script that prints characters matching the Unicode property. Everything seems to be working so far.

But he prints ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ ÿamong the characters matching [^\w]. These characters should rather match \w. Oddly enough, they match \p{Word}.

I tried without success:

map { decode ( "UTF-8", $_ ) }
map { pack 'U0C*', unpack 'C*', $_ }

How can I make [^\w]words that do not correspond to these characters?

chars.pl

#!/usr/bin/perl

use warnings;
use strict;
use utf8;

binmode STDOUT, ':utf8';

my $c;
my $cols = 80;
my $arg = shift;
my $regex = qr/$arg/;

for ( map { chr } 0x20 .. 0xFFFF )
{
  next if /\p{Unassigned}|\p{NChar}|\p{Cs}/;

  if ( $_ =~ $regex )
  {
    print STDOUT;
    print STDOUT "\n" if ++$c % $cols == 0;
  }

}

print STDOUT "\n" if defined $c and $c % $cols != 0;
exit 0;

Good:

$ ./chars.pl '\p{Cyrillic}'
ЀЂЃЄЅІЇЈЉЊЋЌЍЎЏ
ѐђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҇ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡ
ҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱ
ӲӳӴӵӶӷӸӹӺӻӼӽӾӿԀԁԂԃԄԅԆԇԈԉԊԋԌԍԎԏԐԑԒԓԔԕԖԗԘԙԚԛԜԝԞԟԠԡԢԣԤԥԦԧᴫᵸⷠⷡⷢⷣⷤⷥⷦⷧⷨⷩⷪⷫⷬⷭⷮⷯⷰⷱⷲⷳⷴⷵⷶⷷ
ⷸⷹⷺⷻⷼⷽⷾⷿꙀꙁꙂꙃꙄꙅꙆꙇꙈꙉꙊꙋꙌꙍꙎꙏꙐꙑꙒꙓꙔꙕꙖꙗꙘꙙꙚꙛꙜꙝꙞꙟꙠꙡꙢꙣꙤꙥꙦꙧꙨꙩꙪꙫꙬꙭꙮ꙯꙰꙱꙲꙳꙼꙽꙾ꙿꚀꚁꚂꚃꚄꚅꚆꚇꚈꚉꚊꚋꚌꚍꚎꚏ
ꚐꚑꚒꚓꚔꚕꚖꚗ
$

Good:

$ ./chars.pl '[^\p{Word}]' | grep É
$

Poorly:

$ ./chars.pl '[^\w]' | grep É
°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
$

Perl v5.14.2

+4

perl character-class

nr Aug 4 '16 at 15:16

source share

1 answer

Markus Laire · Accepted Answer · 2016-08-04T15:38:45+0000

Perl's Unicode support is a huge topic, see for example this answer

\w \p{Word}, /u ( Perl 5.14).

-

use v5.14;

( ) unicode_strings /u. :

use feature 'unicode_strings';

- /u, , .

perlre manpage. /d, /u, /a /l.

\w perlrecharclass manpage.

Why does [^ \ w] match some word characters, but not [^ \ p {Word}]?

chars.pl

More articles: