Perl: How to match FULLWIDTH LATIN SMALL LETTER characters

I use listadmin to manage many Mailman-based mailing lists. I have a long list of subjects and addresses configured to block spam. Recently I have been getting smarter spam that uses fancy Unicode characters, for example:

Subject: Ａl l ｔhe ad ult mov ies you' ve see nare nothing c ompari- ng to our exx xci ti ng compilation of 1３' 000 mov ies in HD t hat arｅ av ailable for y ou now!

or

Subject: HD qua lit y vi d eos ad pho to graph sof ho tc hic ks
here for u

Now I want to use Perl regular expressions to block this. Piping these subjects through hexdump showed that many of the characters are FULLWIDTH LATIN SMALL LETTER characters. However, \p{FULLWIDTH LATIN SMALL LETTER} does not work; it fails with: Can't find Unicode property definition "FULLWIDTH LATIN SMALL LETTER"

So the question is: is there a \p{something} that matches these full-width characters? Alternatively, is there another way to match them?

2 answers

The perlunicode page lists the available Unicode character properties. I found it referenced from perlrebackslash, which documents special character classes and backslash sequences such as \p{...} in regular expressions.

In short, for all but the most common properties, both the property type and the property value are required, separated by : or =. However, there does not appear to be a predefined property for full-width characters.

But there is a Block / Blk property that can take Halfwidth and Fullwidth Forms (U+FF00 - U+FFEF) as its value:

 /\p{Block=Halfwidth and Fullwidth Forms}/ 

This will match your input (tested on Perl v5.16.3).
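As a minimal sketch of how that property might be applied to a decoded subject line: the helper name count_fullwidth_forms and the sample subject are assumptions made for illustration, and the fullwidth characters are written as \x{...} escapes so the script does not depend on the source file encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the characters that belong to the
# Halfwidth and Fullwidth Forms block (U+FF00 - U+FFEF).
sub count_fullwidth_forms {
    my ($text) = @_;
    # The "= () =" idiom forces list context and yields the match count.
    my $count = () = $text =~ /\p{Block=Halfwidth and Fullwidth Forms}/g;
    return $count;
}

# Made-up sample: \x{FF21} is FULLWIDTH LATIN CAPITAL LETTER A,
# \x{FF54} is FULLWIDTH LATIN SMALL LETTER T.
my $subject = "Subject: \x{FF21}l l \x{FF54}he ad ult mov ies";

printf "%d character(s) from the block\n", count_fullwidth_forms($subject);
```

Note that the pattern only works on character strings. A raw mail header would first have to be decoded (MIME words and UTF-8) before matching.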


A useful tool for this is uniprops.

 $ uniprops U+FF41
 U+FF41 ‹ａ› \N{FULLWIDTH LATIN SMALL LETTER A}
     \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
     All Any Alnum Alpha Alphabetic Assigned InHalfwidthAndFullwidthForms
     Cased Cased_Letter LC Changes_When_Casemapped CWCM
     Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT
     Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase
     Halfwidth_And_Fullwidth_Forms Hex XDigit Hex_Digit ID_Continue IDC
     ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower Lowercase
     Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum
     X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
     X_POSIX_XDigit

As you can see, \p{Block=Halfwidth and Fullwidth Forms} can also be written \p{In Halfwidth and Fullwidth Forms}.
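The block also contains other characters, such as fullwidth capitals and halfwidth katakana. Since the uniprops output lists both the block and \p{Ll} for U+FF41, one possible way to narrow the match to the fullwidth small letters is to intersect the two properties with a lookahead. This is only a sketch; the helper name is_fullwidth_small and the sample code points are chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The lookahead restricts the match to the block; \p{Ll} then
# requires the same character to be a lowercase letter.
my $fullwidth_small = qr/(?=\p{InHalfwidthAndFullwidthForms})\p{Ll}/;

# Hypothetical helper: true if the single character given is a
# lowercase letter inside the Halfwidth and Fullwidth Forms block.
sub is_fullwidth_small {
    my ($char) = @_;
    return $char =~ /\A$fullwidth_small\z/ ? 1 : 0;
}

# Sample code points: ASCII a, fullwidth A, fullwidth a, fullwidth z,
# halfwidth katakana N.
for my $cp (0x0061, 0xFF21, 0xFF41, 0xFF5A, 0xFF9D) {
    printf "U+%04X %s\n", $cp,
        is_fullwidth_small(chr $cp) ? 'match' : 'no match';
}
```

Only U+FF41 and U+FF5A should be reported as matches here, because the other three are either outside the block or not lowercase letters.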


You can use charnames::viacode to get character names from your codes:

 #!/usr/bin/perl
 use warnings;
 use strict;
 use utf8;            # the source below contains literal fullwidth characters
 use charnames qw();  # load charnames::viacode without importing anything

 my $string = q(Subject: Ａl l ｔhe ad ult mov ies you' ve see nare nothing )
            . q(c ompari- ng to our exx xci ti ng compilation of 1３' 000 )
            . q(mov ies in HD t hat arｅ av ailable for y ou now!);

 # Look up each character's Unicode name and count those containing FULLWIDTH.
 my $count = grep /FULLWIDTH/, map charnames::viacode(ord), split //, $string;
 print "$count fullwidth characters.\n";
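The same name lookup can be narrowed to the exact name prefix from the question. The following is a sketch along those lines, assuming Perl 5.10 or later for the // operator; the helper name names_with_prefix and the sample text are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: return the Unicode names of all characters in
# $text whose name starts with $prefix.
sub names_with_prefix {
    my ($text, $prefix) = @_;
    return grep { index($_, $prefix) == 0 }
           map  { charnames::viacode(ord $_) // '' }   # unnamed code points give ''
           split //, $text;
}

# Made-up sample with fullwidth e, fullwidth capital A and fullwidth t.
my $subject = "t hat ar\x{FF45} av ailable for y ou now! \x{FF21}l l \x{FF54}he";

print "$_\n" for names_with_prefix($subject, 'FULLWIDTH LATIN SMALL LETTER');
```

With this sample, the fullwidth capital A is skipped because its name begins with FULLWIDTH LATIN CAPITAL LETTER, so only the names for the fullwidth e and t are printed.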

Source: https://habr.com/ru/post/944657/

