How to match words from a list in a huge corpus using regexps (in Perl or a *nix terminal)?

Question

Given a list of nouns in a .txt file, one noun per line, such as:

hooligan
football
brother
bollocks

... and a separate .txt file containing a series of regular expressions, separated by new lines, for example:

[a-z]+\tNN(S)?
[a-z]+\tJJ(S)?

... I would like to run the regular expressions over every sentence of the corpus. Every time a regular expression matches, and the match contains one of the nouns in the list, I would like to print that noun and, separated by a tab, the regular expression that matched it. Here is an example of what the output could look like:

football    [a-z]+NN(S)?\ POS[a-z]+NN(S)?
hooligan    [a-z]+NN(S)?,,[a-z]+JJ[a-z]+NN(S)?
hooligan    [a-z]+NN(S)?,,[a-z]+JJ[a-z]+NN(S)?
football    [a-z]+NN(S)?[a-z]+NN(S)?
brother [a-z]+PP$[a-z]+NN(S)?
bollocks    [a-z]+DT[a-z]+NN(S)?
football    [a-z]+NN(s)?(be)VBZnotRB

Here is a sample of the corpus (one token per line, columns separated by tabs, sentences delimited by <s> tags):

<s>
Hooligans   hooligan    NNS 1   4   NMOD
,   ,   ,   2   4   P
unbridled   unbridled   JJ  3   4   NMOD
passion passion NN  4   0   ROOT
-   -   :   5   4   P
and and CC  6   4   CC
no  no  DT  7   9   NMOD
executive   executive   JJ  8   9   NMOD
boxes   box NNS 9   4   COORD
.   .   SENT    10  0   ROOT
</s>
<s>
Hooligans   hooligan    NNS 1   4   NMOD
,   ,   ,   2   4   P
unbridled   unbridled   JJ  3   4   NMOD
passion passion NN  4   0   ROOT
-   -   :   5   4   P
and and CC  6   4   CC
no  no  DT  7   9   NMOD
executive   executive   JJ  8   9   NMOD
boxes   box NNS 9   4   COORD
.   .   SENT    10  0   ROOT
</s>
<s>
Portsmouth  Portsmouth  NP  1   2   SBJ
bring   bring   VVP 2   0   ROOT
something   something   NN  3   2   OBJ
entirely    entirely    RB  4   5   AMOD
different   different   JJ  5   3   NMOD
to  to  TO  6   5   AMOD
the the DT  7   12  NMOD
Premiership Premiership NP  8   12  NMOD
:   :   :   9   12  P
football    football    NN  10  12  NMOD
    POS 11  10  NMOD
past    past    NN  12  6   PMOD
.   .   SENT    13  2   P
</s>
<s>
This    this    DT  1   2   SBJ
is  be  VBZ 2   0   ROOT
one one CD  3   2   PRD
of  of  IN  4   3   NMOD
Britain Britain NP  5   10  NMOD
    POS 6   5   NMOD
most    most    RBS 7   8   AMOD
ardent  ardent  JJ  8   10  NMOD
football    football    NN  9   10  NMOD
cities  city    NNS 10  4   PMOD
:   :   :   11  2   P
think   think   VVP 12  2   COORD
Liverpool   Liverpool   NP  13  0   ROOT
or  or  CC  14  13  CC
Newcastle   Newcastle   NP  15  19  SBJ
in  in  IN  16  15  ADV
miniature   miniature   NN  17  16  PMOD
,   ,   ,   18  15  P
wound   wind    VVD 19  13  COORD
back    back    RB  20  19  ADV
three   three   CD  21  22  NMOD
decades decade  NNS 22  19  OBJ
.   .   SENT    23  2   P
</s>

I have started on a Perl script for this that uses Tie::File.

Alternatively, could this be done with Unix tools (for example, cat and grep)?


regex grep perl nlp corpus

Albz 19 . '13 0:02


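Regexp::Assemble may help here: it combines several regular expressions into a single one and, with tracking enabled, can report which of the original patterns matched.

For example: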

#!/usr/bin/env perl    

use strict;
use warnings;
use 5.010;

use Regexp::Assemble;

my @nouns = qw( hooligan football brother bollocks );
my @patterns = ('[a-z]+\s+NN(S)?', '[a-z]+\s+JJ(S)?');

my $name_re = '(' . join('|', @nouns) . ')'; # Assumes no regex metacharacters

my $ra = Regexp::Assemble->new(track => 1);
$ra->add(@patterns);

local $/ = '<s>';

while (my $line = <DATA>) {
  my $match = $ra->match($line);
  next unless defined $match;

  while ($line =~ /$name_re/g) {
    say "$1\t\t$match";
  }
}


__DATA__
...

... where the __DATA__ section holds the corpus sample from the question. Note that I replaced \t with \s+ in the patterns.

Output:

hooligan        [a-z]+\s+NN(S)?
hooligan        [a-z]+\s+NN(S)?
football        [a-z]+\s+NN(S)?
football        [a-z]+\s+NN(S)?
football        [a-z]+\s+JJ(S)?
football        [a-z]+\s+JJ(S)?

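Note: because the patterns use \s rather than \t, the NN and JJ patterns match on any whitespace. If your columns are strictly tab-separated, switch back to \t.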
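To illustrate how \t and \s+ differ in patterns like these, here is a minimal sketch. The sample rows and the matches() helper are made up for illustration and are not part of the answer's script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample rows: tab-separated, space-separated (as after a
# copy-paste that lost the tabs), and a word/tag pair split over two lines.
my %rows = (
    tabbed           => "football\tNN",
    spaced           => "football    NN",
    split_over_lines => "football\nNN",
);

my $strict = qr/[a-z]+\tNN(?:S)?/;     # requires a literal tab
my $loose  = qr/[a-z]+\s+NN(?:S)?/;    # accepts any whitespace, including newlines

# Illustrative helper: returns 1 if the string matches the pattern, else 0.
sub matches {
    my ($re, $string) = @_;
    return $string =~ $re ? 1 : 0;
}

for my $name (sort keys %rows) {
    printf "%-17s \\t:%s  \\s+:%s\n", $name,
        (matches($strict, $rows{$name}) ? 'yes' : 'no'),
        (matches($loose,  $rows{$name}) ? 'yes' : 'no');
}
```

Only the tab-separated row matches the \t pattern, while \s+ matches all three rows, including the one split over a line break.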


Dave Sherohman 19 . '13 8:44

Albz · Accepted Answer · 2013-09-20T23:22:31+0000

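I ended up using Tie::File with </s> as the record separator, so the corpus is processed one sentence at a time. For each sentence I keep only the columns I need (the 2nd and 3rd) and then test each regular expression against the result.

Here is the script: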

use strict;
use Tie::File; #This module makes a file look like a Perl array, each array element corresponds to a line of the file.

if ($#ARGV < 0 ) {  print "Usage: perl albzcount.pl corpusfile\n"; exit; }

#read nouns list (.txt file with one word per line - line breaks LF)
my $nouns_list = "nouns.txt";
open(DAT, $nouns_list) || die("Could not open the config file $nouns_list or file doesn't exist!"); 
my @nouns_contained_in_list=<DAT>;
close(DAT);

# Reading regexp list (.txt file with one regexp per line - line breaks LF)
my $regex_list = "regexp.txt";
open(DAT, $regex_list) || die("Could not open the config file $regex_list or file doesn't exist!");
my @regexps_contained_in_list=<DAT>;
close(DAT);

# Reading Corpus File (each sentence is spread on more lines and separated by tag <s>)
my $corpusfile = $ARGV[0]; #Corpus filename (passed as an argument through the command)

# With TIE I don't load the entire file in an array. Perl thinks it an array but the file is actually read line by line
# This is the key to manipulate huge text files without running out of memory
tie my @raw_corpus_data, 'Tie::File', $corpusfile,  recsep => '</s>' or die "Can't read file: $!\n";

#START go through the sentences of the corpus (spread on multiple lines and separated by </s>), one by one
foreach my $raw_sentence (@raw_corpus_data){

    #take a single sentence (that is spread along different lines).
    #NB each line contains "columns" separated by tab
    my @corpus_sublines = split(/\n/, $raw_sentence);

    #declare variable. Later values will be appended to it
    #(named differently from the loop variable to avoid shadowing it)
    my $corpus_line = "";

    #for each line that composes a sentence
    foreach my $sentence_newline (@corpus_sublines){

        #explode by tab (column separator)
        my @corpus_columns = split(/\t/, $sentence_newline);

        #put together new sentences using just column 2 and 3 (lemma and tag) for each original sentence
        #($corpus_columns[1] is a single element; @corpus_columns[1] would be a one-element slice)
        $corpus_line .= "$corpus_columns[1]\t$corpus_columns[2]\n";

        #... Now the corpus has the format I want and can be processed
    }

    #foreach regex
    foreach my $single_regexp (@regexps_contained_in_list){

        # Remove the new lines (both \n and \r - depending on the OS) from the regexp present in the file.
        # Without this, the regular expressions read from the file don't always work.
        $single_regexp =~ s/\r|\n//g;

        #if the sentence analyzed in this cycle matches the regexp
        if ($corpus_line =~ m/$single_regexp/) {

            # explode by tab the matched results so the first word $onematch[0] can be isolated
            # $& is the entire matched string
            my @onematch = split(/\t/, $&);

            # OUTPUT RESULTS
            #if the matched noun is not empty and it is part of the word list
            #(\Q...\E escapes any regex metacharacters in the matched word)
            if ($onematch[0] ne "" && grep( /^\Q$onematch[0]\E$/, @nouns_contained_in_list )) {
                print "$onematch[0]\t$single_regexp\n";
            } # END OUTPUT RESULTS
        } #END if the sentence analyzed in this cycle matches the regexp
    } #END foreach regex
} #END go through the sentences of the corpus, one by one

# Untie the source corpus file
untie @raw_corpus_data;
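As an alternative to Tie::File, the same sentence-by-sentence processing can be sketched by setting the input record separator $/ to "</s>" and looking nouns up in a hash. This is only a sketch under stated assumptions: find_matches() and its arguments are illustrative names, not part of the script above. Unlike that script, it reports every match of a pattern within a sentence rather than just the first.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sketch only: find_matches() and its arguments are illustrative names.
#   $fh      - filehandle open on the corpus
#   $nouns   - hash ref whose keys are the nouns of interest
#   $regexps - array ref of pattern strings, e.g. '[a-z]+\tNN(S)?'
sub find_matches {
    my ($fh, $nouns, $regexps) = @_;
    my @compiled = map { [ $_, qr/$_/ ] } @$regexps;
    my @hits;

    local $/ = "</s>";    # read one sentence per record
    while (my $sentence = <$fh>) {
        # Keep only columns 2 and 3 (lemma and tag) of each token row.
        my $reduced = '';
        for my $row (split /\n/, $sentence) {
            my @cols = split /\t/, $row;
            next if @cols < 3;    # skips <s>, </s> and blank lines
            $reduced .= "$cols[1]\t$cols[2]\n";
        }

        for my $pair (@compiled) {
            my ($source, $re) = @$pair;
            while ($reduced =~ /$re/g) {
                my $matched = $&;                    # copy before running split
                my ($word) = split /\t/, $matched;   # first field of the match
                push @hits, "$word\t$source"
                    if defined $word && exists $nouns->{$word};
            }
        }
    }
    return @hits;
}

# Usage with a small in-memory corpus; real code would open the corpus file.
my $sample = "<s>\nHooligans\thooligan\tNNS\t1\t4\tNMOD\n"
           . "boxes\tbox\tNNS\t9\t4\tCOORD\n</s>\n";
open my $fh, '<', \$sample or die "Cannot open in-memory corpus: $!";
my %nouns = map { $_ => 1 } qw(hooligan football brother bollocks);
print "$_\n" for find_matches($fh, \%nouns, ['[a-z]+\tNN(S)?', '[a-z]+\tJJ(S)?']);
close $fh;
```

Reading with $/ keeps only one sentence in memory at a time, and the hash lookup avoids scanning the whole noun list for every match.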
