Extract substrings from a string using a regex in Perl?

I am trying to extract the substrings that match a pattern in a string. For example, I have text like the one below:

[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./. [ Mr./NNP Vinken/NNP ] is/VBZ [ chairman/NN ] of/IN 

and I want to extract everything that was before the slash (/), and everything that was after the slash, but somehow my regular expression extracts the first substring and ignores the rest of the substrings in the string.

My output looks something like this:

 tag:Pierre/NNP Vinken - word:Pierre/NNP Vinken/NNP ->1
 tag:, - word:,/, ->1
 tag:61/CD years - word:61/CD years/NNS ->1
 tag:old/JJ ,/, will/MD join - word:old/JJ ,/, will/MD join/VB ->1
 tag:the/DT board - word:the/DT board/NN ->1
 tag:as - word:as/IN ->1
 tag:a/DT nonexecutive/JJ director/NN Nov./NNP 29 - word:a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ->1
 tag:. - word:./. ->1
 tag:Mr./NNP Vinken - word:Mr./NNP Vinken/NNP ->1
 tag:is - word:is/VBZ ->1
 tag:chairman - word:chairman/NN ->1
 tag:of - word:of/IN ->1

but what I really want looks something like this:

 tag:NNP - word:Pierre ->1
 tag:NNP - word:Vinken ->1
 tag:, - word:, ->1
 tag:CD - word:61 ->1
 . . etc.

The code I used:

    while (my $line = <$fh>) {
        chomp $line;
        #remove square brackets
        $line=~s/[\[\]]//;
        while($line =~m/((\s*(.*))\/((.*)\s+))/gi) {
            $word=$1;
            $tag=$2;
            #remove whitespace from left and right of string
            $word=~ s/^\s+|\s+$//g;
            $tag=~ s/^\s+|\s+$//g;
            $tags{$tag}++;
            $tagHash{$tag}{$word}++;
        }
    }
    foreach my $str (sort keys %tagHash) {
        foreach my $s (keys %{$tagHash{$str}} ) {
            print "tags:$str - word: $s-> $tagHash{$str}{$s}\n";
        }
    }

Any idea why my regex isn't behaving as it should?

EDIT:

The text files that I process contain special characters and punctuation, which means the files will have tokens like these: ''/'' "/" ,/, ./. ?/? !/! etc.

Therefore, I want to capture all of these, not just alphanumeric characters.

+5
2 answers

The outermost set of parentheses around your entire pattern captures into $1, which is clearly not intended. In addition, the greedy .*\/ matches everything up to the very last / in the line. Similarly, .*\s+ matches everything up to the very last whitespace.
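To see the effect of the outer parentheses and the greedy quantifiers, here is a minimal sketch that applies the pattern from the question once to a short sample string. The helper name original_captures and the sample input are only for illustration; they are not part of the question's code.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the question's original pattern once
# and returns whatever ends up in $1 and $2.
sub original_captures {
    my ($line) = @_;
    return $line =~ m/((\s*(.*))\/((.*)\s+))/ ? ( $1, $2 ) : ();
}

my ( $c1, $c2 ) = original_captures("Pierre/NNP Vinken/NNP , ");

# $1 is the whole match (outer parentheses);
# $2 runs up to the LAST slash because .* is greedy.
print "\$1 = '$c1'\n";
print "\$2 = '$c2'\n";
```

This is why the question's output shows whole phrases as the "word" and everything before the final slash as the "tag".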

One way to do this is to use a negated character class

 my ($word, $tag) = m{ ([^/\s]+) / ([^/\s]+) }x; 

The pattern [^/\s]+ matches a run of one or more consecutive characters, each of which is anything other than / or whitespace. Thus you capture one "word" before the / and one after it. If you instead took "everything after the slash", as the text of the question says, it would be unclear where to stop before the next slash.

Your loop may then look like

    while (my $line = <$fh>) {
        while ( $line =~ m{ ([^/\s]+) / ([^/\s]+) }gx ) {
            $tagHash{$2}{$1}++;
        }
    }
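As a self-contained sketch of the loop above, the following runs the same pattern on part of the sample sentence from the question. The helper word_tag_pairs and the hard-coded $line are assumptions made for the example; in real use the lines would come from the file handle.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [word, tag] pairs found in a line.
sub word_tag_pairs {
    my ($line) = @_;
    my @pairs;
    # Under /x the spaces inside the pattern are ignored.
    while ( $line =~ m{ ([^/\s]+) / ([^/\s]+) }gx ) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

my $line = '[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ';

my %tagHash;
$tagHash{ $_->[1] }{ $_->[0] }++ for word_tag_pairs($line);

for my $tag ( sort keys %tagHash ) {
    for my $word ( sort keys %{ $tagHash{$tag} } ) {
        print "tag:$tag - word:$word -> $tagHash{$tag}{$word}\n";
    }
}
```

Note that this relies on the brackets being separated from the tokens by whitespace, as they are in the sample text; a bracket glued to a token would be captured as part of it.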

The other count (%tags) seems unrelated, so I left it out to focus on the question.


However, there is a catch.

This approach cannot detect when a line deviates from the expected format. For instance,

  word1 / tag1 word2 / tag2 / tag3 / word4 / tag4

quietly produces wrong results. Some malformed pieces are simply skipped, but there are many bad cases that go unnoticed.

One way to catch this is to pre-process the line, checking that there are at least two words between every pair of slashes, at least one word before the first slash, and at least one after the last. This means that each line is processed twice, and the code also gets messier. For instance,

    while (my $line = <$fh>) {
        my @parts = split '/', $line;
        if (not shift @parts or not pop @parts or grep { 2 > split } @parts) {
            warn "Unexpected format: $line";
            next;
        }
        $tagHash{$2}{$1}++ while $line =~ m{ ([^/\s]+) / ([^/\s]+) }gx;
    }

This check modifies the @parts array, so if this array is needed later, you'd better use

 if (!$parts[0] or !$parts[-1] or grep { 2 > split } @parts[ 1..@parts-2 ]) { ... 

where instead of grep you can also use the short-circuiting any from List::Util
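A minimal sketch of the same kind of check written with any, assuming a List::Util recent enough to export it. The function name looks_well_formed is hypothetical, and two details are my own assumptions rather than part of the check above: split is given a limit of -1 so that a trailing slash is not silently dropped, and the first and last pieces are tested for a non-whitespace character instead of plain truthiness.

```perl
use strict;
use warnings;
use List::Util qw(any);

# Hypothetical helper: returns 1 if the line passes the format check, else 0.
sub looks_well_formed {
    my ($line) = @_;

    # Assumption: limit -1 keeps trailing empty fields, so "a/b c/" is caught.
    my @parts = split m{/}, $line, -1;
    return 0 if @parts < 2;

    # At least one word before the first slash and after the last one.
    return 0 if $parts[0] !~ /\S/ or $parts[-1] !~ /\S/;

    # Every piece between two slashes must hold at least two words.
    my @inner = @parts[ 1 .. $#parts - 1 ];
    return ( any { my @w = split ' ', $_; @w < 2 } @inner ) ? 0 : 1;
}

for my $line ( 'w1/t1 w2/t2', 'w1/t1 w2/t2/t3 w4/t4' ) {
    printf "%-22s %s\n", $line, looks_well_formed($line) ? 'ok' : 'unexpected format';
}
```

Unlike grep, any stops at the first piece that fails, which only matters for very long lines.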

Another way is to change the approach and parse the line carefully, rather than blindly hopping from one regex match to the next. Since the first and last pieces may hold only one word, this can be hard to do with a single regular expression. It is probably clearer and more practical to split the line and work with the resulting array.

It is hard to imagine data that always follows the expected format, so I would suggest considering some checking of this kind.
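One possible reading of "split and work with the array" is sketched below. It is an assumption on my part, not the answerer's code: the line is split on whitespace, the bracket tokens are skipped, and each remaining token must consist of exactly one word, one slash, and one tag. Anything else is collected separately so it can be reported. The name parse_tokens and the sample input are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\@pairs, \@bad) for one line.
sub parse_tokens {
    my ($line) = @_;
    my ( @pairs, @bad );
    for my $token ( split ' ', $line ) {
        # Assumption: standalone brackets only mark chunks and carry no tag.
        next if $token eq '[' or $token eq ']';

        # Exactly one slash, with something on both sides of it.
        if ( $token =~ m{\A ([^/]+) / ([^/]+) \z}x ) {
            push @pairs, [ $1, $2 ];
        }
        else {
            push @bad, $token;
        }
    }
    return ( \@pairs, \@bad );
}

my ( $pairs, $bad ) = parse_tokens('[ Mr./NNP Vinken/NNP ] is/VBZ odd/a/b');
print "pair: $_->[0] => $_->[1]\n" for @$pairs;
print "unexpected token: $_\n"     for @$bad;
```

Because each token is validated on its own, a malformed token does not shift the pairing of the tokens that follow it.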

+1

I assume your tokens have the form word/tag, where the word and the tag can contain anything except certain characters such as [, ] and whitespace (\s):

 \s*([^\[\]\s]+?)\/([^\[\]\s]+)\s*
     ^^^^^^^^^1

This regex is similar to your original pattern. (See DEMO )

Description:

1- This character class matches any single character that is not [ , ] or whitespace (\s)
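A short sketch of how this pattern could be used in a global match, assuming the same word/tag layout as in the question. The helper name bracket_safe_pairs and the sample input are hypothetical; the input deliberately has the brackets glued to the tokens to show that they are not captured.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [word, tag] pairs using the pattern above.
sub bracket_safe_pairs {
    my ($line) = @_;
    my @pairs;
    while ( $line =~ m{\s*([^\[\]\s]+?)/([^\[\]\s]+)\s*}g ) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

print "word:$_->[0] tag:$_->[1]\n"
    for bracket_safe_pairs('[Pierre/NNP Vinken/NNP],/,');
```

Since the character class does not exclude the slash itself, a token such as a/b/c would come out as word a and tag b/c, which may or may not be what the data calls for.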

+2

Source: https://habr.com/ru/post/1265335/

