How can I capture parts of a word given tokens that are not fully contained in that word?

Question

I understand how to use regex in Perl as follows:

$str =~ s/expression/replacement/g;

I understand that if any part of the expression is enclosed in parentheses, it is captured and can be used in the replacement part, for example:

 $str =~ s/(a)/($1)dosomething/;

But is there a way to capture ($1) outside the regex expression?

I have a full word that is a string of consonants, for example bEdmA , its vowelized version baEodamaA (where a and o are vowels), and its tokenized form of two tokens separated by a space, bEd mA . I just want to get the vowelized form of each token from the full word, for example: baEoda , maA . I am trying to capture each token within the full vowelized word, so I have:

    $unvowelizedword = "bEdmA";
    $tokens[0] = "bEd";
    $tokens[1] = "mA";
    $vowelizedword = "baEodamA";
    foreach $t (@tokens) {
        # find the token within the full word, and capture its vowels
    }

I am trying to do something like this:

 $vowelizedword = m/($t)/;

This is completely wrong for two reasons: the token $t does not appear in the vowelized word in exactly its own form, such as bEd , but something like m/bEd/ would be more relevant. Also, how can I capture the match in a variable outside the regular expression?

The real question is: how can I capture baEoda and maA , given the tokens bEd and mA , from the full word baEodamaA ?

Edit

I realized from all the answers that I missed two important details.

Vowels are optional. So, if the tokens are “Al” and “ywm”, and the fully vowelized word is “Alyawmi”, then the output tokens will be “Al” and “yawmi”.
I mentioned only two vowels, but there are more, including two-character ones, like '~a'. Full list (although I don't think I need to mention it here):
@vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');

+4

regex perl token

user961627 Dec 14 '11 at 10:03


5 answers

Use the m// operator in the so-called "list context" like this:

my @tokens = ($input =~ m/capturing_regex_here/modifiershere);
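As a minimal sketch of this idea: in list context, m// returns the contents of its capture groups, so they can be stored in ordinary variables outside the regex. The helper name capture_pair and the hard-coded pattern are illustrative assumptions, not something taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the two captured groups,
# or an empty list when the pattern does not match.
sub capture_pair {
    my ($input) = @_;
    # In list context, m// returns the capture groups ($1, $2, ...).
    my @captures = ($input =~ m/^(b.?E.?d.?)(m.?A.?)$/);
    return @captures;
}

my ($first, $second) = capture_pair("baEodamaA");
print "$first $second\n" if defined $first;   # prints "baEoda maA"
```

The captured values live in $first and $second after the match, so nothing depends on $1 or $2 staying intact.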

0

fge Dec 14 '11 at 10:12


ETA: From what I now understand, you tried to say that you want to match an optional vowel after each token character.

With this, you can configure the $vowels variable to contain only the letters you are looking for. You can also just use . if you wish, to capture any character.

    use strict;
    use warnings;
    use Data::Dumper;

    my @tokens = ("bEd", "mA");
    my $full = "baEodamA";
    my $vowels = "[aeiouy]";
    my @matches;

    for my $rx (@tokens) {
        $rx =~ s/.\K/$vowels?/g;
        if ($full =~ /$rx/) {
            push @matches, $full =~ /$rx/g;
        }
    }

    print Dumper \@matches;

Output:

 $VAR1 = [ 'baEoda', 'mA' ];

Note that

 ... $full =~ /$rx/g;

does not require capturing groups in the regular expression.
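A small sketch of that behavior, using a hypothetical helper named all_matches (not part of the answer): with the /g flag in list context and no capture groups, m// returns every whole match.

```perl
use strict;
use warnings;

# Hypothetical helper: all non-overlapping matches of $rx in $text.
sub all_matches {
    my ($text, $rx) = @_;
    # /g in list context, no capture groups: each whole match is returned.
    my @found = $text =~ /$rx/g;
    return @found;
}

my @found = all_matches("baEodamA", qr/[aeiouy]/);
print "@found\n";   # prints "a o a"
```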

0

TLP Dec 14 '11 at 10:25


I suspect there is an easier way to do what you are trying to accomplish. The trick is not to make the regular expression code so complex that you forget what it actually does.

I can only begin to guess your task, but from your only example, it looks like you want to check that two subtokens are in a larger token, ignoring certain characters. I am going to assume that these subtokens have to be in order and cannot have anything else between them except these vowels.

To match the subtokens, I can use the \G anchor with the /g global flag in scalar context. This anchors the match to the position just after the end of the last match on the same scalar. That way I can have a separate pattern for each subtoken. This is a lot easier to manage since I only need to change the list of values in @subtokens .
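The \G mechanism on its own can be sketched as follows. The helper name match_in_sequence and the two sample patterns are assumptions made for illustration; each pattern must match exactly where the previous one stopped.

```perl
use strict;
use warnings;

# Hypothetical helper: match each pattern in turn, each one starting
# where the previous match ended. Returns the matched pieces, or an
# empty list as soon as one pattern fails.
sub match_in_sequence {
    my ($string, @patterns) = @_;
    my @pieces;
    for my $pattern (@patterns) {
        # Scalar-context /g records pos($string); \G anchors the next match there.
        return unless $string =~ /\G($pattern)/g;
        push @pieces, $1;
    }
    return @pieces;
}

my @pieces = match_in_sequence(
    "baEodamA",
    qr/b[ao]*E[ao]*d[ao]*/,
    qr/m[ao]*A[ao]*/,
);
print "@pieces\n";   # prints "baEoda mA"
```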

Once I go through each of the tokens and find which ones match all the patterns, I can extract the original string from the tuple.

    use v5.14;

    my $vowels = '[ao]*';
    my @subtokens = qw(bEd mA);

    # prepare the subtoken regular expressions
    my @patterns = map {
        my $s = join "$vowels", map quotemeta, (split( // ), '');
        qr/$s/;
    } @subtokens;

    my @tokens = qw( baEodamA mAabaEod baEoda mAbaEoda );

    my @grand_matches;
    TOKEN: foreach my $token ( @tokens ) {
        say "-------\nMatching $token..........";
        my @matches;
        PATTERN: foreach my $pattern ( @patterns ) {
            say "Position is ", pos($token) // 0;

            # scalar context /g and \G
            next TOKEN unless $token =~ /\G($pattern)/g;
            push @matches, $1;
            say "Matched with $pattern";
        }
        push @grand_matches, [ $token, \@matches ];
    }

    # Now report the original
    foreach my $tuple ( @grand_matches ) {
        say "$tuple->[0] has both fragments: @{$tuple->[1]}";
    }

Now, here is the nice thing about this structure. I have probably guessed wrong about your task. If I have, it is easy to fix without changing the setup. Let's say that the subtokens do not have to be in order. This is an easy change to the pattern I created. I just get rid of \G and the /g flag:

  next TOKEN unless $token =~ /($pattern)/;

Or, suppose the tokens should be in order, but there may be other things between them. I can insert .*? to match this material, actually skipping it:

  next TOKEN unless $token =~ /\G.*?($pattern)/g;

It would be much better if I could do all this from the map where I created the patterns, but the /g flag is not a pattern flag. It must go with the operator.
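To illustrate that last point with a short sketch (the helper name count_matches and the sample pattern are illustrative, not from the answer): a flag such as /i can be stored inside a qr// pattern, whereas /g has to be written on the match operator that uses the pattern.

```perl
use strict;
use warnings;

# The compiled pattern can carry flags such as /i, but not /g.
my $pattern = qr/[ao]/i;

# Hypothetical helper: count matches by putting /g on the m// operator.
sub count_matches {
    my ($text, $rx) = @_;
    my $count = 0;
    # Each scalar-context /g match resumes where the previous one ended.
    $count++ while $text =~ /$rx/g;
    return $count;
}

print count_matches("baEodamA", $pattern), "\n";   # prints 4
```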

It’s much easier for me to manage changing requirements when I don’t wrap everything in one regular expression.

0

brian d foy Dec 14 '11 at 12:59


Assuming tokens should appear in order and without anything (other than a vowel) between them:

    my @tokens = ( "bEd", "mA" );
    my $vowelizedword = "baEodamaA";
    my $vowels = '[ao]';
    my (@vowelized_sequences) = $vowelizedword =~ (
        '^'
        . join( '', map "(" . join( $vowels, split( //, $_ ) ) . "(?:$vowels)?)", @tokens )
        . '\\z'
    );
    print for @vowelized_sequences;

-1

ysth Dec 14 '11 at 10:32


holygeek · Accepted Answer · 2011-12-14T10:24:12+0000

It seems that you are after something like the following:

    #!/usr/bin/env perl
    use warnings;
    use strict;

    my @tokens = ('bEd', 'mA');
    my $vowelizedword = "beEodamaA";

    my @regex = map { join('.?', split //) . '.?' } @tokens;
    my $regex = join('|', @regex);
    $regex = qr/($regex)/;

    while (my ($matched) = $vowelizedword =~ $regex) {
        $vowelizedword =~ s{$regex}{};
        print "matched $matched\n";
    }

Update according to your updated question (vowels are optional). It works from the end of the string, so you will need to collect the tokens into an array and print them in reverse order:

    #!/usr/bin/env perl
    use warnings;
    use strict;

    my @tokens = ('bEd', 'mA', 'Al', 'ywm');
    # Caveat: Without the space it won't work.
    my $vowelizedword = "beEodamaA Alyawmi";

    my @regex = map { join('.?', split //) . '.?$' } @tokens;
    my $regex = join('|', @regex);
    $regex = qr/($regex)/;

    while (my ($matched) = $vowelizedword =~ $regex) {
        $vowelizedword =~ s{$regex}{};
        print "matched $matched\n";
    }
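Because '.?' only covers a single character, two-character vowel marks need a different pattern. One possible sketch, not the accepted method, builds an alternation from the vowel list in the question's edit and walks the tokens front to back. The helper name vowelized_tokens is hypothetical, the two-character marks are assumed to be spelled '~a', '~i' and so on without a space, and at most one vowel mark is assumed after each token character.

```perl
use strict;
use warnings;

# Vowel marks from the question's edit (spelling without spaces is an assumption).
my @vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');

# Longest alternatives first, so '~a' is tried before '~'.
my $vowel_rx = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @vowels;

# Hypothetical helper: return the vowelized form of each token, in order,
# or an empty list if the tokens do not line up with the word.
sub vowelized_tokens {
    my ($word, @tokens) = @_;
    my @result;
    for my $token (@tokens) {
        # Allow one optional vowel mark after every character of the token.
        my $rx = join '', map { quotemeta($_) . "(?:$vowel_rx)?" } split //, $token;
        return unless $word =~ /\G($rx)/g;
        push @result, $1;
    }
    return @result;
}

print join(', ', vowelized_tokens('Alyawmi', 'Al', 'ywm')), "\n";    # Al, yawmi
print join(', ', vowelized_tokens('baEodamaA', 'bEd', 'mA')), "\n";  # baEoda, maA
```

Since the tokens are matched in their given order, the results come out front to back and no reversal or separating space is needed.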
