Regex: match word groups and parts of previously matched word groups

Question

I am trying to match groups of words in a text. Basically I want every word with 4 or more characters; every group of 2 words where the first word has 4 or more characters and the second has 3 or more; and every group of 3 words where the first word has 4 or more characters and the second and third have 3 or more each.

My problem is that every regular expression I have tried matches a given part of the text only once, whereas I would like to get all matches.

For example, given this text: "This is an example text to explain the problem I am having with the regular expression"

It should return an array with the following values:

 This, example, text, explain, problem, having, with, regular, expression
 example text, explain the, having with, with the, regular expression
 explain the problem, having with the, with the regular

I have tried both a single regular expression and separate ones, but the problem remains that each part of the string is matched only once. For example, if I try the following regular expression:

 /\b(\w{4,}\s\w{3,}\s\w{3,})\b/

It should match

 having with the, with the regular

I also tried

 /\b(?<triple>(?<double>(?<single>\w{4,})(\s\w{3,})?)(\s\w{3,})?)\b/

Which also only matches

 This, example, explain, having, regular
 example text, explain the, having with, regular expression
 explain the problem, having with the

Does anyone know a better way to solve this?

+4

php regex

riekelt Aug 30 '13 at 10:26

2 answers

This question sounds interesting. I don't know PHP, but I decided to challenge myself to solve it with Python, which I'm more used to.

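 import regex  # third-party module (pip install regex), not the built-in re

 s = r"This is an example text to explain the problem I am having with the regular expression"

 # \m anchors at the start of a word; (?|...) is a branch reset, so every
 # alternative numbers its groups from 1: (triple, double, single).
 [elem
  for t in regex.findall(
      r'\m(?|(((\w{4,})\W+\w{3,})\W+\w{3,})|((\w{4,})\W+\w{3,})|(\w{4,}))',
      s, overlapped=True)
  for elem in t if elem != '']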

I used the regex module and its overlapped option, which starts searching for the next match at the character after the start of the current one. The regular expression returns tuples like these:

 [('This', '', ''), ('example text', 'example', ''), ('text', '', ''), ('explain the problem', 'explain the', 'explain'), ('problem', '', ''), ('having with the', 'having with', 'having'), ('with the regular', 'with the', 'with'), ('regular expression', 'regular', ''), ('expression', '', '')]

So, from there I do another loop to extract those fields that are not empty, which gives:

 ['This', 'example text', 'example', 'text', 'explain the problem', 'explain the', 'explain', 'problem', 'having with the', 'having with', 'having', 'with the regular', 'with the', 'with', 'regular expression', 'regular', 'expression']
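
If the third-party regex module is not available, a similar result can be sketched with the standard re module by wrapping the whole pattern in a lookahead. This is only a sketch under stated assumptions: the names PATTERN and overlapping_groups are illustrative, the pattern uses \s+ between words instead of \W+, and duplicates left behind by the optional parts are filtered out in Python rather than by the regex itself.

```python
import re

text = ("This is an example text to explain the problem "
        "I am having with the regular expression")

# Everything sits inside a lookahead, so each match is zero-width and the
# scan can restart at the very next word. The nested groups capture, from
# the outside in, the 3-word, 2-word and 1-word prefix at the current word.
PATTERN = re.compile(
    r"\b(?=(((\w{4,})\b(?:\s+\w{3,}\b)?)(?:\s+\w{3,}\b)?))"
)

def overlapping_groups(s):
    results = []
    for m in PATTERN.finditer(s):
        # When an optional part does not match, the longer group is equal
        # to the shorter one, so keep each distinct prefix only once.
        distinct = []
        for candidate in m.groups():  # (triple, double, single)
            if candidate not in distinct:
                distinct.append(candidate)
        results.extend(distinct)
    return results

print(overlapping_groups(text))
```

For the sample sentence this produces the same 17 values, in the same order, as the list above.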

0

Birei Aug 30 '13 at 12:15

cmbuckley · Accepted Answer · 2013-08-30T10:45:14+0000

The problem is that you want to capture overlapping patterns (for example, "having with" and "with the"). You can do that with a sneaky lookahead. I have not yet managed to combine everything into one regex with this method, but you could do something like this:

 $text = 'This is an example text to explain the problem I am having with the regular expression';

 // Single words are matched directly.
 preg_match_all('/\b(\w{4,})\b/', $text, $matches1);
 // Word groups are captured inside a lookahead, so nothing is consumed
 // and overlapping groups can all be found.
 preg_match_all('/\b(?=(\w{4,}\s+\w{3,}))\b/', $text, $matches2);
 preg_match_all('/\b(?=(\w{4,}\s+\w{3,}\s+\w{3,}))\b/', $text, $matches3);

 var_dump(array_merge($matches1[1], $matches2[1], $matches3[1]));
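
The lookahead approach is not specific to PHP. As a rough illustration, here is a minimal Python sketch of the same three-pass idea using the standard re module. The names patterns and word_groups are made up for this example, and the \b placed inside each lookahead is an added assumption to keep captures on whole words.

```python
import re

text = ("This is an example text to explain the problem "
        "I am having with the regular expression")

# One pattern per group size. The words are captured inside a lookahead, so
# the match itself is zero-width and later matches may overlap earlier ones.
patterns = [
    r"\b(?=(\w{4,})\b)",
    r"\b(?=(\w{4,}\s+\w{3,})\b)",
    r"\b(?=(\w{4,}\s+\w{3,}\s+\w{3,})\b)",
]

def word_groups(s):
    found = []
    for pattern in patterns:
        # group(1) is the text captured inside the lookahead.
        found.extend(m.group(1) for m in re.finditer(pattern, s))
    return found

print(word_groups(text))
```

On the sample sentence this yields the nine single words first, then the five two-word groups, then the three three-word groups, which matches the array requested in the question.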
