Python look-behind regex "fixed-width error" when looking for consecutive repeated words

Question

I have text with words separated by dots (.), containing runs of 2 and 3 consecutive repeated words:

 My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die

I need to match the doubles and the triples independently with a regular expression, so that the match for doubles excludes pairs that are part of a triple.

Since there are at most 3 consecutive repeated words, this

r'\b(\w+)\.+\1\.+\1\b'

successfully catches

  father.father.father

However, to catch exactly 2 consecutive repeated words, I need to make sure that neither the next nor the previous word is the same word. I can write the negative lookahead

r'\b(\w+)\.+\1(?!\.+\1)\b'

but my attempts at a negative lookbehind

r'(?<!(\w)\.)\b\1\.+\1\b(?!\.\1)'

either raise a fixed-width error (when I keep the +), or fail in some other way.

How do I fix the negative lookbehind?

+5

python regex regex-lookarounds negative-lookahead

nacho Jul 26 '17 at 18:10


2 answers

I think there may be an easier way to capture what you want without a negative lookbehind:

 import re

 # t holds the dot-separated text from the question
 r = re.compile(r'\b((\w+)\.+\2\.+\2?)\b')
 r.findall(t)
 > [('name.name.', 'name'), ('father.father.father', 'father')]

Just make the third repetition optional.
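
Note that in the pattern above only the third word is optional, not the separator before it, which is why the two-word match keeps a trailing dot ('name.name.'). A minimal sketch of a variant that wraps the separator and the word together in an optional non-capturing group, assuming the sample text from the question is stored in `t`:

```python
import re

# Sample text from the question (assumed to use dots as the only separator)
t = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# (?:\.+\2)? makes the separator and the third word optional together,
# so a two-word run no longer swallows the dot that follows it
r = re.compile(r'\b((\w+)\.+\2(?:\.+\2)?)\b')
matches = r.findall(t)
print(matches)
# [('name.name', 'name'), ('father.father.father', 'father')]
```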

A version that captures any number of repetitions of the same word might look something like this:

 r = re.compile(r'\b((\w+)(\.+\2)\3*)\b')
 r.findall(t)
 > [('name.name', 'name', '.name'), ('father.father.father', 'father', '.father')]
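
Once a whole run is matched, doubles and triples can be told apart by counting the words in each match rather than by lookarounds. The following is only a sketch under that assumption; it uses a slightly simplified pattern with a single capture group, and the names `counts`, `doubles` and `triples` are chosen here for illustration:

```python
import re

t = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# One word followed by one or more ".word" repetitions of the same word
pattern = re.compile(r'\b(\w+)(?:\.+\1)+\b')

counts = []
for m in pattern.finditer(t):
    # Splitting the matched run on dots gives one item per repetition
    n = len(re.split(r'\.+', m.group(0)))
    counts.append((m.group(1), n))

doubles = [word for word, n in counts if n == 2]
triples = [word for word, n in counts if n == 3]

print(counts)   # [('name', 2), ('father', 3)]
print(doubles)  # ['name']
print(triples)  # ['father']
```

Because the run is consumed greedily, a triple is never reported as a double, so no lookbehind is needed.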

+3

joaoricardo000 Jul 26 '17 at 18:18


Jean-François Fabre · Accepted Answer · 2017-07-26T18:18:55+0000

Perhaps regular expressions are not needed at all.

itertools.groupby does the job. It is designed to group consecutive equal elements.

- group the words (after splitting on dots)
- convert each group to a list and build a (value, count) tuple only if its length is > 1

like this:

 import itertools

 s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"
 matches = [(l[0], len(l))
            for l in (list(v) for k, v in itertools.groupby(s.split(".")))
            if len(l) > 1]

result:

 [('name', 2), ('father', 3)]

So basically we can do whatever we want with this list of tuples (for example, filter it by count).
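
As a sketch of such a filtering step, the runs could be bucketed by their length. This assumes single dots as separators, as in `s` above; `by_count` is a name introduced here for illustration only:

```python
from collections import defaultdict
from itertools import groupby

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# Map run length -> list of words repeated that many times in a row
by_count = defaultdict(list)
for word, group in groupby(s.split(".")):
    n = sum(1 for _ in group)  # length of this run of equal words
    if n > 1:
        by_count[n].append(word)

print(dict(by_count))  # {2: ['name'], 3: ['father']}
```

With this layout, `by_count[2]` holds the words repeated exactly twice and `by_count[3]` those repeated three times.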

Bonus (I first misunderstood the question, so I am leaving this in): to remove the duplicates from the sentence:

- group by words (after splitting on dots), as described above
- keep only the key of each group in the list comprehension (the grouped values are not needed, since we are not counting them)
- join with a dot

In one line (still using itertools):

 new_s = ".".join([k for k,_ in itertools.groupby(s.split("."))])

result:

 My.name.is.Inigo.Montoya.You.killed.my.father.Prepare.to.die
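
If a regular expression is still preferred for this clean-up, a rough equivalent with re.sub might look like the following. It is only a sketch and not part of the groupby approach; `deduped` is an illustrative name:

```python
import re

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# Replace each run "word.word(.word...)" with a single copy of the word
deduped = re.sub(r'\b(\w+)(?:\.+\1\b)+', r'\1', s)

print(deduped)
# My.name.is.Inigo.Montoya.You.killed.my.father.Prepare.to.die
```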
