Python look-behind regex "fixed-width error" when looking for consecutive repeated words

I have text with words separated by dots (.), with instances of 2 and 3 consecutive repeated words:

  My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die-

I need to match the doubles and the triples independently with a regular expression, so that a pair inside a triple is not also reported as a double.

Since there are at most 3 consecutive repeated words, this

r'\b(\w+)\.+\1\.+\1\b'

successfully catches

  father.father.father 

However, in order to catch 2 consecutive repeated words, I need to make sure that the next and previous words are not the same. I can do a negative look-ahead

r'\b(\w+)\.+\1(?!\.+\1)\b'

but my attempts at a negative look-behind

r'(?<!(\w)\.)\b\1\.+\1\b(?!\.\1)'

either return a fixed-width error (when I keep the + inside the look-behind), or some other problem.

How do I fix the negative look-behind?
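For reference, a minimal sketch of both symptoms. It assumes the sample sentence without the trailing dash, and the second pattern is an assumed variant of the attempt above with `+` kept inside the look-behind (that variant is not shown verbatim in the question).

```python
import re

t = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# Look-ahead only: nothing stops a match from starting at the
# second word of a triple, so the tail of the triple is reported as a pair.
lookahead_only = re.compile(r'\b(\w+)\.+\1(?!\.+\1)\b')
pairs = [m.group(0) for m in lookahead_only.finditer(t)]
print(pairs)  # ['name.name', 'father.father']

# Assumed variant of the look-behind attempt, with `+` kept inside it.
# Python's re module rejects variable-width look-behind at compile time.
try:
    re.compile(r'(?<!(\w+)\.)\b\1\.+\1\b(?!\.\1)')
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)
```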

2 answers

Perhaps regular expressions are not needed at all.

Using itertools.groupby does the job. It is designed to group equal occurrences of consecutive elements.

  • group the words (after splitting on dots)
  • convert each group to a list and build a (word, count) tuple, keeping it only if the length is > 1

like this:

 import itertools

 s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"
 # group consecutive equal words; keep (word, count) only for runs longer than 1
 matches = [(l[0], len(l)) for l in (list(v) for k, v in itertools.groupby(s.split("."))) if len(l) > 1]

result:

 [('name', 2), ('father', 3)] 

So basically we can do whatever we want with this list of tuples (for example, filter it by the number of repetitions).
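For instance, filtering on the count separates the doubles from the triples, which is what the question asks for. A minimal sketch; the names `runs`, `doubles` and `triples` are made up for illustration:

```python
import itertools

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# (word, run_length) for every run of consecutive equal words
runs = [(word, len(list(group))) for word, group in itertools.groupby(s.split("."))]

# Keep only the runs of the wanted length
doubles = [word for word, n in runs if n == 2]
triples = [word for word, n in runs if n == 3]

print(doubles)  # ['name']
print(triples)  # ['father']
```

Note that the comparison is case-sensitive, so "My" and "my" would count as different words.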

Bonus (I misread the question at first, so I am leaving this in): to remove the duplicates from the sentence

  • group the words (after splitting on dots), as above
  • take only the key of each group in a list comprehension (the grouped values are not needed, since we are not counting)
  • join back with a dot

In one line (still using itertools):

 new_s = ".".join([k for k,_ in itertools.groupby(s.split("."))]) 

result:

 My.name.is.Inigo.Montoya.You.killed.my.father.Prepare.to.die 

I think there may be an easier way to capture what you want without the negative look-behind:

 import re

 # t is the input text from the question
 r = re.compile(r'\b((\w+)\.+\2\.+\2?)\b')
 r.findall(t)
 # result: [('name.name.', 'name'), ('father.father.father', 'father')]

Just make the third repetition optional.


A version that captures any number of repetitions of the same word could look something like this:

 r = re.compile(r'\b((\w+)(\.+\2)\3*)\b')
 r.findall(t)
 # result: [('name.name', 'name', '.name'), ('father.father.father', 'father', '.father')]
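If the number of repetitions matters (for example, to tell doubles from triples), it can be recovered from the first group. A rough sketch, assuming `t` holds the sentence from the question without the trailing dash; `run_re` and `runs` are names made up for illustration:

```python
import re

t = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# Same pattern as above: group 1 is the whole run, group 2 the repeated word
run_re = re.compile(r'\b((\w+)(\.+\2)\3*)\b')

runs = []
for m in run_re.finditer(t):
    # Count the words in the run by splitting on dots and dropping empty pieces
    count = len([w for w in m.group(1).split('.') if w])
    runs.append((m.group(2), count))

print(runs)  # [('name', 2), ('father', 3)]
```

Because `\3*` repeats the exact text of the third group, this only extends a run when every repetition is separated by the same number of dots.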

Source: https://habr.com/ru/post/1270241/

