Python regular expressions extracting longest overlapping groups

Question


How can I extract the longest of several groups that start the same way?

For example, from this line I want to extract the longest match of either CS or CSI.

I tried "(CS|CSI).*", and it returns CS, not CSI, even when CSI is available.

If I use "(CSI|CS).*" instead, I get CSI when it matches, so I think the solution is to always put the shorter overlapping alternatives after the longer one.

Is there a clearer way to express this with re? It seems confusing that the result depends on the order in which the alternatives are listed.

+3

python regex

user265454 May 14, '10 at 15:09


4 answers

One option is to build the alternation from every prefix of the string you want to match, longest first:

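import re

string_to_look_in = "AUHDASOHDCSIAAOSLINDASOI"
string_to_match = "CSIABC"

# Every prefix of string_to_match, longest first: CSIABC, CSIAB, ..., C
prefixes = [string_to_match[:i] for i in range(len(string_to_match), 0, -1)]

# Escape each prefix so regex metacharacters are treated as literals
re_to_use = "(" + "|".join(re.escape(p) for p in prefixes) + ")"

re_result = re.search(re_to_use, string_to_look_in)

# re_result is None if not even the first character occurs in the input
print(re_result.group())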

0

sjh May 14, '10 at 15:41

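Vim has dedicated syntax for this: the pattern col\%[umn] matches as much of the optional sequence as is present, so it matches col in color, colum in columbus, and column in column.

As far as I know, Python's re has nothing like that, but you can get the same effect by nesting groups with the ? quantifier: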

>>> import re
>>> words = ['color', 'columbus', 'column']
>>> rex = re.compile(r'col(?:u(?:m(?:n)?)?)?')
>>> for w in words: print(rex.findall(w))
...
['col']
['colum']
['column']
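
Writing the nested groups by hand gets tedious for longer words, so the pattern can also be generated. The following is only a sketch: the helper name optional_tail_pattern and its two parameters are made up for illustration and are not part of re.

```python
import re

def optional_tail_pattern(required, optional):
    """Return a pattern matching `required` plus as much of `optional` as is present."""
    pattern = ""
    # Build from the innermost character outwards:
    # (?:n)?  ->  (?:m(?:n)?)?  ->  (?:u(?:m(?:n)?)?)?
    for ch in reversed(optional):
        pattern = "(?:" + re.escape(ch) + pattern + ")?"
    return re.escape(required) + pattern

rex = re.compile(optional_tail_pattern("col", "umn"))
matches = [rex.match(w).group() for w in ["color", "columbus", "column"]]
print(matches)
```

For the inputs "col" and "umn" this generates the same pattern as the hand-written one above.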

0

mykhal May 14, '10 at 15:52

As Alan says, the patterns will be matched in the order in which you specify them.

If you want to match the longest of several overlapping literal strings, the longest one has to come first. But you can sort your strings from longest to shortest automatically if you want:

>>> '|'.join(sorted('cs csi miami vice'.split(), key=len, reverse=True))
'miami|vice|csi|cs'
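
To apply the sorted alternation to the CS/CSI case from the question, it can be wrapped in a small function. This is a sketch under assumptions: the name longest_first_pattern and the sample inputs are invented for illustration, and re.escape is added so that alternatives containing regex metacharacters are still treated as literals.

```python
import re

def longest_first_pattern(words):
    """Join literal alternatives so that longer ones are tried before shorter ones."""
    ordered = sorted(words, key=len, reverse=True)
    return "(" + "|".join(re.escape(w) for w in ordered) + ")"

pattern = longest_first_pattern(["CS", "CSI"])
print(pattern)

# Hypothetical sample inputs
with_csi = re.search(pattern, "xxCSIxx").group(1)
with_cs = re.search(pattern, "xxCSxx").group(1)
print(with_csi, with_cs)
```

Because the sort happens inside the function, the caller can list the alternatives in any order.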

0

Matt Anderson May 14, '10 at 15:56

Alan Moore · Accepted Answer · 2010-05-14T15:47:41+0000

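No, that is just how alternation works, at least in Perl-derived regex flavors such as Python, JavaScript, .NET, etc.: the alternatives are tried in the order in which they are listed.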

http://www.regular-expressions.info/alternation.html
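
A short sketch of this behaviour in Python, using a made-up sample string. The last pattern, CSI?, is not taken from the answers here. It is one possible way to avoid the ordering question for this particular pair, because the ? quantifier is greedy and takes the I whenever it is present.

```python
import re

text = "CSI: Miami"  # hypothetical sample input

# Alternatives are tried left to right; the first one that lets the
# overall pattern succeed wins, even if a later one would match more.
short_first = re.match(r"(CS|CSI).*", text).group(1)
long_first = re.match(r"(CSI|CS).*", text).group(1)

# Optional last character instead of an alternation
optional_i = re.match(r"(CSI?).*", text).group(1)

print(short_first, long_first, optional_i)
```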
