Python regular expressions: extracting the longest of overlapping groups

How can I extract the longest of several alternatives that start the same way?

For example, from a line I want to extract the longest match of either CS or CSI.

I tried "(CS|CSI).*", and it returns CS, not CSI, even when CSI is available.

If I use "(CSI|CS).*" instead, I get CSI when it matches, so I think the solution is to always put shorter overlapping alternatives after the longer ones.

Is there a clearer way to express this with re? It seems confusing that the result depends on the order in which the alternatives are listed.

+3
4 answers

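No, that is simply how alternation works, at least in Perl-derived regex flavors such as Python, JavaScript, .NET, etc.: the alternatives are tried from left to right, and the first one that matches is used.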

http://www.regular-expressions.info/alternation.html
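
To make the ordering effect concrete, here is a minimal sketch using the two patterns from the question. The helper name first_group and the sample line "CSI Miami" are made up for illustration and are not from the original post.

```python
import re

def first_group(pattern, text):
    """Return group 1 of a match anchored at the start of text, or None."""
    m = re.match(pattern, text)
    return m.group(1) if m else None

line = "CSI Miami"  # hypothetical sample input

# Shorter alternative listed first: CS succeeds, so CSI is never tried.
shorter_first = first_group(r"(CS|CSI).*", line)
# Longer alternative listed first: CSI is tried before CS.
longer_first = first_group(r"(CSI|CS).*", line)

print(shorter_first)  # CS
print(longer_first)   # CSI
```

With the longer alternative first, a line that only contains CS still matches, because the engine falls back to the second alternative when CSI fails.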

+3

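If the alternatives are all prefixes of one string, you can build the pattern automatically, longest prefix first: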

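import re

string_to_look_in = "AUHDASOHDCSIAAOSLINDASOI"
string_to_match = "CSIABC"

# Alternate over every prefix of string_to_match, longest first: (CSIABC|CSIAB|...|C)
re_to_use = "(" + "|".join(re.escape(string_to_match[:i]) for i in range(len(string_to_match), 0, -1)) + ")"

re_result = re.search(re_to_use, string_to_look_in)

# re.search returns None when nothing matches; for the strings above this prints CSIA
if re_result:
    print(re_result.group())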
0

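As far as I know, vim has a feature for this (a "sequence of optionally matched atoms"): for example, col\%[umn] matches col in color, colum in columbus, and all of column.

I am not aware that Python's re supports this directly, but you can get the same effect with nested non-capturing groups, each followed by a ? quantifier: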

>>> import re
>>> words = ['color', 'columbus', 'column']
>>> rex = re.compile(r'col(?:u(?:m(?:n)?)?)?')
>>> for w in words: print(rex.findall(w))
...
['col']
['colum']
['column']
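
Writing the nested groups by hand gets tedious for longer words. A small sketch of generating them follows; the helper name optional_tail is hypothetical, and the split into a required stem and an optional tail is an assumption modeled on the vim example.

```python
import re

def optional_tail(stem, tail):
    """Build a pattern matching stem followed by any prefix of tail."""
    pattern = ""
    # Wrap from the last character outward: n -> (?:n)? -> (?:m(?:n)?)? ...
    for ch in reversed(tail):
        pattern = "(?:" + re.escape(ch) + pattern + ")?"
    return re.escape(stem) + pattern

generated = optional_tail("col", "umn")
rex = re.compile(generated)

words = ["color", "columbus", "column"]
results = [rex.findall(w) for w in words]

print(generated)  # col(?:u(?:m(?:n)?)?)?
print(results)    # [['col'], ['colum'], ['column']]
```

Because each ? is greedy, the engine takes as many of the optional characters as are present, which gives the longest available match.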
0

As Alan says, the alternatives are tried in the order in which you specify them.

If you want to match the longest of several overlapping literal strings, the longest has to come first. But you can sort your strings from longest to shortest automatically if you want:

>>> '|'.join(sorted('cs csi miami vice'.split(), key=len, reverse=True))
'miami|vice|csi|cs'
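
Putting the sorted alternatives to work might look like the sketch below. The function name longest_first and the sample text are invented for illustration; re.escape is added on the assumption that the strings are meant as literals and may contain regex metacharacters.

```python
import re

def longest_first(words):
    """Compile an alternation of literal words, longest alternatives first."""
    alternatives = sorted(words, key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in alternatives))

rex = longest_first("cs csi miami vice".split())
found = rex.findall("csi miami, cs vice")

print(rex.pattern)  # miami|vice|csi|cs
print(found)        # ['csi', 'miami', 'cs', 'vice']
```

Sorting by length is enough here because an alternative can only shadow another one that it is a prefix of, and a prefix is always shorter.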
0

Source: https://habr.com/ru/post/1745461/

