Shortest repeating sub-string

I am looking for an efficient way to extract the shortest repeating substring. For instance:

 input1 = 'dabcdbcdbcdd'
 output1 = 'bcd'
 input2 = 'cbabababac'
 output2 = 'ba'

I would appreciate any response or information related to the problem.

In addition, in this post, people suggest that we can use a regular expression like

 re=^(.*?)\1+$ 

to find the smallest repeating pattern in a string. But this expression does not work for me in Python and never matches (I am new to Python and may have missed something).

--- follow up ---

The criterion here is to find the shortest non-overlapping repeating pattern whose length is greater than one and whose repetitions cover the longest total length.

+6
2 answers

A quick fix for this pattern might be

 (.+?)\1+ 

Your regex failed because it anchored the repeating string to the start and end of the line, so it only matches strings like abcabcabc, but not xabcabcabcx. In addition, the minimum length of the repeated string should be 1, not 0 (otherwise any string will match), hence .+? instead of .*?.
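Both points can be checked with a small sketch. The helper name first_unit is made up for this illustration, and the anchored pattern below uses .+? rather than the question's .*? so that only the effect of the anchors is shown:

```python
import re

def first_unit(pattern, s):
    """Return group 1 of the first match of `pattern` in `s`, or None."""
    m = re.search(pattern, s)
    return m.group(1) if m else None

anchored = r"^(.+?)\1+$"    # repetition must span the whole string
unanchored = r"(.+?)\1+"    # repetition may start and end anywhere
lazy_star = r"(.*?)\1+"     # the group is allowed to be empty

print(first_unit(anchored, "abcabcabc"))            # abc
print(first_unit(anchored, "xabcabcabcx"))          # None
print(first_unit(unanchored, "xabcabcabcx"))        # abc
print(repr(first_unit(lazy_star, "xabcabcabcx")))   # ''
```

The last line shows why the group needs at least one character: an empty group followed by repetitions of "nothing" matches at position 0 of any string.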

In Python:

 >>> import re
 >>> r = re.compile(r"(.+?)\1+")
 >>> r.findall("cbabababac")
 ['ba']
 >>> r.findall("dabcdbcdbcdd")
 ['bcd']

But keep in mind that this regex only finds non-overlapping matches, so in the last example the solution d will not be found, although it is the shortest repeating string. Or consider this example, where abcd is not found because the abc part of the first abcd was consumed by the first match:

 >>> r.findall("abcabcdabcd")
 ['abc']

In addition, it can return several matches, so you need to pick the shortest one in a second step:

 >>> r.findall("abcdabcdabcabc")
 ['abcd', 'abc']

A better solution:

To allow the engine to find overlapping matches, use

 (.+?)(?=\1) 

This will find some strings twice or more if they are repeated often enough, but it will certainly find all possible repeating substrings:

 >>> r = re.compile(r"(.+?)(?=\1)")
 >>> r.findall("dabcdbcdbcdd")
 ['bcd', 'bcd', 'd']

Therefore, you should sort the results by length and return the shortest:

 >>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
 'd'

The or [""] part (thanks to JF Sebastian!) ensures that no ValueError is raised if there is no match at all.
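As a cross-check that does not rely on regular expressions, a plain brute-force search can look for the shortest substring that is immediately followed by a copy of itself. This is a different technique from the lookahead regex above, and the function name shortest_adjacent_repeat is made up for this sketch:

```python
def shortest_adjacent_repeat(s):
    """Return the shortest substring directly followed by a copy of itself.

    Candidate lengths are tried in increasing order, so the first hit is a
    shortest one. Returns "" when nothing repeats back to back.
    """
    n = len(s)
    for length in range(1, n // 2 + 1):
        # i is the start of the first copy; the second copy starts at i + length
        for i in range(n - 2 * length + 1):
            if s[i:i + length] == s[i + length:i + 2 * length]:
                return s[i:i + length]
    return ""

print(shortest_adjacent_repeat("dabcdbcdbcdd"))  # d
print(shortest_adjacent_repeat("cbabababac"))    # ba
```

It is slower than the regex on long inputs, but it is easy to reason about and can be used to check the regex results on small strings.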

+14

^ matches at the beginning of the string. In your example, the repeating substrings do not start at the very beginning. The same goes for $. Without ^ and $, the pattern .*? always matches the empty string. Demo:

 import re

 def srp(s):
     # Return the unit of the first repeated run found scanning left to right
     return re.search(r'(.+?)\1+', s).group(1)

 print(srp('dabcdbcdbcdd'))  # -> bcd
 print(srp('cbabababac'))    # -> ba

Note, however, that it does not find the shortest substring.
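One possible reading of the follow-up criterion in the question (units longer than one character, non-overlapping runs, largest total repeated length) can be sketched as follows. The function name best_repeat, the min_len parameter, and the rule of preferring the shorter unit on ties are assumptions made for this illustration, not something stated in the question:

```python
import re

def best_repeat(s, min_len=2):
    """Return the repeated unit whose run covers the most characters.

    Only units of at least `min_len` characters are considered, and runs are
    found left to right without overlapping. Ties on total length go to the
    shorter unit (an assumption). Returns "" if nothing repeats.
    """
    pattern = re.compile(r"(.{%d,}?)\1+" % min_len)
    best_key, best_unit = None, ""
    for m in pattern.finditer(s):
        unit = m.group(1)
        total = len(m.group(0))        # length of the whole repeated run
        key = (-total, len(unit))      # longer run first, then shorter unit
        if best_key is None or key < best_key:
            best_key, best_unit = key, unit
    return best_unit

print(best_repeat("dabcdbcdbcdd"))  # bcd
print(best_repeat("cbabababac"))    # ba
```

Because finditer scans without overlapping, this inherits the limitation described above: a unit whose first copy was consumed by an earlier match will not be reported.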

+3

Source: https://habr.com/ru/post/904476/

