Shortest repeating sub-string

Question

I am looking for an efficient way to extract the shortest repeating substring. For instance:

 input1 = 'dabcdbcdbcdd'
 output1 = 'bcd'
 input2 = 'cbabababac'
 output2 = 'ba'

I would appreciate any response or information related to the problem.

In addition, in this post, people suggest that we can use a regular expression like

 re=^(.*?)\1+$

to find the smallest repeating pattern in a string. But this expression does not work for me in Python and never matches (I am new to Python and may have missed something).

--- follow up ---

The criterion here is the shortest non-overlapping pattern whose length is greater than one and whose repetitions have the longest total length.

+6

python string-matching regex

Tim chen Dec 26 '11 at 8:07

2 answers

^ anchors the match at the beginning of the string. In your examples, the repeating substrings do not start at the very beginning. The same goes for $ at the end. Without ^ and $, the pattern .*? always matches an empty string. Demo:

 import re

 def srp(s):
     # Return the first substring that is immediately followed by a copy of itself.
     return re.search(r'(.+?)\1+', s).group(1)

 print(srp('dabcdbcdbcdd'))  # -> bcd
 print(srp('cbabababac'))    # -> ba

It does not necessarily find the shortest substring, though.
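To see what "shortest" would mean here, a brute-force check can be compared against the regex. This is only a sketch and not part of the answer above: shortest_repeat_bruteforce is a hypothetical helper that tries every length from 1 upward and returns the first substring that is immediately followed by a copy of itself.

```python
def shortest_repeat_bruteforce(s):
    """Return the shortest substring directly followed by a copy of itself,
    or None if no such substring exists (hypothetical helper)."""
    n = len(s)
    # Try lengths in increasing order so the first hit is the shortest.
    for length in range(1, n // 2 + 1):
        for start in range(n - 2 * length + 1):
            chunk = s[start:start + length]
            if chunk == s[start + length:start + 2 * length]:
                return chunk
    return None

print(shortest_repeat_bruteforce('dabcdbcdbcdd'))  # -> d
print(shortest_repeat_bruteforce('cbabababac'))    # -> ba
```

For the first input the regex above reports bcd, while the brute-force check reports d because of the dd at the end of the string.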

+3

jfs Dec 26 '11 at 8:32

Tim pietzcker · Accepted Answer · 2011-12-26T08:11:08+0000

A quick fix for this pattern might be

 (.+?)\1+

Your regex failed because it anchored the repeating substring to the start and end of the string, so it only allows strings like abcabcabc, but not xabcabcabcx. In addition, the minimum length of the repeating substring should be 1, not 0 (otherwise any string will match), hence .+? instead of .*?.
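The effect of the anchors and of .*? versus .+? can be checked directly. The snippet below is a small sketch; first_repeat and the three pattern constants are names made up for this illustration.

```python
import re

def first_repeat(pattern, text):
    """Return group 1 of the first match of `pattern` in `text`, or None."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

ANCHORED = r"^(.+?)\1+$"   # the whole string must consist of repetitions
UNANCHORED = r"(.+?)\1+"   # repetitions may occur anywhere in the string
LAZY_STAR = r"(.*?)\1+"    # lets the group be empty, so it matches trivially

print(first_repeat(ANCHORED, "abcabcabc"))       # -> abc
print(first_repeat(ANCHORED, "xabcabcabcx"))     # -> None
print(first_repeat(UNANCHORED, "xabcabcabcx"))   # -> abc
print(repr(first_repeat(LAZY_STAR, "xyz")))      # -> ''
```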

In Python:

 >>> import re
 >>> r = re.compile(r"(.+?)\1+")
 >>> r.findall("cbabababac")
 ['ba']
 >>> r.findall("dabcdbcdbcdd")
 ['bcd']

Keep in mind, though, that this regex only finds non-overlapping matches, so in the last example the solution d is not found, even though it is the shortest repeating string. Or look at this example, where it cannot find abcd because the abc part of the first abcd was consumed by the first match:

 >>> r.findall("abcabcdabcd")
 ['abc']

In addition, it can return multiple matches, so you need to pick the shortest one in a second step:

 >>> r.findall("abcdabcdabcabc")
 ['abcd', 'abc']
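That second step could look roughly like this. It is a minimal sketch; the function name shortest_nonoverlapping_repeat is an assumption made for the example, and it inherits the limitation that overlapping candidates are missed.

```python
import re

REPEAT = re.compile(r"(.+?)\1+")

def shortest_nonoverlapping_repeat(text):
    """Return the shortest group found by the non-overlapping pattern,
    or None if nothing repeats (hypothetical helper)."""
    candidates = REPEAT.findall(text)
    # min() raises ValueError on an empty list, so guard against no matches.
    return min(candidates, key=len) if candidates else None

print(shortest_nonoverlapping_repeat("abcdabcdabcabc"))  # -> abc
print(shortest_nonoverlapping_repeat("dabcdbcdbcdd"))    # -> bcd
```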

A better solution:

To allow the engine to find overlapping matches, use

 (.+?)(?=\1)

This will find some substrings two or more times if they are repeated often enough, but it will certainly find all possible repeating substrings:

 >>> r = re.compile(r"(.+?)(?=\1)")
 >>> r.findall("dabcdbcdbcdd")
 ['bcd', 'bcd', 'd']

Therefore, you should compare the results by length and return the shortest:

 >>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
 'd'

The or [""] part (thanks to JF Sebastian!) ensures that no ValueError is raised if there is no match at all.
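The follow-up in the question asks for patterns longer than one character. One possible way to express that, offered here as an assumption rather than as part of the answer above, is to replace .+? with the bounded lazy quantifier .{n,}?. The function shortest_repeat and its min_len parameter are hypothetical names for this sketch.

```python
import re

def shortest_repeat(text, min_len=1):
    """Return the shortest substring of at least `min_len` characters that is
    immediately followed by a copy of itself, or "" if there is none."""
    # (.{n,}?) lazily captures n or more characters; (?=\1) checks for the
    # copy without consuming it, so the next match can start at that copy.
    pattern = re.compile(r"(.{%d,}?)(?=\1)" % min_len)
    return min(pattern.findall(text) or [""], key=len)

print(shortest_repeat("dabcdbcdbcdd"))             # -> d
print(shortest_repeat("dabcdbcdbcdd", min_len=2))  # -> bcd
print(shortest_repeat("cbabababac", min_len=2))    # -> ba
```

With min_len=2 both inputs from the question give the expected outputs. The sketch does not address the "longest total length" part of the criterion.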
