Search for duplicate substrings

Suppose I have an arbitrary string like this:

hello hello hello I am I am I am your string string string string of strings 

Is it possible to find the repeated substrings that are separated by spaces (EDIT)? In this case, they would be "hello", "I am" and "string".

I have been interested in this for a while, but I still cannot find a real solution. I have read some articles on the topic and came across suffix trees, but would they help here, given that I need to find every repetition, for example every one with a repeat count above two?

If so, is there a Python library that can build suffix trees and perform operations on them?

Edit: Sorry that I was not clear enough. To be clear, I'm looking for repeated substrings, meaning sequences in the string that, in regular-expression terms, could be collapsed with the + or {} quantifiers. So if I had to write a regular expression for the string above, it would be:

 (hello ){3}(I am ){3}your (string ){4}of strings 
1 answer

To find a sequence of two or more characters that repeats two or more times, with the repetitions separated by whitespace, use:

 (.{2,}?)(?:\s+\1)+ 

Here is a working example with your test string: http://bit.ly/17cKX62
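The pattern above can also be tried locally. The following is a minimal Python sketch, assuming the standard re module; find_repeats is a hypothetical helper name, not something from the question or from any library. It returns each repeated unit together with the number of times it occurs in its run.

```python
import re

# Pattern from the answer: a lazily matched unit of two or more characters,
# followed by one or more whitespace-separated copies of itself.
REPEAT = re.compile(r"(.{2,}?)(?:\s+\1)+")


def find_repeats(text):
    """Return (unit, count) pairs, one per repeated run (hypothetical helper)."""
    results = []
    for match in REPEAT.finditer(text):
        unit = match.group(1)
        # Count the copies of the unit inside the whole matched run.
        count = len(re.findall(re.escape(unit), match.group(0)))
        results.append((unit, count))
    return results


text = "hello hello hello I am I am I am your string string string string of strings"
print(find_repeats(text))
# [('hello', 3), ('I am', 3), ('string', 4)]
```

Note that the pattern is not anchored at word boundaries, so an input such as "this is" yields the unit "is" (from "is is"). If whole words are required, adding \b around the captured unit is one option to consider.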

EDIT: made the quantifier in the capture group reluctant by adding ? so that it matches the shortest repeating unit (that is, it now captures "string" rather than "string string").

EDIT 2: Added a required whitespace delimiter for cleaner results.
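To get from the matches to the compressed form shown in the question's edit, one possible approach is to substitute each run with a quantified group. This is only a sketch under some assumptions: compress is a hypothetical helper name, the units are assumed to contain no regex metacharacters (they are not escaped), and every run is assumed to be followed by whitespace, as in the test string.

```python
import re

# Same idea as the answer's pattern, but the repeated tail is captured
# separately and any whitespace after the run is consumed as well.
RUN = re.compile(r"(.{2,}?)((?:\s+\1)+)\s*")


def compress(text):
    """Rewrite each repeated run as '(unit ){n}' (hypothetical helper)."""

    def replace(match):
        unit = match.group(1)
        # One leading copy plus however many copies follow it.
        count = 1 + len(re.findall(re.escape(unit), match.group(2)))
        # Assumption: the unit has no regex metacharacters, so it is not escaped.
        return "(" + unit + " ){" + str(count) + "}"

    return RUN.sub(replace, text)


text = "hello hello hello I am I am I am your string string string string of strings"
print(compress(text))
# (hello ){3}(I am ){3}your (string ){4}of strings
```

A run at the very end of the input would still be written with a trailing space inside the group, so the result would need adjusting for that case.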


Source: https://habr.com/ru/post/952887/

