Efficiently find a given subsequence in a string, maximizing the number of contiguous characters

Long problem description

Fuzzy string utilities such as fzf or CtrlP filter a list of strings, keeping those that have a given search string as a subsequence. As an example, suppose a user wants to find a specific photo in a list of files. To find the file

/home/user/photos/2016/pyongyang_photo1.png

it is enough to type ph2016png, because this search string is a subsequence of the file name. (Note that this is not LCS: the entire search string must be a subsequence of the file name.)
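The subsequence test described above can be sketched in a few lines. This is only an illustration; the function name `is_subsequence` is hypothetical and not taken from fzf or CtrlP.

```python
def is_subsequence(needle, haystack):
    """Return True if every character of needle occurs in haystack, in order."""
    remaining = iter(haystack)
    # `ch in remaining` consumes the iterator up to and including the first
    # match, so each needle character must be found after the previous one.
    return all(ch in remaining for ch in needle)


path = "/home/user/photos/2016/pyongyang_photo1.png"
print(is_subsequence("ph2016png", path))
print(is_subsequence("png2016", path))
```

The check runs in time linear in the length of the file name, which is why the filtering step itself is cheap.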

It is trivial to check whether a given search string is a subsequence of another string, but I wonder how to find the best match efficiently: in the above example, there are several possible matches. One of them is

/home/user/photos/2016/pyongyang_photo1.png

but the one the user probably meant matches "ph" in "photos", then "2016", then the final "png":

/home/user/photos/2016/pyongyang_photo1.png

To formalize this, I would define a “best” match as one that consists of the fewest substrings. This number is 5 for the first match of the example and 3 for the second.
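To make the "fewest substrings" measure concrete, here is a minimal sketch that counts the contiguous runs in one particular match, given the positions in the file name that the match uses. The function name `count_segments` and the index-list representation are assumptions made for illustration.

```python
def count_segments(indices):
    """Count contiguous runs in a strictly increasing list of match positions."""
    if not indices:
        return 0
    # A new segment starts wherever a position does not directly follow
    # the previous one.
    breaks = sum(1 for prev, cur in zip(indices, indices[1:]) if cur != prev + 1)
    return 1 + breaks


path = "/home/user/photos/2016/pyongyang_photo1.png"
# Positions of "ph" in "photos", of "2016", and of the final "png".
match = [11, 12, 18, 19, 20, 21, 40, 41, 42]
print("".join(path[i] for i in match))
print(count_segments(match))
```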

I came up with this because it would be interesting to find the best match in order to assign a score to each result, for sorting. I am not interested in approximate solutions; my interest in this problem is primarily academic.

tl;dr problem description

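Given a string s and a string t, find, among the subsequences of t that are equal to s, one that maximizes the number of characters that are contiguous in t.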

,

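Call the search string s and the string to test t, and denote the solution fuzzy(s, t). I use Python slice notation below. The approach: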

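Since any solution uses all characters of s in order, start with the first occurrence of s[0] in t (at index i), and take the better of these two: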

t[:i+1] + fuzzy(s[1:], t[i+1:])    # Use the character
t[:i]   + fuzzy(s,     t[i+1:])    # Skip it and use the next occurrence
                                   # of s[0] in t instead

, , . En contraire, . ( s[-1] , , .)
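The use-or-skip branching above can be turned into a runnable sketch by memoizing on positions instead of slicing strings. This is not the code from the question: the name `min_segments`, the `adjacent` flag, and the choice to return only the segment count (rather than the match itself) are assumptions made for illustration.

```python
from functools import lru_cache


def min_segments(s, t):
    """Fewest contiguous pieces of t that spell s in order, or None if no match."""
    inf = float("inf")

    @lru_cache(maxsize=None)
    def best(si, ti, adjacent):
        # si: next character of s to match; ti: next usable position in t;
        # adjacent: True if t[ti - 1] was used for s[si - 1].
        if si == len(s):
            return 0
        if ti == len(t):
            return inf
        # Option 1: skip t[ti], which breaks any segment in progress.
        result = best(si, ti + 1, False)
        # Option 2: use t[ti]; it opens a new segment unless it is adjacent.
        if s[si] == t[ti]:
            cost = 0 if adjacent else 1
            result = min(result, cost + best(si + 1, ti + 1, True))
        return result

    answer = best(0, 0, False)
    return None if answer == inf else answer


print(min_segments("ph2016png", "/home/user/photos/2016/pyongyang_photo1.png"))
```

With memoization there are at most 2 * len(s) * len(t) distinct states, so the exponential blow-up of the plain recursion is avoided. The recursion depth grows with len(s) + len(t), so very long strings would need an iterative version.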


→ : ?


, node , .

.

node - , , , , node.

, , , . - . : , , .

, , , node , .

Python . , .

def fuzzy_trincot(haystack, needle, returnSegments = False):
    inf = float('inf')

    def getSolutionAt(node, depth, optimalCount = 2):
        if not depth: # reached end of needle
            node['count'] = 0
            return
        minCount = inf # infinity ensures also that incomplete branches are pruned
        child = node['child']
        i = node['i']+1
        # Optimisation: optimalCount gives the theoretical minimum number of  
        # segments needed for any solution. If we find such case, 
        # there is no need to continue the search.
        while child and minCount > optimalCount:
            # If this node was already evaluated, don't lose time recursing again.
            # It works without this condition, but that is less optimal.
            if 'count' not in child:
                getSolutionAt(child, depth-1, 1)
            count = child['count'] + (i < child['i'])
            if count < minCount:
                minCount = count
            child = child['sibling']
        # Store the results we found in this node, so if ever we come here again,
        # we don't need to recurse the same sub-tree again.
        node['count'] = minCount

    # Preprocessing: build tree
    # A node represents a needle character occurrence in the haystack.
    # A node can have these keys:
    #   i:       index in haystack where needle character occurs
    #   child:   node that represents a match, at the right of this index, 
    #            for the next needle character
    #   sibling: node that represents the next match for this needle character
    #   count:   the least number of additional segments needed for matching the 
    #            remaining needle characters (only; so not counting the segments
    #            already taken at the left)
    root = { 'i': -2, 'child': None, 'sibling': None }
    # Take a short-cut for when needle is a substring of haystack
    if haystack.find(needle) != -1:
        root['count'] = 1
    else:
        parent = root
        leftMostIndex = 0
        rightMostIndex = len(haystack)-len(needle)
        for j, c in enumerate(needle):
            sibling = None
            child = None
            # Use of leftMostIndex is an optimisation; it works without this argument
            i = haystack.find(c, leftMostIndex)
            # Use of rightMostIndex is an optimisation; it works without this test
            while 0 <= i <= rightMostIndex:
                node = { 'i': i, 'child': None, 'sibling': None }
                while parent and parent['i'] < i:
                    parent['child'] = node
                    parent = parent['sibling']
                if sibling: # not first child
                    sibling['sibling'] = node
                else: # first child
                    child = node
                    leftMostIndex = i+1
                sibling = node
                i = haystack.find(c, i+1)
            if not child: return False
            parent = child
            rightMostIndex += 1
        getSolutionAt(root, len(needle))

    count = root['count']
    if not returnSegments:
        return count

    # Use the `returnSegments` option when you need the character content 
    # of the segments instead of only the count. It runs in linear time.

    if count == 1: # Deal with short-cut case 
        return [needle]
    segments = []
    node = root['child']
    i = -2
    start = 0
    for end, c in enumerate(needle):
        i += 1
        # Find best child among siblings
        while (node['count'] > count - (i < node['i'])):
            node = node['sibling']
        if count > node['count']:
            count = node['count']
            if end:
                segments.append(needle[start:end])
                start = end
        i = node['i']
        node = node['child']
    segments.append(needle[start:])
    return segments

Example call:

haystack = "/home/user/photos/2016/pyongyang_photo1.png"
needle = "ph2016png"

print (fuzzy_trincot(haystack, needle))

print (fuzzy_trincot(haystack, needle, True))

Output:

3
['ph', '2016', 'png']

, .


, , , . , . /home/user/photos/2016/pyongyang_photo1.png - , ph2016png - .

( ) , (), .


, . , , ASCII, 256 ( , 128 ).

"ph2016png"
['p'] : 2
['h'] : 1
['2'] : 1
['0'] : 1
['b'] : 0
...
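The per-character counts listed above can be produced with a fixed-size table indexed by character code. A minimal sketch, assuming the input is plain ASCII; the helper name `char_counts` is hypothetical.

```python
def char_counts(query, table_size=256):
    """Count occurrences of each character of query in a table indexed by ord()."""
    counts = [0] * table_size
    for ch in query:
        # Assumption: every character code fits in the table (ASCII input).
        counts[ord(ch)] += 1
    return counts


counts = char_counts("ph2016png")
for ch in "ph20b":
    print([ch], ":", counts[ord(ch)])
```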


, , . , , ( ). , , . , , , . .

"/home/user/photos/2016/pyongyang_photo1.png"
"h", "ph", "2016", "p", "ng", "ng", "ph", "1", "png"
'p' must come before "h", so throw this one away
"ph", "2016", "p", "ng", "ng", "ph", "1", "png"

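The list of substrings above appears to be the maximal runs of file-name characters that also occur in the input. A short sketch under that assumption; the name `allowed_runs` is hypothetical, and the follow-up step of discarding the leading "h" is not implemented here.

```python
from itertools import groupby


def allowed_runs(filename, query):
    """Split filename into maximal runs of characters that occur in query."""
    allowed = set(query)
    return [
        "".join(group)
        for is_allowed, group in groupby(filename, key=lambda ch: ch in allowed)
        if is_allowed
    ]


print(allowed_runs("/home/user/photos/2016/pyongyang_photo1.png", "ph2016png"))
```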

. (, ababa () babaa (input) aba, baba), , . , , .

Since there is no instance of an incomplete match in your example,
let's take something else, made to illustrate the point.
Let's take "babaaababcb" as the filename, and "ababb" as input.
Substrings : "abaaabab", "b"
Longest substring : "abaaabab"

If you keep the beginning of matches
Longest match : "aba"
Slice "abaaabab" into "aba", "aabab"
-> "aba", "aabab", "b"
Retry with "aabab"
-> "aba", "a", "abab", "b"
Retry with "abab" (complete match)

Otherwise (harder to implement, not necessarily better performing, as shown in this example)
Longest match : "abab"
Slice "abaaabab" into "abaa", "abab"
-> "abaa", "abab", "b"
Retry with "abaa"
-> "aba", "a", "abab", "b"
Retry with "abab" (complete match)

, , , , .

With "ph2016png" as input
Longest substring : "2016"
Complete match
Match substrings "h", "ph" with input "ph"
Match substrings "p", "ng", "ng", "ph", "1", "png" with input "png"

, , . , , , .


Source: https://habr.com/ru/post/1654020/
