Minimize cost line alignment with swaps only

Question

Minimize cost line alignment with swaps only

I am working on this problem for my own edification, and I can’t crack it:

Given two lines, which are permutations of the same superset of characters, find the minimum number of swaps needed to align them. Lines are circular (that is, you can change the first and last characters), can contain up to 2000 characters, and swaps can be performed on any line.

I tried several different approaches. One way is to sort the bubbles into two lines and cache all of their intermediate states, and then search for an intermediate state common to both that minimizes the sum of the number of steps for each row to achieve this intermediate state. This did not give the correct number (I have a few examples with the correct answer), and this obviously would not work for very large lines. I'm a little shy right now. I think I may have to use a modified version of the Demerau-Levenshtein distance, but how did you choose the minimum cost operation when only one type of operation is allowed?

Can someone point me in the right direction?

+4

string algorithm alignment dynamic-programming permutation

user1569339 Mar 04 '14 at 18:29

source share

1 answer

ldog · Answer 1 · 2014-03-05T02:39:32+0000

Consider a dynamic programming solution based on the A * algorithm. The idea is to simulate this problem as a graph and find the optimal path to the node target.

First, some basic clarifications. Each node in the graph is a string, and two nodes are neighbors if and only if they differ in one (circular) swap of adjacent characters. Running node is the first line, and node's goal is the second line. Finding the shortest path from start to goal clearly gives you the best solution.

. A * , , , . , , , , A * . , , , . , , , , , . , , -, . , -, (, ).

, , , .

_heuristic, , , , .

. , . , .

#!/usr/bin/python

import random
import string
import copy
from Queue import PriorityQueue as priority_queue

# SWAP AND COPY UTILITY FUNCTION
def swap(c,i,j):
    if j >= len(c):
        j = j % len(c)
    c = list(c)
    c[i], c[j] = c[j], c[i]
    return ''.join(c)

# GIVEN THE GOAL y AND THE CURRENT STATE x COMPUTE THE HEURISTIC DISTANCE
# AS THE MAXIMUM OVER THE MINIMUM NUMBER OF SWAPS FOR EACH CHARACTER TO GET 
# TO ITS GOAL POSITION.
# NOTE: THIS HEURISTIC IS GUARANTEED TO BE AN ADMISSIBLE HEURISTIC, THEREFORE
# A* WILL ALWAYS FIND THE OPTIMAL SOLUTION USING THIS HEURISITC. IT IS HOWEVER
# NOT A STRONG HEURISTIC
def terrible_heuristic(x, y):
    lut = {}
    for i in range(len(y)):
        c = y[i]
        if lut.has_key(c):
            lut[c].append(i)
        else:
            lut[c] = [i]
    longest_swaps = []
    for i in range(len(x)):
        cpos = lut[x[i]]
        longest_swaps.append(min([ min((i-cpos[j])%len(x),(cpos[j]-i)%len(x)) for j in range(len(cpos)) ]))
    return max(longest_swaps)-1

# GIVEN THE GOAL y AND THE CURRENT STATE x COMPUTE THE HEURISTIC DISTANCE
# AS THE SUM OVER THE MINIMUM NUMBER OF SWAPS FOR EACH CHARACTER TO GET 
# TO ITS GOAL POSITION DIVIDED BY SOME CONSTANT. THE LOWER THE CONSTANT
# THE FASTER A* COMPUTES THE SOLUTION, 
# NOTE: THIS HEURISTIC IS CURRENTLY NOT THEORETICALLY JUSTIFIED. A PROOF
# SHOULD BE FORMULATED AND THEORETICAL WORK SHOULD BE DONE TO DISCOVER
# WHAT IS THE MINIMAL CONSTANT ALLOWED FOR THIS HEURISTIC TO BE ADMISSIBLE.
def better_heuristic(x, y):
    lut = {}
    for i in range(len(y)):
        c = y[i]
        if lut.has_key(c):
            lut[c].append(i)
        else:
            lut[c] = [i]
    longest_swaps = []
    for i in range(len(x)):
        cpos = lut[x[i]]
        longest_swaps.append(min([ min((i-cpos[j])%len(x),(cpos[j]-i)%len(x)) for j in range(len(cpos)) ]))
    d = 0.
    for x in longest_swaps:
        d += x-1
    # THE CONSTANT TO DIVIDE THE SUM OF THE MINIMUM SWAPS, 1.5 SEEMS TO BE THE LOWEST
    # ONE CAN CHOOSE BEFORE A* NO LONGER RETURNS CORRECT SOLUTIONS
    constant = 1.5
    d /= constant
    return d


# GET ALL STRINGS ONE CAN FORM BY SWAPPING TWO CHARACTERS ONLY
def ngbs(x):
    n = set() # WE USE SET FOR THE PATHOLOGICAL CASE OF WHEN len(x) = 2
    for i in xrange(len(x)):
        n.add(swap(x,i,i+1))
    return n

# CONVENIENCE WRAPPER AROUND PYTHON priority_queue
class sane_priority_queue(priority_queue):
    def __init__(self):
        priority_queue.__init__(self)
        self.counter = 0

    def put(self, item, priority):
        priority_queue.put(self, (priority, self.counter, item))
        self.counter += 1

    def get(self, *args, **kwargs):
        _, _, item = priority_queue.get(self, *args, **kwargs)
        return item

# AN A* IMPLEMENTATION THAT USES EXPANDING DATA-TYPES BECAUSE OUR FULL SEARCH
# SPACE COULD BE MASSIVE. HEURISTIC FUNCTION CAN BE SPECIFIED AT RUNTIME.
def a_star(x0,goal,heuristic_func=terrible_heuristic):
    visited = set()
    frontier_visited = set()
    frontier = sane_priority_queue()
    distances = {}
    predecessors = {}

    predecessors[x0] = x0
    distances[x0] = 0
    frontier.put(x0,heuristic_func(x0,goal))

    while not frontier.empty():
        current = frontier.get()
        if current == goal:
            print "goal found, distance: ", distances[current], ' nodes explored: ', len(visited)
            return predecessors, distances
        visited.add(current)

        for n in ngbs(current):
            if n in visited:
                continue

            tentative_distance = distances[current] + 1
            if not distances.has_key(n) or tentative_distance < distances[n]:
                predecessors[n] = current
                distances[n] = tentative_distance
                heuristic_distance = tentative_distance + heuristic_func(n,goal)
                frontier.put(n,heuristic_distance)



# SIZE OF STRINGS TO WORK WITH
n = 10

# GENERATE RANDOM STRING
str1 = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(n)])

# RANDOMLY SHUFFLE
str2 = copy.deepcopy(str1)
l = list(str2)
random.shuffle(l)
str2 = ''.join(l)

# PRINT THE STRING FOR VISUAL DISPLAY
print 'str1', str1
print 'str2', str2

# RUN A* WITH THE TERRIBLE HEURISITIC FOR A KNOWN OPTIMAL SOLUTION
print 'a_star with terrible_heuristic:'
predecessors, distances = a_star(str1,str2,terrible_heuristic)
current = str2
while current != predecessors[current]:
    print current
    current = predecessors[current]
print str1

# RUN A* WITH A BETTER HEURISTIC THAT IS NOT JUSTIFIED THEORETICALLY
# TO BE ADMISSIBLE. THE PURPORSE IS TO COMPARE AGAINST THE KNOWN
# ADMISSIBLE HEURISTIC TO SEE EMPIRICALLY WHAT THE LOWEST WE CAN
# GO IS.
print 'a_star with better_heuristic:'
predecessors, distances = a_star(str1,str2,better_heuristic)
current = str2
while current != predecessors[current]:
    print current
    current = predecessors[current]
print str1

Minimize cost line alignment with swaps only

More articles: