What is the best data structure and algorithm for comparing a list of strings?

I want to find the longest possible sequence of words that satisfies the following rules:

  • Each word can be used no more than once.
  • All words are strings.
  • Two strings sa and sb can be combined if the last two characters of sa match the first two characters of sb.

Concatenation is done by overlapping these characters. For instance:

  • sa = "torino"
  • sb = "novara"
  • sa concat sb = "torinovara"
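
The combination rule above could be sketched in Python roughly as follows. The function names are hypothetical, and lower-casing the words and dropping spaces is an assumption (it lets "Torino" match "Novara" and treats "Novi Ligure" as "noviligure").

```python
def normalize(word: str) -> str:
    """Lower-case a word and drop spaces (an assumption, not stated in the question)."""
    return word.replace(" ", "").lower()


def can_combine(sa: str, sb: str, overlap: int = 2) -> bool:
    """True if the last `overlap` characters of sa equal the first `overlap` of sb."""
    a, b = normalize(sa), normalize(sb)
    return len(a) >= overlap and len(b) >= overlap and a[-overlap:] == b[:overlap]


def combine(sa: str, sb: str, overlap: int = 2) -> str:
    """Concatenate sa and sb, writing the shared characters only once."""
    if not can_combine(sa, sb, overlap):
        raise ValueError(f"{sa!r} and {sb!r} do not overlap")
    return normalize(sa) + normalize(sb)[overlap:]


print(combine("torino", "novara"))  # torinovara
```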

For example, I have the following input file "input.txt":

Novara

Torino

Vercelli

Ravenna

Napoli

Livorno

Messania

Novi Ligure

Roma

And the output of the specified file in accordance with the above rules should be:

Torino

Novara

Ravenna

Napoli

Livorno

Novi Ligure

since the longest possible concatenation is:

 torinovaravennapolivornoviligure

Can anyone help me with this? What would be the best data structure for this?

+4
2 answers

This can be represented as a directed graph problem: the nodes are words, and two words are connected by an edge if they overlap (the smallest overlap is chosen to get the longest length). You then find the highest-weight non-intersecting path.
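
A minimal Python sketch of this graph formulation, for illustration only. The names `key`, `build_graph` and `longest_chain` are hypothetical, lower-casing and dropping spaces is an assumption, and every word is assumed to have at least two characters. The search is an exhaustive depth-first search, so it only suits small inputs such as the one in the question.

```python
from collections import defaultdict


def key(word):
    """Normalise for matching: lower-case, spaces removed (an assumption)."""
    return word.replace(" ", "").lower()


def build_graph(words):
    """Map each word index to the indices of the words that may follow it."""
    by_prefix = defaultdict(list)
    for i, w in enumerate(words):
        by_prefix[key(w)[:2]].append(i)
    return {i: [j for j in by_prefix.get(key(w)[-2:], []) if j != i]
            for i, w in enumerate(words)}


def longest_chain(words):
    """Try every simple path and keep the one with the longest concatenation."""
    graph = build_graph(words)
    best_len, best_path = 0, []

    def dfs(i, used, path, length):
        nonlocal best_len, best_path
        if length > best_len:
            best_len, best_path = length, path[:]
        for j in graph[i]:
            if j not in used:
                used.add(j)
                path.append(j)
                # Each added word contributes its length minus the 2 shared characters.
                dfs(j, used, path, length + len(key(words[j])) - 2)
                path.pop()
                used.remove(j)

    for i, w in enumerate(words):
        dfs(i, {i}, [i], len(key(w)))
    return [words[i] for i in best_path]


words = ["Novara", "Torino", "Vercelli", "Ravenna", "Napoli",
         "Livorno", "Messania", "Novi Ligure", "Roma"]
print(longest_chain(words))
```

Each word is a node and each allowed combination is a directed edge, so a chain of words is a path that visits no node twice.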

(Well, actually, you want to expand the graph a bit to handle the start and end of a word. Attach a "start node" with an edge to each word, of weight (word length) / 2. Between two words, the weight is -overlap + (length of the first word + length of the second word) / 2, and between each word and the "end node" the weight is (word length) / 2. It may be easier to double all the weights.)
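
One reading of these weights, sketched in Python: half of each word's length is charged on the edge entering it and half on the edge leaving it, minus the overlap between consecutive words, so the weight of a whole path equals the length of the concatenated string. The sentinel names and function names below are hypothetical, and all weights are doubled to keep them integers.

```python
START, END = "<start>", "<end>"   # hypothetical sentinel nodes
OVERLAP = 2                        # the question's rule: two shared characters


def doubled_weight(u, v):
    """Edge weight from u to v, doubled so every value stays an integer."""
    if u == START:
        return len(v)                        # 2 * (len(v) / 2)
    if v == END:
        return len(u)                        # 2 * (len(u) / 2)
    return len(u) + len(v) - 2 * OVERLAP     # 2 * (-overlap + (len(u) + len(v)) / 2)


def path_weight(chain):
    """Sum the doubled weights along START -> chain[0] -> ... -> chain[-1] -> END."""
    if not chain:
        return 0
    nodes = [START, *chain, END]
    return sum(doubled_weight(u, v) for u, v in zip(nodes, nodes[1:]))


chain = ["torino", "novara", "ravenna"]
joined = "torinovaravenna"
print(path_weight(chain), 2 * len(joined))  # 30 30
```

With this weighting, maximising the path weight is the same as maximising the length of the concatenation.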

https://cstheory.stackexchange.com/questions/3684/max-non-overlapping-path-in-weighted-graph

+5

I would start very simply. Make two vectors of strings, one sorted normally and one sorted by the last two letters. Create an index (a vector of ints) for the second vector that gives each word's position in the first.
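
A rough Python sketch of these two sorted vectors and the index. All names are hypothetical, and the words are assumed to be unique, lower-cased and free of spaces. Binary search (the standard `bisect` module) then finds the possible neighbours of a word without scanning the whole list.

```python
from bisect import bisect_left, bisect_right


def last_two(word):
    return word[-2:]


def build_tables(words):
    """Return (by_start, by_end, index) for a list of unique words."""
    by_start = sorted(words)                   # sorted normally
    by_end = sorted(words, key=last_two)       # sorted by the last two letters
    position = {w: i for i, w in enumerate(by_start)}
    index = [position[w] for w in by_end]      # by_end[k] == by_start[index[k]]
    return by_start, by_end, index


def successors(word, by_start):
    """Words in by_start that begin with the last two letters of `word`."""
    prefix = last_two(word)
    i = bisect_left(by_start, prefix)
    out = []
    while i < len(by_start) and by_start[i].startswith(prefix):
        if by_start[i] != word:
            out.append(by_start[i])
        i += 1
    return out


def predecessors(word, by_end):
    """Words in by_end whose last two letters equal the first two of `word`."""
    ends = [last_two(w) for w in by_end]       # rebuilt on each call, for brevity
    lo, hi = bisect_left(ends, word[:2]), bisect_right(ends, word[:2])
    return [w for w in by_end[lo:hi] if w != word]


words = ["novara", "torino", "vercelli", "ravenna", "napoli",
         "livorno", "messania", "noviligure", "roma"]
by_start, by_end, index = build_tables(words)
print(successors("torino", by_start))    # ['novara', 'noviligure']
print(predecessors("livorno", by_end))
```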

To find the longest chain, first remove the orphans: words that do not match anything at either end. Then build a neighbor-joining tree, which is where you determine which words can possibly reach each other. If you have two or more trees, try the largest tree first.
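
A possible Python sketch of this step. It is an interpretation rather than the answer's exact method: the "trees" are taken to be connected groups of words, with each allowed combination treated as an undirected link, so any valid chain lies inside a single group. The function name is hypothetical and the words are assumed to be lower-cased already.

```python
from collections import defaultdict


def components(words):
    """Drop orphans, then group the remaining words by reachability."""
    starts, ends = defaultdict(set), defaultdict(set)
    for w in words:
        starts[w[:2]].add(w)
        ends[w[-2:]].add(w)

    # Undirected neighbours: words that can follow w or come before w.
    neighbours = {w: (starts.get(w[-2:], set()) | ends.get(w[:2], set())) - {w}
                  for w in words}
    linked = [w for w in words if neighbours[w]]      # orphans are removed here

    seen, groups = set(), []
    for w in linked:
        if w in seen:
            continue
        group, stack = [], [w]
        seen.add(w)
        while stack:                                  # iterative depth-first walk
            cur = stack.pop()
            group.append(cur)
            for n in neighbours[cur]:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        groups.append(sorted(group))
    return sorted(groups, key=len, reverse=True)      # largest group first


words = ["novara", "torino", "vercelli", "ravenna", "napoli",
         "livorno", "messania", "noviligure", "roma"]
print(components(words))
```

For the question's input, "messania" and "roma" are orphans and the other seven words form one group.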

Now, with the tree, your task is to find ends that are rare, bind them to other ends, and repeat. This should give you a fairly good solution. If it uses all the words, you are golden and can skip the other trees. If it does not, you are into a whole range of algorithms to make this efficient.

Some points to consider:

  • If you have 3+ unique ends, you are guaranteed to drop 1+ words. This can be used to prune your attempts while hunting for an answer. Recalculate the unique ends often.
  • An odd number of a given end guarantees that one must be dropped (you get 2 freebies at the ends).
  • Set aside words that can match themselves. You can always add them in last, and otherwise they spoil the math.
  • You may be able to create small self-matching rings. You can treat these as self-matching words, as long as you do not orphan them when you create them. This can make the performance fantastic, but it does not guarantee a perfect solution.
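
The first of these points can be turned into a quick pruning bound. The Python sketch below is one interpretation, not necessarily what the answer intended: a "unique end" is read as a word with no possible successor or no possible predecessor. Such a word can only sit at the very start or very end of a chain, so a single chain can use at most two of them. The function name is hypothetical and the words are assumed to be unique and lower-cased.

```python
def min_dropped(words):
    """Lower bound on how many words any single chain must leave out."""
    starts, ends = {}, {}
    for w in words:
        starts.setdefault(w[:2], set()).add(w)
        ends.setdefault(w[-2:], set()).add(w)

    def has_successor(w):
        return bool(starts.get(w[-2:], set()) - {w})

    def has_predecessor(w):
        return bool(ends.get(w[:2], set()) - {w})

    # A word without a successor can only be last;
    # a word without a predecessor can only be first.
    forced_to_an_end = {w for w in words
                        if not has_successor(w) or not has_predecessor(w)}
    return max(0, len(forced_to_an_end) - 2)   # the two "freebies" at the ends


words = ["novara", "torino", "vercelli", "ravenna", "napoli",
         "livorno", "messania", "noviligure", "roma"]
print(min_dropped(words))  # 3
```

For the question's input the bound is 3, which matches the expected output: six of the nine words are used.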

The search space is of order N!, so for a list of millions of elements an exact answer may be out of reach. Of course, I could be overlooking something.

+1

Source: https://habr.com/ru/post/1333329/

