String algorithm suggestion to find all common string list prefixes

Question

String algorithm suggestion to find all common string list prefixes

What algorithm would you suggest finding the longest common string list prefixes?

I may have lines like:

Call Mike and schedule meeting. Call Lisa Call Adam and ask for quote. Implement new class for iPhone project Implement new class for Rails controller Buy groceries

I want to know the following prefixes:

 "Call " "Implement new class "

I will use Objective-C, so a cocoa turnkey solution will be a plus (albeit not a must).

+6

string algorithm search

cfischer Jul 9 '11 at 11:36

source share

4 answers

Steve jessop · Answer 1 · 2011-07-09T12:06:13+0000

Edit: on clarified issue:

Row sorting
Find the longest common prefix of each adjacent pair
Sort and deduplicate common prefixes, and then remove any string prefix of another.

Actually, step (3) only requires removing any duplicate / prefix of the other, which you could do with trie or something else instead of sorting. In fact, it could be that all this can be done faster with a suitable annotated trie - if you turn on the “count” in each node, then you are definitely looking at nodes with a score of 2+ that do not have children with a score of 2+.

But sorting is built-in, and once you are sorted, you can detect prefixes by looking at adjacent elements, so there may be less effort.

[Original answer:

Just a one-time operation, find the longest common prefix between all lines?

I would do this in terms of the length of the prefix. In pseudocode and in assuming nul-terminated strings:

 prefixlen = strlen(first_string); foreach string in the list { for (i = 0; i < prefixlen; ++i) { if (string[i] != first_string[i]) { prefixlen = i; break; } } if (prefixlen == 0) break; } common_prefix = substring(firststring, 0, prefixlen);

]

omz · Answer 2 · 2011-07-09T11:40:55+0000

You can insert all your lines in trie (aka prefix tree). Then do a trie from the root until you find a node with multiple parents (or just stop inserting rows when you need to add a second child to node).

sehe · Answer 3 · 2011-07-09T11:40:36+0000

It depends on what you are ready to consider the prefix.

I guess the general answer is to create a Trie (possibly a suffix tree) that stores all the lines in an n-ary tree. See http://en.wikipedia.org/wiki/Trie

Depending on your criteria for the "prefix" (for example, n characters), you can select all nodes of rank n in which there are several children.

You will have a list of duplicate prefixes.

Raul · Answer 4 · 2013-12-18T16:53:13+0000

Insert all rows in the Trie data structure.
DFS from the root to find the first node that has more than 1 edge coming out of it.
the path from root to node, calculated in step 2, gives the longest common prefix for the entire rowset.

String algorithm suggestion to find all common string list prefixes

More articles: