String algorithm suggestion to find all common string list prefixes

What algorithm would you suggest finding the longest common string list prefixes?

I may have lines like:

Call Mike and schedule meeting. Call Lisa Call Adam and ask for quote. Implement new class for iPhone project Implement new class for Rails controller Buy groceries 

I want to know the following prefixes:

 "Call " "Implement new class " 

I will use Objective-C, so a cocoa turnkey solution will be a plus (albeit not a must).

+6
source share
4 answers

Edit: on clarified issue:

  • Row sorting
  • Find the longest common prefix of each adjacent pair
  • Sort and deduplicate common prefixes, and then remove any string prefix of another.

Actually, step (3) only requires removing any duplicate / prefix of the other, which you could do with trie or something else instead of sorting. In fact, it could be that all this can be done faster with a suitable annotated trie - if you turn on the โ€œcountโ€ in each node, then you are definitely looking at nodes with a score of 2+ that do not have children with a score of 2+.

But sorting is built-in, and once you are sorted, you can detect prefixes by looking at adjacent elements, so there may be less effort.

[Original answer:

Just a one-time operation, find the longest common prefix between all lines?

I would do this in terms of the length of the prefix. In pseudocode and in assuming nul-terminated strings:

 prefixlen = strlen(first_string); foreach string in the list { for (i = 0; i < prefixlen; ++i) { if (string[i] != first_string[i]) { prefixlen = i; break; } } if (prefixlen == 0) break; } common_prefix = substring(firststring, 0, prefixlen); 

]

+6
source

You can insert all your lines in trie (aka prefix tree). Then do a trie from the root until you find a node with multiple parents (or just stop inserting rows when you need to add a second child to node).

+3
source

It depends on what you are ready to consider the prefix.

I guess the general answer is to create a Trie (possibly a suffix tree) that stores all the lines in an n-ary tree. See http://en.wikipedia.org/wiki/Trie

enter image description here

Depending on your criteria for the "prefix" (for example, n characters), you can select all nodes of rank n in which there are several children.

You will have a list of duplicate prefixes.

+2
source
  • Insert all rows in the Trie data structure.
  • DFS from the root to find the first node that has more than 1 edge coming out of it.
  • the path from root to node, calculated in step 2, gives the longest common prefix for the entire rowset.
0
source

Source: https://habr.com/ru/post/892386/


All Articles