Effective most common suffix algorithm?

I have strings of several GBs, and for each prefix I want to find the 10 most common suffixes. Is there an efficient algorithm for this?

An obvious solution would be:

  • Keep a sorted list of pairs <string, count>.
  • Identify by the size of the binary search for the prefix we are looking for.
  • Find the 10 highest countin this degree.
  • Perhaps it will precompute it for all short prefixes, so it does not need to look at most of the data.

I'm not sure if this will really be effective. Is there a better way I forgot?

Responses should be in real time, but if necessary, the same amount of pre-processing may be required.

+3
source share
1 answer

Put the words in a tree, for example. trie or radix by placing a count of occurrences for each complete word, so you know which nodes are ends and how common they are.

Find the prefix / postfix combo by iteration.

Both of these operations: O (n * k), where k is the length of the longest word; this is as complex as a hash table.

HAT-trie is a promises cache version with high performance.

+6
source

Source: https://habr.com/ru/post/1748799/


All Articles