I have several GB of strings, and for each prefix I want to find the 10 most common suffixes. Is there an efficient algorithm for this?
An obvious solution would be:
- Keep a sorted list of <string, count> pairs.
- Use binary search to find the range of entries that start with the prefix we are looking for.
- Find the 10 highest counts in this range.
- Possibly precompute the result for all short prefixes, so that a query never has to scan a large part of the data.
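The steps above can be sketched roughly as follows. This is only a minimal illustration, assuming the strings fit in memory as two parallel lists (`keys` sorted lexicographically, `counts` aligned with it); the function names `top_suffixes` and `precompute`, and the `max_len` cutoff, are hypothetical and not part of any library.

```python
import bisect
import heapq
from itertools import takewhile


def top_suffixes(keys, counts, prefix, k=10):
    """Return up to k (suffix, count) pairs with the highest counts.

    Assumes keys is sorted and counts[i] belongs to keys[i].
    """
    # All strings sharing a prefix are contiguous in sorted order,
    # so binary search finds where that range starts.
    lo = bisect.bisect_left(keys, prefix)
    # Walk forward only while entries still start with the prefix.
    matching = takewhile(lambda i: keys[i].startswith(prefix),
                         range(lo, len(keys)))
    # Keep the k largest counts using a bounded heap.
    best = heapq.nlargest(k, matching, key=lambda i: counts[i])
    return [(keys[i][len(prefix):], counts[i]) for i in best]


def precompute(keys, counts, max_len=2, k=10):
    """Cache answers for every prefix of length <= max_len (assumed cutoff)."""
    cache = {}
    for key in keys:
        for n in range(1, min(max_len, len(key)) + 1):
            p = key[:n]
            if p not in cache:
                cache[p] = top_suffixes(keys, counts, p, k)
    return cache
```

A query would first look in the cache and fall back to `top_suffixes` for longer prefixes, whose ranges are typically much smaller. The scan is still linear in the size of the matching range, which is the cost the precomputation step tries to avoid for short prefixes.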
I'm not sure this would actually be efficient. Is there a better approach that I have overlooked?
Queries need to be answered in real time, but a large amount of preprocessing is acceptable if necessary.