I have a search engine. The search engine generates results when searching for a keyword. I need to find all other keywords that generate similar results.
For example, the keyword k1 gives a result set R1 = {1, 2, 3, 4, 5, ..., 40}, which contains up to 40 document identifiers. I need to get a list K1 of all the other keywords whose results are similar to those of k1.
The similarity S(R1, R2) between two result sets R1 and R2 is calculated as follows: 2 * (number of elements common to both _R1_ and _R2_) / ((total number of elements in _R1_) + (total number of elements in _R2_)). Example: R1 = {1,2,3} and R2 = {2,3,4,5} gives S(R1, R2) = (2 * |{2,3}|) / (|{1,2,3}| + |{2,3,4,5}|) = (2 * 2) / (3 + 4) = 4/7 ≈ 0.57.
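To make the formula concrete, here is a minimal Python sketch of it, assuming each result set is held as a set of integer document identifiers. The function name `similarity` and the convention of returning 0 for two empty sets are illustrative assumptions, not part of the setup described above.

```python
def similarity(r1: set, r2: set) -> float:
    """Return 2 * |r1 ∩ r2| / (|r1| + |r2|) for two result sets."""
    total = len(r1) + len(r2)
    if total == 0:
        # Assumption: two empty result sets are given similarity 0.
        return 0.0
    return 2 * len(r1 & r2) / total


# The worked example from the text: 4/7, which rounds to 0.57.
print(round(similarity({1, 2, 3}, {2, 3, 4, 5}), 2))
```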
There are over 100,000 keywords, and thus over 100,000 result sets. So far the only solution I have found is the naive O(N^2) one, in which each result set is compared with every other set. It takes a lot of time.
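For reference, the all-pairs approach described above could look roughly like the following Python sketch. It assumes the results are available as a dict mapping each keyword to its set of document identifiers. The names `similar_keywords_bruteforce` and `results`, and the `threshold` parameter that decides what counts as "similar", are hypothetical; no specific cutoff is given in the problem statement.

```python
from itertools import combinations


def similarity(r1: set, r2: set) -> float:
    """Return 2 * |r1 ∩ r2| / (|r1| + |r2|); 0.0 if both sets are empty (assumption)."""
    total = len(r1) + len(r2)
    return 2 * len(r1 & r2) / total if total else 0.0


def similar_keywords_bruteforce(results: dict, threshold: float) -> dict:
    """Compare every pair of result sets once; O(N^2) in the number of keywords."""
    similar = {keyword: [] for keyword in results}
    for (k1, r1), (k2, r2) in combinations(results.items(), 2):
        # threshold is a hypothetical cutoff for "similar enough".
        if similarity(r1, r2) >= threshold:
            similar[k1].append(k2)
            similar[k2].append(k1)
    return similar


# Tiny illustrative input (made-up data, not real search results).
results = {"k1": {1, 2, 3}, "k2": {2, 3, 4, 5}, "k3": {7, 8, 9}}
print(similar_keywords_bruteforce(results, threshold=0.5))
```

With N = 100,000 keywords this loop performs N(N-1)/2, roughly 5 billion, set comparisons, which is where the running time goes.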
Is there anyone with a better idea?
Some similar posts that do not completely solve the problem: