Grouping similar algorithms

I have a search engine. The search engine generates results when searching for a keyword. I need to find all other keywords that generate similar results.

For example, the keyword k1 gives a result set R1 = {1,2,3,4,5, ... 40}, which contains up to 40 document identifiers. And I need to get a list of all the other K1 keywords that generate results similar to what k1 generates.

The similarity S (R1, R2) between two result sets R1 and R2 is calculated as follows: 2 * (number of same elements both in _R1_ and _R2_) / ( (total number of elements in _R1_) + (total number of elements in _R2_) ) Example: R1 = {1,2,3} and R2 = {2,3,4,5} gives S (R1, R2) = (2 * | {2,3} |) / | {1,2,3} | + | {2,3,4,5} | = (2 * 2) / (3 + 4) = 4/7 = 0.57.

There are over 100,000 keywords, thus over 100,000 result sets. So far I have been able to solve this problem not easily, since O (N ^ 2), where each result set was compensated for by each other set. It takes a lot of time.

Is there anyone with a better idea?

Some similar messages that do not completely solve the problem:

+4
source share
3 answers

One question - results in sorted order?

Something that came to mind combined both sets, sorting and finding duplicates. It can be reduced to O (nlogn)

0
source

To keep the problem simple, it is assumed that all keywords have 10 results, and k1 is a keyword for comparison. You remove 9 results from the set of each keyword. Now compare the last result with k1, and the keywords with the same last result are what you want. If a keyword has 1 total result with k1, there is only a 1% chance that it will remain. A keyword with 5 results common with k1 will contain a 25% chance . Perhaps you think that 1% is too large, then you can repeat the process above n times, and a keyword with a total result of 1 will have a 1% chance of staying. Time O (N) .

0
source

Is your similarity criterion fixed, or can we apply a little variety to achieve a faster search engine?

Alternative:

The alternative that came to my mind:

Given your R1 result, you can go through the documents and create a histogram for the other keywords to which these documents will be attached. Then, if the given alternative keyword gets, say, at least # R1 / 2 hits, you specify it as β€œsimilar”.

The big difference is that you do not consider documents that are not in R1 at all.


Exact

If you need a solution according to your requirements, I believe that it would be enough to calculate R2, set only for those keywords that satisfy the above "alternative" criterion. I think (mathematical proof is necessary!), That if the "alternative" criterion is not met, there is no chance that yours will be.

0
source

Source: https://habr.com/ru/post/1341773/


All Articles