Algorithm for learning an order from lists (ideally in Java)

I have a collection of ordered lists, many of which contain the same elements. I want to find the most likely overall order of the items from these lists (samples).

Example:

l1 = { a, b, f, h, z }
l2 = { c, e, h, x, z }
l3 = { a, e, y, z }
l4 = { b, e, f, z }

The result should be:

R = { a, b, c, e, f, h, x, y, z } or R = { a, b, c, e, f, h, y, x, z }

The elements carry no information about their natural order. The order must be learned from the lists, and in some cases the order in one list may contradict other lists, so I need the most likely order. I have about 175,000 lists and about 1.8 million items in total (260,000 unique); the number of items per list varies.

I have already tried building a directed graph in which each edge carries the number of lists that connect its two vertices in that order, and then walking all the paths to find the most likely sequence. This approach works for small problems, but it is far too expensive for a problem of this size.

Any pointers would be greatly appreciated.

Thanks.

Juan

1 answer

I think your problem is very similar to the problem of developing a rating system for multiplayer games. Unfortunately, I do not see an easy answer, especially given your amount of data. I would be inclined to treat each list of N elements as N-1 two-player games, each recording a contest between one player and the player just above them in the list. If you can afford it, you could treat each list as N(N-1)/2 two-player games, recording every comparison in the list. Either way, you could then apply a rating system for two-player games, for example https://en.wikipedia.org/wiki/Elo_rating_system .
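As a rough illustration of the N-1 games idea, here is a minimal Java sketch that treats each adjacent pair in a list as a game won by the earlier item and applies the standard Elo update. The class name `EloOrdering`, the starting rating of 1500, the K-factor of 16, and the number of passes are all assumptions chosen for illustration, not tuned values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: rates items with Elo, one "game" per adjacent pair.
class EloOrdering {
    static final double K = 16.0;        // assumed update step
    static final double START = 1500.0;  // assumed initial rating

    // Returns the items sorted from highest rating (earliest) to lowest.
    static List<String> order(List<List<String>> lists, int passes) {
        Map<String, Double> rating = new HashMap<>();
        for (int p = 0; p < passes; p++) {
            for (List<String> list : lists) {
                for (int i = 0; i + 1 < list.size(); i++) {
                    String winner = list.get(i);     // appears earlier
                    String loser = list.get(i + 1);  // appears later
                    double rw = rating.getOrDefault(winner, START);
                    double rl = rating.getOrDefault(loser, START);
                    // Expected score of the winner under the Elo model.
                    double expected = 1.0 / (1.0 + Math.pow(10.0, (rl - rw) / 400.0));
                    double delta = K * (1.0 - expected);
                    rating.put(winner, rw + delta);
                    rating.put(loser, rl - delta);
                }
            }
        }
        List<String> result = new ArrayList<>(rating.keySet());
        result.sort((x, y) -> Double.compare(rating.get(y), rating.get(x)));
        return result;
    }
}
```

Each pass costs time proportional to the total number of items, so this stays cheap even for 1.8 million items. Note that in this sketch an item that only ever appears in single-element lists never gets a rating and is left out of the result.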

Another approach would be to write down a penalty function that measures how badly any given order fits the data, and then try to minimize that penalty. There are a number of functions that compare two lists with each other, for example https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient and https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient . Kendall's rank correlation is based only on the number of pairwise comparisons that you would get right in one list if you used the other as a predictor, so it may have some nice properties. You might decide that the total penalty for a candidate order is the sum of the penalties you get when you compare that candidate with each of the input lists in turn.
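A Kendall-style penalty of this kind could look like the sketch below: for each input list it counts the pairs whose relative order is reversed in the candidate, and sums those counts. The class and method names are made up for illustration, and the code assumes every item of every input list appears in the candidate order.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts pairwise disagreements with a candidate order.
class KendallPenalty {
    // Pairs in `sample` that appear in the opposite order in the candidate.
    static long discordantPairs(Map<String, Integer> position, List<String> sample) {
        long count = 0;
        for (int i = 0; i < sample.size(); i++) {
            for (int j = i + 1; j < sample.size(); j++) {
                // sample says item i comes before item j; check the candidate.
                if (position.get(sample.get(i)) > position.get(sample.get(j))) {
                    count++;
                }
            }
        }
        return count;
    }

    // Sum of discordant pairs over all input lists.
    static long totalPenalty(List<String> candidate, List<List<String>> samples) {
        Map<String, Integer> position = new HashMap<>();
        for (int i = 0; i < candidate.size(); i++) {
            position.put(candidate.get(i), i);
        }
        long total = 0;
        for (List<String> sample : samples) {
            total += discordantPairs(position, sample);
        }
        return total;
    }
}
```

On the four example lists from the question, the candidate { a, b, c, e, f, h, x, y, z } has a penalty of 0, and the fully reversed candidate has a penalty of 32 (every one of the 10 + 10 + 6 + 6 pairs is wrong). The double loop is quadratic in the length of each list; counting inversions with a merge sort would bring that down to O(n log n) per list.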

One way to minimize such a penalty would be to start from a random order, and then repeatedly remove an item from the order and put it back at whichever position minimizes the penalty function, until no such change improves the situation. Unfortunately, given your amount of data, I don't think you can afford that.
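For small inputs, the remove-and-reinsert search might be sketched as follows. Everything here is illustrative: the class name `ReinsertionSearch` is invented, the penalty is assumed to be the count of reversed pairs summed over all input lists, and the starting order is assumed to contain every item. The search only reaches a local minimum, not necessarily the best order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: greedy local search by removing and reinserting items.
class ReinsertionSearch {
    // Assumed penalty: pairs in the samples that the order gets the wrong way round.
    static long penalty(List<String> order, List<List<String>> samples) {
        Map<String, Integer> pos = new HashMap<>();
        for (int i = 0; i < order.size(); i++) pos.put(order.get(i), i);
        long total = 0;
        for (List<String> s : samples)
            for (int i = 0; i < s.size(); i++)
                for (int j = i + 1; j < s.size(); j++)
                    if (pos.get(s.get(i)) > pos.get(s.get(j))) total++;
        return total;
    }

    static List<String> improve(List<String> start, List<List<String>> samples) {
        List<String> order = new ArrayList<>(start);
        long best = penalty(order, samples);
        boolean improved = true;
        while (improved) {                       // stop when a full sweep changes nothing
            improved = false;
            for (int i = 0; i < order.size(); i++) {
                String item = order.remove(i);   // take the item out
                int bestPos = i;                 // default: put it back where it was
                for (int p = 0; p <= order.size(); p++) {
                    order.add(p, item);          // try every insertion point
                    long candidate = penalty(order, samples);
                    if (candidate < best) {
                        best = candidate;
                        bestPos = p;
                        improved = true;
                    }
                    order.remove(p);
                }
                order.add(bestPos, item);
            }
        }
        return order;
    }
}
```

Each sweep evaluates the penalty roughly n² times for n unique items, and each evaluation touches every input list, which is why this is unlikely to be affordable with 260,000 unique items unless the penalty is updated incrementally.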

If you are prepared to turn your data into a list of two-player games between players of unknown strength, then a variety of approaches become available. If you represent the strengths of all players as a single vector, for example (strengthA, strengthB, strengthC, ...), then the probability of A beating B can depend on the dot product of this vector with the vector (1, -1, 0, ...). This suggests that you could try to find a good fit with logistic regression, a perceptron model, or support vector machines.
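One way to read the logistic regression suggestion: the dot product of the strength vector with (1, -1, 0, ...) is just strengthA - strengthB, so the model is P(A before B) = 1 / (1 + exp(-(strengthA - strengthB))). The sketch below fits those strengths with plain stochastic gradient ascent on the log-likelihood, using adjacent pairs as games. The class name, learning rate, and epoch count are assumptions for illustration, and there is no regularization.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: logistic model of "earlier item beats later item".
class PairwiseLogistic {
    // Returns a strength per item; a higher strength means earlier in the order.
    static Map<String, Double> fit(List<List<String>> lists, int epochs, double learningRate) {
        Map<String, Double> strength = new HashMap<>();
        for (int e = 0; e < epochs; e++) {
            for (List<String> list : lists) {
                for (int i = 0; i + 1 < list.size(); i++) {
                    String winner = list.get(i);     // observed earlier
                    String loser = list.get(i + 1);  // observed later
                    double sw = strength.getOrDefault(winner, 0.0);
                    double sl = strength.getOrDefault(loser, 0.0);
                    // Predicted probability that the winner comes first.
                    double p = 1.0 / (1.0 + Math.exp(-(sw - sl)));
                    // Gradient of the log-likelihood for this single observation.
                    double step = learningRate * (1.0 - p);
                    strength.put(winner, sw + step);
                    strength.put(loser, sl - step);
                }
            }
        }
        return strength;
    }
}
```

Sorting the items by descending strength then gives the estimated order. Items that always win or always lose have strengths that keep growing with more epochs, so in practice a penalty on large strengths or an early stop would probably be needed.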


Source: https://habr.com/ru/post/1268201/

