The algorithm for choosing the best combination of two lists

I have a two-way search result. Thus, there are two lists that contain departure and arrival flights, such as:

  • The list of departing flights consists of 20 flights.
  • The list of arrival flights is 30 flights

So, I will have 600 (20 * 30) combinations between flight departure and arrival flight. I will call the list of combinations the list of results

However, I just want to choose a limit of 600 combinations. For example, I will choose the best of 100 flights. The criteria for combining flights is the cheap price of departure and arrival flight.

To do this, I will sort the list of results by the total price of departure and arrival flight. And then I pick up the first 100 items from the list of results to get what I want.

But if there are 200 flights in the list of departing flights and 300 flights in the list of arrival flights, I will have a list of results with 60,000 items. For this reason, I will sort the list with 60,000 items to find the top 100 items.

So, is there any algorithm for choosing the best combinations as my case.

Thank you very much.

0
source share
2 answers

It is not 100% clear from your question, but I understand that you are looking for a faster algorithm to search for a certain number of the best / cheapest combinations of departure and arrival flights.

You can do this much faster by sorting the departure and arrival flight lists individually by cost, and then using the heap to expand the next best combinations one by one until you have enough.

Here the complete algorithm is in Python, but without using any special libraries, only standard data structures, so this should be easily ported to any other language:

NUM_FLIGHTS, NUM_BEST = 1000, 100 # create test data: each entry corresponds to just the cost of one flight from random import randint dep = sorted([randint(1, 100) for i in range(NUM_FLIGHTS)]) arr = sorted([randint(1, 100) for i in range(NUM_FLIGHTS)]) def is_compatible(i, j): # for checking constraints, eg timing of flights return True # but for now, assume no constraints # get best combination using sorted lists and heap from heapq import heappush, heappop heap = [(dep[0] + arr[0], 0, 0)] # initial: best combination from dep and arr result = [] # the result list visited = set() # make sure not to add combinations twice while heap and len(result) < NUM_BEST: cost, i, j = heappop(heap) # get next-best combination if (i, j) in visited: continue # did we see those before? skip visited.add((i, j)) if is_compatible(i, j): # if 'compatible', add to results result.append((cost, dep[i], arr[j])) # add 'adjacent' combinations to the heap if i < len(dep) - 1: # next-best departure + same arrival heappush(heap, (dep[i+1] + arr[j], i+1, j)) if j < len(arr) - 1: # same departure + next-best arrival heappush(heap, (dep[i] + arr[j+1], i, j+1)) print result # just for testing: compare to brute-force (get best from all combinations) comb = [(d, a) for d in dep for a in arr] best = sorted((d+a, d, a) for (d, a) in comb)[:NUM_BEST] print best print result == best # True -> same results as brute force (just faster) 

This is something like this:

  • sort both dep departure and arr flights by cost
  • create a heap and put the best combination (the best departure and the best arrival), as well as the corresponding indices in your lists in the heap: (dep[0] + arr[0], 0, 0)
  • repeat until you have enough combinations or there are no more elements in the heap:
    • pull the best item from the heap (sorted by total cost)
    • if it meets the constraints, add it to the result set
    • make sure you don't add flights to the result set twice using the visited set
    • add two "adjacent" combinations to the heap, that is, taking the same flight from dep , and the next from arr , and the next from dep and the same from arr , i.e. (dep[i+1] + arr[j], i+1, j) and (dep[i] + arr[j+1], i, j+1)

Here is a very small processed example. The axes are (expenses) flights dep and arr , and the entries in the table are presented in the form n(c)m , where n is the iteration into which the entry in heap was added (if at all), c is the cost, and m - iteration added to the list of 'top 10' result (if any).

 dep\arr 1 3 4 6 7 2 0(3)1 1(5)4 4(6)8 8(8)- - 2 1(3)2 2(5)6 6(6)9 9(8)- - 3 2(4)3 3(6)7 7(7)- - - 4 3(5)5 5(7)- - - - 6 5(7)10 - - - - Result: (1,2), (1,2), (1,3), (3,2), (1,4), (3,2), (3,3), (2,4), (2,4), (1,6) 

Notice how the sums in the columns and rows of the matrix always increase, so the best results can always be found in the somewhat triangular area in the upper left corner. Now the idea is that if your best combination (the first on the heap) is dep[i], arr[i] , then there is no need to check, for example, the combination dep[i+2], arr[i] before checking dep[i+1], arr[i] , which should have a lower overall cost, so you add dep[i+1], arr[i] (and similarly dep[i], arr[i+1] ) and repeat by selecting the next item from the heap.


I compared the results of this algorithm with the results of your approach with brute force, and the received flights coincide, i.e. The algorithm works and always gives an optimal result. The complexity must be O (n log (n)) for sorting the departure and arrival lists (n is the number of flights in these source lists), plus O (m log (m)) for the heap cycle (m iterations with log (m) work on iteration, m is the number of elements in the list of results).

It finds the best 1000 combinations of 100,000 departures and 100,000 arrival flights (totaling 1,000,000,000,000 possible combinations) in less than one second.

Please note that these numbers are intended for the case when you do not have additional restrictions, i.e. Each departure flight can be combined with each arrival flight. If there are limitations, you can use the is_compatible function drawn in the code above to test them and skip this pairing. This means that for each incompatible pair with a low total cost, the cycle requires another iteration. This means that in the worst case, for example, if there are no compatible pairs at all or when the only compatible pairs are those with the maximum total cost, the algorithm can actually expand all the combinations.

On average, however, this should not be, and the algorithm should run pretty quickly.

+4
source

I think the best solution would be to use some SQL statements to work with the Cartesian product. You can apply any filters based on the data itself, order, range selection, etc. Something like that:

 SELECT d.time as dep_time, a.time as arr_time, d.price+a.price as total_price FROM departures d, arrivals a WHERE a.time > d.time + X ORDER BY d.price+a.price LIMIT 0,100 

In fact, X may be equal to 0, but the arrival should occur AFTER ABSENCE.

Why would I choose SQL:

  • This is the closest to the data itself, you do not need to request it
  • It is highly optimized if you use indexes, I am sure that you cannot beat your performance with your own code.
  • It is simple and declarative :)
+1
source

Source: https://habr.com/ru/post/1209900/


All Articles