See what you do.
You read all the points in the list named geo_points
.
Now, can you tell me if the list is sorted? Because if it was sorted, we definitely want to know. Sorting is valuable information, especially when you are dealing with 5 million things.
You geo_points
over all geo_points
. This is 5 million, in your opinion.
Inside the outer loop, you turn on all 5 million geo_points
.
You calculate the distance in miles between two elements of the cycle.
If the distance is less than your threshold, you write this information to the first point and stop the inner loop.
When the inner loop stops, you write information about the outer loop element to the CSV file.
Pay attention to a few things. First, you loop 5 million times in the outer loop. And then you loop 5 million times in the inner loop.
This is what O (nΒ²) means.
The next time you see someone talking about "Oh, this is O (log n), but the other thing is O (n log n)," remember this experience - you use the nΒ² algorithm, where n in this case is 5,000,000. Sucks, dunnit?
Anyway, you have some problems.
Problem 1: Ultimately, you will end up comparing each point with yourself. Which should have a zero distance, that is, they will all be marked as within any distance threshold. If your program ever ends, all cells will be marked True.
Problem 2: When you compare point # 1 with, say, point # 12345, and are within a threshold distance from each other, you record this information about point # 1. But you do not record the same information about another point. You know that point number 12345 (geo_point2) is reflectively within the threshold of point number 1, but you are not recording it. Thus, you do not have a chance to miss more than 5 million comparisons.
Problem 3: If you compare points No. 1 and point No. 2, and they are not within the threshold distance, what happens when you compare point No. 2 with point No. 1? Your inner loop starts at the beginning of the list every time, but you know that you have already compared the beginning of the list to the end of the list. You can halve the problem space by simply making your outer loop go i in range(0, 5million)
and your inner loop j in range(i+1, 5million)
.
The answers
Consider your latitude and longitude on a flat plane. You want to know if there is a point within a radius of 5 miles. Think of a 10-kilometer square centered on your No. 1 point. This is a square centered (X1, Y1) with the upper left corner (X1 - 5miles, Y1 + 5miles) and the lower right corner (X1 + 5miles, Y1 - 5miles). Now, if the point is within this square, it may not be within 5 miles of your No. 1 point. But you can bet that if it is outside this square, it is more than 5 miles from the hotel.
As @SeverinPappadeaux points out, the distance on a spheroid like Earth is not quite the same as the distance on a flat plane. But what? Set the square a bit larger to account for the difference, and go on!
Sorted list
This is why sorting is important. If all points were sorted by X, then Y (or Y, then X - independently), and you knew that, you could really speed things up. Since you could just stop scanning when the X (or Y) coordinate got too big and you would not need to go through 5 million points.
How will it work? Same as before, except that the inner loop will have the following checks:
five_miles = ... # Whatever math, plus an error allowance! list_len = len(geo_points) # Don't call this 5 million times for i, pi in enumerate(geo_points): if pi.close_to_another_point: continue # Remember if close to an earlier point pi0max = pi[0] + five_miles pi1min = pi[1] - five_miles pi1max = pi[1] + five_miles for j in range(i+1, list_len): pj = geo_points[j] # Assumes geo_points is sorted on [0] then [1] if pj[0] > pi0max: # Can't possibly be close enough, nor any later points break if pj[1] < pi1min or pj[1] > pi1max: # Can't be close enough, but a later point might be continue # Now do "real" comparison using accurate functions. if ...: pi.close_to_another_point = True pj.close_to_another_point = True break
What am I doing there? First, I get some numbers in local variables. Then I use enumerate
to give me the value of i
and a reference to an external point. (What you called geo_point
). Then I quickly check if we know that this moment is close to another.
If not, we will have to scan. Therefore, I only look at the βlaterβ points in the list, because I know that the outer loop looks at the earlier ones, and I definitely do not want to compare the point with myself. I use several temporary variables to cache the results of calculations associated with the outer loop. Inside the inner loop, I make silly comparisons to temporary ones. They cannot tell me if the two points are close to each other, but I can check if they are definitely not close and skip ahead.
Finally, if simple checks pass, go ahead and complete costly checks. If the test really passes, be sure to record the result at both points, so we can skip the second part later.
Unsorted List
But what if the list is not sorted?
@RootTwo points you to the tree kD (where D is for "dimension" and k in this case is "2"). The idea is very simple if you already know about binary search trees: you look at dimensions, compare X at even levels in a tree and compare Y at odd levels (or vice versa). The idea would be this:
def insert_node(node, treenode, depth=0): dimension = depth % 2 # even/odd -> lat/long dn = node.coord[dimension] dt = treenode.coord[dimension] if dn < dt: # go left if treenode.left is None: treenode.left = node else: insert_node(node, treenode.left, depth+1) else: # go right if treenode.right is None: treenode.right = node else: insert_node(node, treenode.right, depth+1)
What would it do? This will give you a searchable tree where points can be inserted in O (log n). This means that O (n log n) for the whole list, which is better than n squares! (A base base of 2 out of 5 million is basically 23. So, n log n 5 million times 23, compared with 5 million times 5 million!)
It also means that you can perform a targeted search. Since the tree is ordered, itβs pretty simple to look for βcloseβ points (the Wikipedia link from @RootTwo provides an algorithm).
Tip
My advice is simply to write code to sort the list if necessary. This is easier to write, and easier to check manually, and this is a separate passage that you will only need to do once.
After sorting the list, try the approach shown above. This is close to what you did, and it should be easy for you to understand and code.