Faster processing of 5 million lines of coordinate data

I have a CSV file with two columns (latitude, longitude) that contains more than 5 million rows of geolocation data. I need to identify the points that are not within 5 miles of any other point in the list, and output everything back to another CSV with an extra column (CloseToAnotherPoint) that is True if there is another point within 5 miles, and False if none exists.

Here is my current solution using geopy (without making any web calls, just using the function to calculate the distance):

    from geopy.point import Point
    from geopy.distance import vincenty
    import csv


    class CustomGeoPoint(object):
        def __init__(self, latitude, longitude):
            self.location = Point(latitude, longitude)
            self.close_to_another_point = False


    try:
        output = open('output.csv', 'w')
        writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
        writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

        # 5 miles
        close_limit = 5
        geo_points = []

        with open('geo_input.csv', newline='') as geo_csv:
            reader = csv.reader(geo_csv)
            next(reader, None)  # skip the headers
            for row in reader:
                geo_points.append(CustomGeoPoint(row[0], row[1]))

        # for every point, look at every point until one is found within 5 miles
        for geo_point in geo_points:
            for geo_point2 in geo_points:
                dist = vincenty(geo_point.location, geo_point2.location).miles
                if 0 < dist <= close_limit:  # (0, close_limit]
                    geo_point.close_to_another_point = True
                    break
            writer.writerow([geo_point.location.latitude,
                             geo_point.location.longitude,
                             geo_point.close_to_another_point])
    finally:
        output.close()

As you can probably tell, this solution is very slow. So slow that I let it run for 3 days and it still hadn't finished!

I was thinking of trying to split the data into chunks (several CSV files or something else) so that the inner loop would not look at every other point, but then I would have to figure out how to make sure that the borders of each section are checked at the borders of its adjacent sections, and it seems too complicated, and I'm afraid it will be a headache rather than worth it.

So, any pointers on how to do this faster?

+5
8 answers

Let's look at what you're doing.

  • You read all the points into a list named geo_points.

    Now, can you tell me if the list is sorted? Because if it is, we definitely want to know that. Sortedness is valuable information, especially when you are dealing with 5 million of anything.

  • You loop over all the geo_points. That's 5 million of them, by your count.

  • Inside the outer loop, you loop over all 5 million geo_points again.

  • You calculate the distance in miles between the two loop elements.

  • If the distance is within your threshold, you record that on the first point and stop the inner loop.

  • When the inner loop stops, you write the outer loop element's information to the CSV file.

Notice a couple of things. First, you loop 5 million times in the outer loop. Then, for each of those, you loop 5 million times in the inner loop.

This is what O(n²) means.

The next time you see someone talking about "Oh, this is O(log n), but that other thing is O(n log n)," remember this experience: you're running an n² algorithm where n in this case is 5,000,000. Sucks, dunnit?

Anyway, you have some problems.

Problem 1: You will end up comparing every point against itself. That comparison should come back with a distance of zero, meaning every point will be marked as within whatever distance threshold you pick. If your program ever finishes, all the cells will be marked True.

Problem 2: When you compare point #1 with, say, point #12345, and they are within the threshold distance of each other, you record that information about point #1. But you do not record the same information about the other point. You know that point #12345 (geo_point2) is reflexively within the threshold of point #1, but you don't write that down. Because of that, you're missing a chance to skip more than 5 million comparisons.

Problem 3: If you compare point #1 and point #2, and they are not within the threshold distance, what happens when you compare point #2 with point #1? Your inner loop starts at the beginning of the list every time, but you know you have already compared the start of the list with the end of the list. You can halve your problem space by simply making your outer loop go i in range(0, 5million) and your inner loop go j in range(i+1, 5million), as sketched below.
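
Here is a minimal sketch of Problems 2 and 3 fixed together, reusing the names from your own code (still O(n²) in the worst case, but no comparison is ever repeated):

    n = len(geo_points)
    for i in range(n):
        if geo_points[i].close_to_another_point:
            continue  # already flagged when an earlier point found it
        for j in range(i + 1, n):
            dist = vincenty(geo_points[i].location,
                            geo_points[j].location).miles
            if dist <= close_limit:
                # record the hit on both ends of the pair
                geo_points[i].close_to_another_point = True
                geo_points[j].close_to_another_point = True
                break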

The answers

Consider your latitude and longitude on a flat plane. You want to know if there is a point within 5 miles. Think of a 10-mile square centered on your point #1: a square centered at (X1, Y1), with an upper left corner at (X1 - 5 miles, Y1 + 5 miles) and a lower right corner at (X1 + 5 miles, Y1 - 5 miles). Now, if a point is inside this square, it may or may not be within 5 miles of point #1. But if it is outside this square, you can bet it is more than 5 miles away.

As @SeverinPappadeaux points out, distance on a spheroid like the Earth is not quite the same as distance on a flat plane. But so what? Make your square a little bigger to allow for the difference, and carry on!
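
As a sketch of that "a bit larger" square, assuming roughly 69 miles per degree of latitude (the same figure the sweep line answer below uses) and a 5% cushion:

    from math import cos, radians

    MILES_PER_DEGREE = 69.0  # approximate length of one degree of latitude

    def bounding_spans(lat, limit_miles=5.0, cushion=1.05):
        # Half-widths, in degrees, of a square guaranteed to contain the
        # limit_miles circle around a point at this latitude. The cushion
        # pads the square to absorb the flat-plane-vs-spheroid error.
        lat_span = cushion * limit_miles / MILES_PER_DEGREE
        long_span = lat_span / cos(radians(lat))  # longitude degrees shrink away from the equator
        return lat_span, long_span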

Sorted list

This is why sorting is important. If all the points were sorted by X then Y (or by Y then X, whichever), and you knew that, you could really speed things up: you could stop scanning as soon as the X (or Y) coordinate got too big, and you would not have to go through all 5 million points.

How would that work? The same as before, except the inner loop would get some quick early-exit checks:

    five_miles = ...  # Whatever math, plus an error allowance!
    list_len = len(geo_points)  # Don't call len() 5 million times

    for i, pi in enumerate(geo_points):
        if pi.close_to_another_point:
            continue  # Remember if close to an earlier point

        pi0max = pi[0] + five_miles
        pi1min = pi[1] - five_miles
        pi1max = pi[1] + five_miles

        for j in range(i + 1, list_len):
            pj = geo_points[j]
            # Assumes geo_points is sorted on [0] then [1]
            if pj[0] > pi0max:
                # Can't possibly be close enough, nor can any later points
                break
            if pj[1] < pi1min or pj[1] > pi1max:
                # Can't be close enough, but a later point might be
                continue

            # Now do the "real" comparison using accurate functions.
            if ...:
                pi.close_to_another_point = True
                pj.close_to_another_point = True
                break

What am I doing there? First, I pull some numbers into local variables. Then I use enumerate to give me an i value and a reference to the outer point (what you called geo_point). Then I quickly check whether we already know this point is close to another.

If we don't know yet, we have to scan. So I only look at the "later" points in the list, because I know the outer loop has already handled the earlier ones, and I definitely do not want to compare a point against itself. I use a few temporary variables to cache computations involving the outer point, and inside the inner loop I do cheap comparisons against those temporaries. They cannot tell me whether two points are close to each other, but they can tell me when two points are definitely not close, and then I skip ahead.

Finally, if the cheap checks pass, go ahead and do the expensive check. If that test actually passes, be sure to record the result on both points, so we can skip the second point later.

Unsorted List

But what if the list is not sorted?

@RootTwo points you at the kD tree (where D is for "dimension" and k in this case is 2). The idea is really simple if you already know about binary search trees: you cycle through the dimensions, comparing X at even levels of the tree and Y at odd levels (or vice versa). The idea would be this:

    def insert_node(node, treenode, depth=0):
        dimension = depth % 2  # even/odd -> lat/long
        dn = node.coord[dimension]
        dt = treenode.coord[dimension]

        if dn < dt:
            # go left
            if treenode.left is None:
                treenode.left = node
            else:
                insert_node(node, treenode.left, depth + 1)
        else:
            # go right
            if treenode.right is None:
                treenode.right = node
            else:
                insert_node(node, treenode.right, depth + 1)
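
The sketch above never defines the node type, so here is a minimal illustrative one (KDNode and points are my names, not from any library), along with the build loop:

    class KDNode(object):
        def __init__(self, coord):
            self.coord = coord  # (lat, long) tuple
            self.left = None
            self.right = None

    # points is assumed to be a list of (lat, long) tuples
    root = KDNode(points[0])
    for coord in points[1:]:
        insert_node(KDNode(coord), root)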

What does it do? It gives you a searchable tree where a point can be inserted in O(log n). That means O(n log n) for the whole list, which is much better than n squared! (The base-2 log of 5 million is basically 23. So n log n is 5 million times 23, compared with 5 million times 5 million!)

It also means you can do targeted searches. Since the tree is ordered, it is fairly simple to look for "close" points (the Wikipedia link from @RootTwo provides an algorithm).
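
For a taste of that targeted search, here is a hedged sketch of a box query over the tree built above. It descends into a subtree only when the splitting coordinate allows points inside the box (lo and hi are the box's (lat, long) corners):

    def points_in_box(treenode, lo, hi, depth=0, found=None):
        # Collect every node whose coord lies inside the box [lo, hi].
        if found is None:
            found = []
        if treenode is None:
            return found
        d = depth % 2
        if lo[d] < treenode.coord[d]:
            # the left subtree (smaller values) may still reach the box
            points_in_box(treenode.left, lo, hi, depth + 1, found)
        if hi[d] >= treenode.coord[d]:
            # the right subtree (greater-or-equal values) may still reach it
            points_in_box(treenode.right, lo, hi, depth + 1, found)
        if all(lo[k] <= treenode.coord[k] <= hi[k] for k in (0, 1)):
            found.append(treenode)
        return found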

Tip

My advice is to simply write code to sort the list, if necessary. It is easier to write, easier to verify by hand, and it is a separate pass you only need to make once.

After sorting the list, try the approach shown above. This is close to what you did, and it should be easy for you to understand and code.

+4

As pointed out above, quickly calculating a large number of distances is a classic use case for kD trees.

An alternative is to use a sweep line algorithm, as shown in an answer to "How do I match similar coordinates using Python?"

Here, the sweep line algorithm is adapted to your question. On my laptop, it takes under 5 minutes to go through 5M random points.

    import operator as op
    import sortedcontainers  # handy library on PyPI
    import time
    from collections import namedtuple
    from math import cos, radians
    from random import uniform

    Point = namedtuple("Point", "lat long has_close_neighbor")

    miles_per_degree = 69

    number_of_points = 5000000
    data = [Point(uniform( -88.0,  88.0),   # lat
                  uniform(-180.0, 180.0),   # long
                  True
                  )
            for _ in range(number_of_points)
            ]

    start = time.time()

    # Note: lat is first in Point, so data is sorted by .lat then .long.
    data.sort()
    print(time.time() - start)

    # Parameter that determines the size of the sliding latitude window,
    # and therefore how close two points need to be to get flagged.
    threshold = 5.0  # miles
    lat_span = threshold / miles_per_degree
    coarse_threshold = (.98 * lat_span)**2  # in degrees, to match the deltas below

    # Sliding latitude window. Within the window, observations are
    # ordered by longitude.
    window = sortedcontainers.SortedListWithKey(key=op.attrgetter('long'))

    # lag_pt is the 'southernmost' point within the sliding window.
    point = iter(data)
    lag_pt = next(point)

    milepost = len(data) // 10

    # lead_pt is the 'northernmost' point in the sliding window.
    for i, lead_pt in enumerate(data):
        if i == milepost:
            print('.', end=' ')
            milepost += len(data) // 10

        # The latitude of lead_pt is the leading edge of the window.
        window.add(lead_pt)

        # Remove observations past the trailing edge of the window.
        while lead_pt.lat - lag_pt.lat > lat_span:
            window.discard(lag_pt)
            lag_pt = next(point)

        # Calculate the 'east-west' width of the window at lead_pt's latitude.
        long_span = lat_span / cos(radians(lead_pt.lat))
        east_long = lead_pt.long + long_span
        west_long = lead_pt.long - long_span

        # Check all observations in the sliding window within
        # long_span of lead_pt.
        for other_pt in window.irange_key(west_long, east_long):
            if other_pt != lead_pt:
                # lead_pt is at the top center of a box 2 * long_span wide by
                # 1 * lat_span tall, and other_pt is in that box. If desired,
                # put additional fine-grained 'closeness' tests here.

                # Coarse check: if any point is within 98% of the threshold
                # distance, there is no need to check any more neighbors.
                average_lat = (other_pt.lat + lead_pt.lat) / 2
                delta_lat = other_pt.lat - lead_pt.lat
                delta_long = (other_pt.long - lead_pt.long) * cos(radians(average_lat))

                if delta_lat**2 + delta_long**2 <= coarse_threshold:
                    break

                # put a vincenty test here for full accuracy, e.g.:
                # if 0 < vincenty(lead_pt, other_pt).miles <= threshold:
                #     break
        else:
            data[i] = data[i]._replace(has_close_neighbor=False)

    print()
    print(time.time() - start)
+3

If you sort the list by latitude (O(n log n)), and the points are roughly evenly distributed, that narrows it down to about 1000 candidate points within 5 miles of latitude for each point (napkin math, not exact). By looking only at points of nearby latitude, the runtime goes from n^2 to n*log(n) + .0004*n^2. Hopefully that speeds it up enough.

+2

I would rework the algorithm in three steps:

  • Use the great-circle distance, and accept a 1% error, so make the limit equal to 1.01 * limit.

  • Code the great-circle distance as an inlined function; this test should be fast (a sketch follows this list).

  • You will get some false positives, which you could further check with vincenty.
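
As a sketch of what that inlined great-circle function could look like (the haversine formula; the radius constant is my assumption):

    from math import asin, cos, radians, sin, sqrt

    EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

    def great_circle_miles(lat1, long1, lat2, long2):
        # Haversine formula: no library calls, well within the 1% error budget.
        lat1, long1, lat2, long2 = map(radians, (lat1, long1, lat2, long2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((long2 - long1) / 2) ** 2)
        return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))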

+1

I would give pandas a try. Pandas is designed for efficient handling of large amounts of data, and it can help with the efficiency of the CSV part. But by the sound of it, you have an inherently inefficient algorithm to fix. You take point 1 and compare it with 4,999,999 other points. Then you take point 2 and compare it with 4,999,998 other points, and so on. Do the math: that is about 12.5 trillion comparisons. If you can do 1,000,000 comparisons per second, that is 144 days of computation. If you can do 10,000,000 comparisons per second, 14 days. Simple additions in plain Python run at roughly 10,000,000 operations in 1.1 seconds, and I doubt your comparisons are anywhere near as fast as an addition. So figure on at least a week or two.
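
A quick back-of-envelope check of those figures:

    n = 5000000
    pairs = n * (n - 1) // 2  # ~12.5 trillion pairwise comparisons
    for per_second in (1e6, 1e7):
        print(pairs / per_second / 86400, "days")  # ~144.7 and ~14.5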

Alternatively, you could find a better algorithm, although I don't have a specific one in mind.

+1

This is just a first pass, but so far I have sped it up by half by using great_circle() instead of vincenty(), and by cleaning up a couple of other things. The difference is explained here, and the loss of accuracy is about 0.17%:

    from geopy.point import Point
    from geopy.distance import great_circle
    import csv

    CLOSE_LIMIT = 5  # miles; assumed here, since the original snippet never defines it


    class CustomGeoPoint(Point):
        def __init__(self, latitude, longitude):
            super(CustomGeoPoint, self).__init__(latitude, longitude)
            self.close_to_another_point = False


    def isCloseToAnother(pointA, points):
        for pointB in points:
            dist = great_circle(pointA, pointB).miles
            if 0 < dist <= CLOSE_LIMIT:  # (0, close_limit]
                return True
        return False


    with open('geo_input.csv', 'r') as geo_csv:
        reader = csv.reader(geo_csv)
        next(reader, None)  # skip the headers
        geo_points = sorted(map(lambda x: CustomGeoPoint(x[0], x[1]), reader))

    with open('output.csv', 'w') as output:
        writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
        writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

        # for every point, look at every other point until one is found within the limit
        for point in geo_points:
            point.close_to_another_point = isCloseToAnother(point, geo_points)
            writer.writerow([point.latitude, point.longitude,
                             point.close_to_another_point])

I am going to improve it further.

Before:

    $ time python geo.py

    real    0m5.765s
    user    0m5.675s
    sys     0m0.048s

After:

    $ time python geo.py

    real    0m2.816s
    user    0m2.716s
    sys     0m0.041s
+1

The best solution here is the one suggested by Oscar Smith. You have a CSV file, so just sort it in Excel (it is very efficient). Then use binary search in your program to find the cities within 5 miles (you can make a small change to the binary search method so that it breaks as soon as it finds one city that satisfies your condition).

Another improvement is to use a map to remember pairs of cities whenever you find one city close to another. For example, when you find city A within 5 miles of city B, store the pair in the map (B is the key and A is the value). The next time you encounter B, look it up in the map first; if it has a value, you do not need to check it again. This can use a lot of memory, though, so watch out for that. Hope this helps; a rough sketch follows.
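
A rough, hypothetical sketch of the binary search plus map idea (it assumes points is a list of (lat, long) tuples sorted by latitude, lat_span is the 5-mile latitude window from the earlier answers, and accurate_distance stands in for whatever precise test, such as vincenty, you choose):

    import bisect

    close_pairs = {}  # index -> index of a known close neighbor

    def has_neighbor(i, points, lat_span):
        if i in close_pairs:
            return True  # answered earlier, when its partner was scanned
        lat = points[i][0]
        # binary search for the slice of points within lat_span of latitude
        lo = bisect.bisect_left(points, (lat - lat_span,))
        hi = bisect.bisect_right(points, (lat + lat_span,))
        for j in range(lo, hi):
            # accurate_distance: hypothetical precise test (e.g. geopy's vincenty)
            if j != i and accurate_distance(points[i], points[j]) <= 5:
                close_pairs[i] = j
                close_pairs[j] = i  # remember the pair in both directions
                return True
        return False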

+1

This problem can be solved with a VP tree. VP trees let you query data using distances that are metrics obeying the triangle inequality.

The big advantage of VP trees over kD trees is that they can handle geographic data anywhere in the world blindly, without having to worry about projecting it onto a suitable 2D space. In addition, a true geodesic distance can be used (no need to worry about the differences between geodesic distances and distances in a projection).

Here is my test: generate 5 million points randomly and uniformly over the globe, and put them in a VP tree.

Looping over all the points, query the VP tree to find any neighbor at a distance in (0 km, 10 km). (0 km is excluded so that the query point does not find itself.) Count the points with no such neighbor (229573 of them, in my case).

Cost of building the VP tree: 5,000,000 * 20 distance calculations.

Cost of the queries: 5,000,000 * 23 distance calculations.

Setup plus queries took 5m 7s.

I used C++ with GeographicLib to calculate the distances, but the algorithm can of course be implemented in any language, and there is a Python version of GeographicLib.
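
For example, with the Python port (geographiclib on PyPI), Inverse() returns a dict whose 's12' entry is the geodesic distance in meters (the coordinates here are just placeholders):

    from geographiclib.geodesic import Geodesic

    g = Geodesic.WGS84.Inverse(40.6, -73.8, 49.0, 2.55)  # lat1, long1, lat2, long2
    print(g['s12'] / 1609.344)  # geodesic distance, converted from meters to miles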

Addendum: the C++ code implementing this approach is given here.

+1

Source: https://habr.com/ru/post/1244983/

