Numpy Arrays: Filling and Retrieving Data Quickly

See the important clarification at the bottom of this question.

I use numpy to speed up the processing of longitude/latitude coordinates. Unfortunately, my numpy "optimization" made my code run about 5 times slower than it did without numpy.

The bottleneck seems to be getting my data into the numpy array, and then getting it back out after performing the math transformations. To populate the array, I basically have a loop like:

    point_list = GetMyPoints()  # returns a long list of (lon, lat) coordinate pairs
    n = len(point_list)
    point_buffer = numpy.empty((n, 2), numpy.float32)
    for point_index in xrange(0, n):
        point_buffer[point_index] = point_list[point_index]

This loop, just filling the numpy array before any work is even done on it, is extremely slow, much slower than the entire calculation was without numpy. (That is, it is not just the slowness of the Python loop itself; apparently there is some huge overhead in transferring each small piece of data from Python into numpy.) There is similar slowness on the other end: after I have processed the numpy arrays, I access each modified coordinate pair in a loop, again as

    some_python_tuple = point_buffer[index]

Again, this loop for pulling the data back out is much slower than the entire original calculation without numpy. So, how do I actually populate a numpy array and extract data from it in a way that doesn't defeat the purpose of using numpy in the first place?

I am reading the data from a shapefile using a C library, which hands me the data as a regular Python list. I understand that if the library handed me the coordinates already in a numpy array, there would be no "filling" of a numpy array to do. But unfortunately, my starting point for the data is a regular Python list. And more to the point, in general I want to understand how to quickly populate a numpy array with data from Python.

Clarification

The loop shown above is actually simplified. I wrote it that way in this question because I wanted to focus on the problem I was seeing when trying to fill a numpy array in a loop. I now understand that doing it that way is simply slow.

In my actual application, I have a shapefile of coordinate points, and I have an API to extract the points for a given object. There are something like 200,000 objects. So I repeatedly call a function GetShapeCoords( i ) to get the coordinates for object i. This returns a list of lists, where each sublist is a list of lon/lat pairs; the reason it is a list of lists is that some of the objects are multi-part (i.e., multi-polygon). In my original code, as I read in each object, I did the transformation on each point by calling a regular Python function, and then drew the transformed points using PIL. The whole thing took about 20 seconds to draw all 200,000 polygons. Not terrible, but plenty of room for improvement. I noticed that at least half of those 20 seconds were spent doing the transformation logic, so I thought I would do that part in numpy. My initial implementation was just to read the objects in one at a time, keep appending all the points from the sublists into one big numpy array, and then do the math in numpy on that array.

So now I understand that passing a whole Python list to numpy is the right way to set up a big array. But in my case I only read one object at a time. So one thing I could do is keep appending points into a big Python list of lists of lists. And then, once I had accumulated the points from some number of objects that way (say, 10,000 objects), I could simply hand that monster list to numpy.

So now my question has three parts:

(a) Is it true that numpy can take that big, irregularly shaped list of lists of lists and slurp it in quickly?

(b) I then want to be able to transform all the points at the leaves of that monster tree. What is the expression to tell numpy, for example, "go into each sublist, and then into each sub-sublist, and then for each coordinate pair you find in those sub-sublists, multiply the first value (the lon coordinate) by 0.5"? Can I do that?

(c) Finally, I need to get those transformed coordinates back out in order to draw them.

Winston's answer below shows how I can do this using itertools. What I want to do is very similar to what Winston does, flattening the list. But I can't just flatten it. When I go to draw the data, I need to know where one polygon stops and the next begins. So I think I could make it work if there were a way to quickly mark the end of each polygon (i.e., each sub-sublist) with a special coordinate pair like (-1000, -1000) or something. Then I could flatten with itertools as in Winston's answer, and then do the transformation in numpy. Then I need to draw from point to point using PIL, and here I think I would need to convert the modified numpy array back into a Python list and iterate over that list in a regular Python loop to do the drawing. Does that seem like my best option, short of just writing a C module to handle all the reading and drawing for me in one step?
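Just to make the sentinel idea concrete, here is a rough sketch of what I have in mind (the data and the draw_polygon stub here are made up purely for illustration; the real drawing would go through PIL):

    import itertools
    import numpy

    SENTINEL = (-1000.0, -1000.0)   # impossible lon/lat pair marking the end of a polygon

    def pairs_with_sentinels(objects):
        # objects: each object is a list of polygons, each polygon a list of (lon, lat) pairs
        for obj in objects:
            for polygon in obj:
                for pair in polygon:
                    yield pair
                yield SENTINEL

    def draw_polygon(points):
        # stand-in for the real PIL drawing code
        print 'drawing polygon with %d points' % len(points)

    # tiny made-up dataset: two objects, three polygons in total
    objects = [
        [[(10.0, 20.0), (11.0, 21.0), (12.0, 22.0)]],
        [[(30.0, 40.0), (31.0, 41.0)], [(50.0, 60.0), (51.0, 61.0)]],
    ]

    # flatten the pairs into individual floats so numpy.fromiter can consume them
    values = itertools.chain.from_iterable(pairs_with_sentinels(objects))
    coords = numpy.fromiter(values, numpy.float32).reshape(-1, 2)

    # the transform only touches column 0; latitudes never reach -1000,
    # so column 1 still identifies the sentinel rows afterwards
    coords[:, 0] *= 0.5

    # walk the flat array, splitting on sentinels, and draw each polygon
    current = []
    for lon, lat in coords:
        if lat == SENTINEL[1]:
            draw_polygon(current)
            current = []
        else:
            current.append((lon, lat))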

+4
3 answers

You describe your data as "lists of lists of lists of coordinates." From that, I assume your extraction looks like this:

    for x in points:
        for y in x:
            for z in y:
                # z is a tuple with GPS coordinates
                pass

Do this:

    # initially, points is a list of lists of lists
    points = itertools.chain.from_iterable(points)
    # now points is an iterable producing lists
    points = itertools.chain.from_iterable(points)
    # now points is an iterable producing coordinates
    points = itertools.chain.from_iterable(points)
    # now points is an iterable producing individual floating point values
    data = numpy.fromiter(points, float)
    # data is a numpy array containing all the coordinates
    data = data.reshape(data.size/2, 2)
    # data has now been reshaped to be an nx2 array

itertools and numpy.fromiter are both implemented in C and really efficient. As a result, this should make the conversion very fast.

The second part of your question doesn't really indicate what you want to do with the data. Indexing individual elements of a numpy array is slower than indexing Python lists; you get speed by doing operations on the data in bulk. Without knowing more about what you are doing with the data, it's hard to suggest how to fix it.
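For example, a quick sketch (with made-up data) of the difference between touching elements one at a time and doing the same kind of work as one bulk operation:

    import numpy

    coords = numpy.random.random((100000, 2)).astype(numpy.float32)

    # slow: every element goes through Python-level indexing
    for i in xrange(len(coords)):
        coords[i, 0] = coords[i, 0] * 0.5

    # fast: the same kind of work done as a single bulk operation on the column
    coords[:, 0] *= 0.5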

UPDATE:

I went ahead and wrote the whole thing using itertools and numpy. I am not responsible for any brain damage caused by attempting to understand this code.

    # firstly, we use imap to call GetMyPoints a bunch of times
    objects = itertools.imap(GetMyPoints, xrange(100))
    # next, we use itertools.chain to flatten it into all of the polygons
    polygons = itertools.chain.from_iterable(objects)
    # tee gives us two iterators over the polygons
    polygons_a, polygons_b = itertools.tee(polygons)
    # the lengths will be the length of each polygon
    polygon_lengths = itertools.imap(len, polygons_a)
    # for the actual points, we'll flatten the polygons into points
    points = itertools.chain.from_iterable(polygons_b)
    # then we'll flatten the points into values
    values = itertools.chain.from_iterable(points)
    # package all of that into a numpy array
    all_points = numpy.fromiter(values, float)
    # reshape the numpy array so we have two values for each coordinate
    all_points = all_points.reshape(all_points.size // 2, 2)
    # produce an iterator of lengths, but put a zero in front
    polygon_positions = itertools.chain([0], polygon_lengths)
    # produce another numpy array from this
    # however, we take the cumulative sum
    # so that each index will be the starting index of a polygon
    polygon_positions = numpy.cumsum(
        numpy.fromiter(polygon_positions, int))
    # now for the transformation
    # multiply the first coordinate of every point by .5
    all_points[:, 0] *= .5
    # now to get it out
    # polygon_positions is all of the starting positions
    # polygon_positions[1:] is the same, but shifted one forward,
    # thus it gives us the end of each slice
    # slice makes these all slice objects
    slices = itertools.starmap(slice,
                               itertools.izip(polygon_positions, polygon_positions[1:]))
    # polygons produces an iterator which uses the slices to fetch
    # each polygon
    polygons = itertools.imap(all_points.__getitem__, slices)
    # just iterate over the polygons normally
    # each one will be a slice of the numpy array
    for polygon in polygons:
        draw_polygon(polygon)

You may find it best to deal with one polygon at a time: convert each polygon into a numpy array and do the vector operations on that. You will probably get a significant speed advantage just from that. Getting all of your data into numpy at once may be a little difficult.
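A minimal sketch of that per-polygon approach, assuming the GetShapeCoords API from the question and a hypothetical draw_polygon helper wrapping the PIL calls:

    import numpy

    for i in xrange(200000):
        for polygon in GetShapeCoords(i):            # each polygon: a list of (lon, lat) pairs
            pts = numpy.asarray(polygon, dtype=numpy.float32)
            pts[:, 0] *= 0.5                         # vectorized transform on just this polygon
            draw_polygon(pts)                        # hypothetical drawing helper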

This is harder than most numpy tasks because of your oddly shaped data; numpy pretty much assumes a world of homogeneous data.

+5

This will be faster:

    point_buffer = numpy.array(point_list, dtype=numpy.float32)

Modify the array, not the list. Obviously, it would be better to avoid creating the list in the first place, if possible.
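In other words, something along these lines (just a sketch, reusing GetMyPoints and the other names from the question):

    import numpy

    point_list = GetMyPoints()                                    # plain Python list of (lon, lat) pairs
    point_buffer = numpy.array(point_list, dtype=numpy.float32)   # one bulk conversion, no per-element loop
    point_buffer[:, 0] *= 0.5                                     # then modify the array in place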

Edit 1: profiling

Here is some test code demonstrating how efficiently numpy converts lists to arrays (it's good), and that my idea of packing the list into a buffer is only about comparable to what numpy does, not better.

    import timeit

    setup = '''
    import numpy
    import itertools
    import struct
    big_list = numpy.random.random((10000,2)).tolist()'''

    old_way = '''
    a = numpy.empty(( len(big_list), 2), numpy.float32)
    for i,e in enumerate(big_list):
        a[i] = e
    '''
    normal_way = '''
    a = numpy.array(big_list, dtype=numpy.float32)
    '''
    iter_way = '''
    chain = itertools.chain.from_iterable(big_list)
    a = numpy.fromiter(chain, dtype=numpy.float32)
    '''
    my_way = '''
    chain = itertools.chain.from_iterable(big_list)
    buffer = struct.pack('f'*len(big_list)*2,*chain)
    a = numpy.frombuffer(buffer, numpy.float32)
    '''

    for way in [old_way, normal_way, iter_way, my_way]:
        print timeit.Timer(way, setup).timeit(1)

results:

    0.22445492374
    0.00450378469941
    0.00523579114088
    0.00451488946237

Edit 2: Regarding the hierarchical nature of the data

If I understand correctly that the data is always a list of lists of lists (object -> polygon -> coord), then this would be my approach: reduce the data to the lowest dimension that makes a rectangular array (2D in this case), and track the indices of the higher-level branches in a separate array. This is essentially an implementation of Winston's idea of using numpy.fromiter with an itertools chain object. The only added idea is the branch indexing.

    import numpy, itertools

    # hierarchical list of lists of coord pairs
    polys = [numpy.random.random((n,2)).tolist() for n in [5,7,12,6]]

    # get the indices of the polygons:
    lengs = numpy.array([0]+[len(l) for l in polys])
    p_idxs = numpy.add.accumulate(lengs)

    # convert the flattened list to an array:
    chain = itertools.chain.from_iterable
    a = numpy.fromiter(chain(chain(polys)), dtype=numpy.float32).reshape(lengs.sum(), 2)

    # transform the coords
    a *= .5

    # get a transformed polygon (using the indices)
    def get_poly(n):
        i0 = p_idxs[n]
        i1 = p_idxs[n+1]
        return a[i0:i1]

    print 'poly2', get_poly(2)
    print 'poly0', get_poly(0)
+2

The point of using numpy arrays is to avoid as many loops as possible. Writing for loops yourself will lead to slow code, but with numpy arrays you can use predefined vectorized functions that are much faster (and easier!).

So, to convert a list to an array, you can use:

    point_buffer = np.array(point_list)

If the list contains elements such as (lat, lon), then it will be converted to an array with two columns.

With this numpy array, you can easily operate on all the elements at once. For example, to multiply the first element of each coordinate pair by 0.5, as in your question, you can simply do (assuming the first elements are, say, in the first column):

    point_buffer[:,0] * 0.5
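Note that the expression above produces a new array rather than changing point_buffer itself; a small sketch of both variants, with a few made-up coordinate pairs so it runs on its own:

    import numpy as np

    point_list = [(12.5, 41.9), (2.35, 48.85), (-0.13, 51.51)]   # made-up (lon, lat) pairs
    point_buffer = np.array(point_list)

    halved = point_buffer[:,0] * 0.5   # new array; point_buffer is left unchanged
    point_buffer[:,0] *= 0.5           # in place; point_buffer itself is modified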
+2

Source: https://habr.com/ru/post/1267025/

