Track class-level Python 2.7.x object attributes to quickly create a numpy array

Let's say we have a list of class instances, all of which have an attribute that we know is a float - call it the x attribute. At various points in the program we want to extract a numpy array of all the x values to perform some analysis of the distribution of x. This extraction happens very frequently and has been identified as the slowest part of the program. Here is a very simple example illustrating what I mean:

    import numpy as np

    # Create example object with list of values
    class stub_object(object):
        def __init__(self, x):
            self.x = x

    # Define a list of these fake objects
    stubs = [stub_object(i) for i in range(10)]

    # ...much later, want to quickly extract a vector of this particular attribute:
    numpy_x_array = np.array([a_stub.x for a_stub in stubs])

This raises the question: is there a smart, fast way to track the "x" attribute across the stub_object instances in the stubs list, so that building "numpy_x_array" is faster than the process above?

Here's a rough idea of what I'm trying to pull off: can I create a numpy vector that is "global to the class", which stays up to date as the collection of objects is updated, and which I can grab efficiently whenever I want?

All I'm really looking for is a push in the right direction - keywords I can Google / search SO / the docs for are exactly what I'm after.

For what it's worth, I looked through these, which got me a little further, but not all the way there:

Others that I looked at were not so helpful:

(One option, of course, is to "simply" restructure the code so that, instead of a list of stub_object instances in stubs, there is one large object - something like stub_population - that holds the corresponding attributes in numpy lists and/or arrays, along with methods that act directly on the elements of those arrays. The downside is a lot of refactoring, and some loss of the abstraction and flexibility of modeling a stub_object as its own thing. I'd like to avoid that if there is a smart way to do it.)
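For illustration, a minimal sketch of what that restructuring could look like - the name stub_population and its methods are hypothetical, not existing code:

    import numpy as np

    class stub_population(object):
        """One big object holding every stub's x in a single numpy array."""

        def __init__(self, n):
            self.x = np.zeros(n, dtype=np.float64)

        def scale_x(self, factor):
            # Methods act directly on the array elements.
            self.x *= factor

    pop = stub_population(10)
    pop.x[:] = np.arange(10)   # set all x values at once
    numpy_x_array = pop.x      # "extraction" is now free: it is already an array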

Edit: I am using Python 2.7.x.

Edit 2: @hpaulj, your example was a big help - the answer is accepted.

Here's a very simple first pass at the example code above that does what I want. Very preliminary timings suggest something like an order-of-magnitude speedup, without any significant restructuring of the code base. Excellent. Thanks!

    import numpy as np

    size = 20

    # Create example object with list of values
    class stub_object(object):
        _x = np.zeros(size, dtype=np.float64)

        def __init__(self, x, i):
            # A quick cop-out instead of expanding the array:
            if i >= len(self._x):
                raise Exception, "Index i = " + str(i) + " is larger than allowable object size of len(self._x) = " + str(len(self._x))
            self.x = self._x[i:i+1]  # one-element view into the shared class array
            self.set_x(x)

        def get_x(self):
            return self.x[0]

        def set_x(self, x_new):
            self.x[0] = x_new

    # Examine:
    # Define a list of these fake objects
    stubs = [stub_object(x=i**2, i=i) for i in range(size)]

    # ...much later, want to quickly extract a vector of this particular attribute:
    # numpy_x_array = np.array([a_stub.x for a_stub in stubs])
    # Now can do:
    numpy_x_array = stub_object._x
    # or, if the list is what we have in hand:
    numpy_x_array = stubs[0]._x

I'm not using properties yet, but I really like that idea, and it should go a long way toward leaving the rest of the code essentially unchanged.
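For reference, a rough property-based variant of the same idea - the names stub_object_prop and _x_view are mine, purely for illustration, and this is untested against the real code:

    import numpy as np

    size = 20

    class stub_object_prop(object):
        _x = np.zeros(size, dtype=np.float64)

        def __init__(self, x, i):
            if i >= len(self._x):
                raise Exception("Index i = %d exceeds storage size %d" % (i, len(self._x)))
            self._x_view = self._x[i:i+1]  # one-element view into the shared class array
            self.x = x                     # goes through the property setter below

        def get_x(self):
            return self._x_view[0]

        def set_x(self, x_new):
            self._x_view[0] = x_new

        x = property(get_x, set_x)

    stubs = [stub_object_prop(x=i**2, i=i) for i in range(size)]
    stubs[3].x = 42.0                 # plain attribute syntax, but writes into the shared array
    print(stub_object_prop._x[3])     # -> 42.0

With the property in place, calling code keeps reading and writing a_stub.x exactly as before.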

1 answer

The main problem is that your values are scattered around memory, with one attribute in each object's dictionary. For array speed, though, the values need to be stored in one contiguous data buffer.

This has come up in other SO questions, but the ones you found are as good as any I've studied. In any case, I don't have much to add to

 np.array([a_stub.x for a_stub in stubs]) 

Alternatives using itertools or fromiter shouldn't change the speed much, because the time is spent in the a_stub.x attribute access, not in the iteration mechanism. You can verify that by timing something simpler, like

    np.array([1 for _ in range(len(stubs))])
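A rough way to check that, assuming the stub_object list from the question - exact numbers will vary, this is just a timeit sketch:

    import timeit

    setup = """
    import numpy as np
    class stub_object(object):
        def __init__(self, x):
            self.x = x
    stubs = [stub_object(float(i)) for i in range(10000)]
    """

    # Attribute access plus iteration:
    print(timeit.timeit("np.array([a_stub.x for a_stub in stubs])", setup=setup, number=100))
    # Iteration alone, with a constant in place of the attribute lookup:
    print(timeit.timeit("np.array([1 for _ in range(len(stubs))])", setup=setup, number=100))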

I suspect the best option is to use one or more arrays as the primary storage, and to refactor your class so that the attribute is fetched from that storage.

If you know that you will have 10 objects, create an empty array of that size up front. When you create an object, assign it a unique index. The x attribute can then be a property whose getter/setter refer to the data[i] element of that array. By making x a property instead of a primary attribute, you should be able to keep most of the object machinery intact. And you can experiment with different storage schemes by changing just a few methods.

I've tried to sketch this below, using a class attribute as the primary array storage, though there may still be some bugs.


A class with an x attribute that accesses an array:

    class MyObj(object):
        xdata = np.zeros(10)

        def __init__(self, idx, x):
            self._idx = idx
            self.set_x(x)

        def set_x(self, x):
            self.xdata[self._idx] = x

        def get_x(self):
            return self.xdata[self._idx]

        def __repr__(self):
            return "<obj>x=%s" % self.get_x()

        x = property(get_x, set_x)

    In [67]: objs = [MyObj(i, 3*i) for i in range(10)]
    In [68]: objs
    Out[68]: [<obj>x=0.0, <obj>x=3.0, <obj>x=6.0, ... <obj>x=27.0]
    In [69]: objs[3].x
    Out[69]: 9.0
    In [70]: objs[3].xdata
    Out[70]: array([  0.,   3.,   6.,   9.,  12.,  15.,  18.,  21.,  24.,  27.])
    In [71]: objs[3].xdata += 3
    In [72]: [o.x for o in objs]
    Out[72]: [3.0, 6.0, 9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]

Modifying the array in place is easiest. But you can also replace the array itself (and thus "grow" the set of objects):

    In [79]: MyObj.xdata = np.ones((20,))
    In [80]: a = MyObj(11, 25)
    In [81]: a
    Out[81]: <obj>x=25.0
    In [82]: MyObj.xdata
    Out[82]: array([  1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,  25.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.])
    In [83]: [o.x for o in objs]
    Out[83]: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

We must be careful about attribute modification. For example, I tried

 objs[3].xdata += 3 

intending to change xdata for the whole class. But that ended up assigning a separate xdata attribute on just that one instance. It should also be possible to assign the object index automatically at creation time (these days I'm more familiar with numpy methods than with Python class structures).
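Two small tweaks might address both points - a sketch only, with names (MyObj2, _counter) that are mine rather than from the answer above:

    import itertools
    import numpy as np

    class MyObj2(object):
        xdata = np.zeros(10)
        _counter = itertools.count()   # hands out the next free index automatically

        def __init__(self, x):
            self._idx = next(self._counter)
            self.set_x(x)

        def set_x(self, x):
            self.xdata[self._idx] = x

        def get_x(self):
            return self.xdata[self._idx]

        x = property(get_x, set_x)

    objs = [MyObj2(3 * i) for i in range(10)]

    # To shift every object's x, modify the shared storage at class level
    # (or assign into a slice), rather than augmented-assigning via an instance:
    MyObj2.xdata += 3
    # or: MyObj2.xdata[:] = MyObj2.xdata + 3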


If I replace the getter with one that returns a slice instead:

        def get_x(self):
            return self.xdata[self._idx:self._idx+1]

    In [107]: objs = [MyObj(i, i*3) for i in range(10)]
    In [109]: objs
    Out[109]: [<obj>x=[ 0.], <obj>x=[ 3.], ... <obj>x=[ 27.]]

np.info (or .__array_interface__) gives information about the xdata array, including the pointer to its data buffer:

    In [110]: np.info(MyObj.xdata)
    class:  ndarray
    shape:  (10,)
    strides:  (8,)
    itemsize:  8
    aligned:  True
    contiguous:  True
    fortran:  True
    data pointer: 0xabf0a70
    byteorder:  little
    byteswap:  False
    type: float64

The slice for the first object points to the same place:

    In [111]: np.info(objs[0].x)
    class:  ndarray
    shape:  (1,)
    strides:  (8,)
    itemsize:  8
    ....
    data pointer: 0xabf0a70
    ...

The next object points to the next float (8 bytes further along):

    In [112]: np.info(objs[1].x)
    class:  ndarray
    shape:  (1,)
    ...
    data pointer: 0xabf0a78
    ....
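As a quick programmatic check (not from the original session), the same pointer is available via __array_interface__, so the adjacency can be asserted directly - assuming the slice-returning getter above and that each object's index matches its position in objs:

    base = MyObj.xdata.__array_interface__['data'][0]
    for k, o in enumerate(objs):
        addr = o.x.__array_interface__['data'][0]
        assert addr == base + 8 * k   # each view starts one float64 (8 bytes) further along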

I'm not sure whether this slice/view style of access is worth the trouble or not.


Source: https://habr.com/ru/post/1267018/

