Fastest way to grow a numpy numeric array

Requirements:

  • I need to assemble arbitrarily large arrays from the data.
  • I can guess the size (roughly 100-200 rows), but there is no guarantee that the guess will fit every time.
  • Once the array has grown to its final size, I need to do numeric computations on it, so I would prefer to eventually end up with a 2-D numpy array.
  • Speed is critical. As an example, for one of the 300 files, the update() method is called 45 million times (taking 150 s or so) and the finalize() method is called 500k times (taking a total of 106 s) ... about 250 s in total.

Here is my code:

    def __init__(self):
        self.data = []

    def update(self, row):
        self.data.append(row)

    def finalize(self):
        dx = np.array(self.data)

Other things I tried include the following code ... but this is waaaaay slower.

    class A:
        def __init__(self):
            self.data = np.array([])

        def update(self, row):
            self.data = np.append(self.data, row)

        def finalize(self):
            dx = np.reshape(self.data, newshape=(self.data.shape[0] / 5, 5))

Here is a sketch of how it gets called:

    for i in range(500000):
        ax = A()
        for j in range(200):
            ax.update([1, 2, 3, 4, 5])
        ax.finalize()
        # some processing on ax
5 answers

I tried a few different things, with timings.

    import numpy as np
  • The method you mention as being slow: (32.094 seconds)

        class A:
            def __init__(self):
                self.data = np.array([])

            def update(self, row):
                self.data = np.append(self.data, row)

            def finalize(self):
                return np.reshape(self.data, newshape=(self.data.shape[0] / 5, 5))

  • Plain Python list: (0.308 seconds)

        class B:
            def __init__(self):
                self.data = []

            def update(self, row):
                for r in row:
                    self.data.append(r)

            def finalize(self):
                return np.reshape(self.data, newshape=(len(self.data) / 5, 5))

  • An ArrayList-style growable array implemented in numpy: (0.362 seconds)

        class C:
            def __init__(self):
                self.data = np.zeros((100,))
                self.capacity = 100
                self.size = 0

            def update(self, row):
                for r in row:
                    self.add(r)

            def add(self, x):
                if self.size == self.capacity:
                    self.capacity *= 4
                    newdata = np.zeros((self.capacity,))
                    newdata[:self.size] = self.data
                    self.data = newdata
                self.data[self.size] = x
                self.size += 1

            def finalize(self):
                data = self.data[:self.size]
                return np.reshape(data, newshape=(len(data) / 5, 5))

And this is how I timed it:

    x = C()
    for i in xrange(100000):
        x.update([i])

So it looks like plain old Python lists are pretty good;)
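
If, as in the question, update() always receives a fixed-length row, the same doubling idea can copy each row with one vectorized slice assignment instead of a per-element Python loop. A minimal sketch of that variant (the class name D and the row_len parameter are mine, not part of the answer above):

    import numpy as np

    class D:
        """Growable flat buffer with amortized doubling; appends whole rows."""
        def __init__(self, row_len=5):
            self.row_len = row_len
            self.capacity = 100
            self.data = np.zeros((self.capacity,))
            self.size = 0

        def update(self, row):
            n = len(row)
            while self.size + n > self.capacity:
                # Double the backing buffer; copies are amortized O(1) per element.
                self.capacity *= 2
                newdata = np.zeros((self.capacity,))
                newdata[:self.size] = self.data[:self.size]
                self.data = newdata
            self.data[self.size:self.size + n] = row  # one vectorized copy
            self.size += n

        def finalize(self):
            # View of the used portion, reshaped to 2-D without copying.
            return self.data[:self.size].reshape(-1, self.row_len)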


np.append() copies all of the data in the array every time it is called, whereas a Python list over-allocates, growing its capacity by roughly a factor of 1.125 when it fills up. Appending to a list is fast, but a list uses more memory than an array. You can use the array module from the Python standard library if memory matters to you.

Here is a discussion of this topic:

How to create a dynamic array
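
A minimal sketch of that idea (my example, not from the answer above): collect values in a stdlib array, which stores raw C doubles instead of boxed Python objects, then hand the buffer to numpy once at the end.

    from array import array
    import numpy as np

    buf = array('d')  # contiguous C doubles; grows like a list but more compactly
    for i in range(100000):
        buf.append(float(i))

    # Wrap the existing buffer without copying, then reshape to 2-D.
    dx = np.frombuffer(buf, dtype=np.float64).reshape(-1, 5)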


Using the class declarations from Owen's post, here are revised timings that include the effect of finalize().

In short, I find that class C provides an implementation that is over 60x faster than the method in the original question. (Apologies for the wall of text.)

The file I used:

    #!/usr/bin/python
    import cProfile
    import numpy as np

    # ... class declarations here ...

    def test_class(f):
        x = f()
        for i in xrange(100000):
            x.update([i])
        for i in xrange(1000):
            x.finalize()

    for x in 'ABC':
        cProfile.run('test_class(%s)' % x)

And the resulting timings:

Class A:

    903005 function calls in 16.049 seconds

    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000   16.049   16.049 <string>:1(<module>)
    100000    0.139    0.000    1.888    0.000 fromnumeric.py:1043(ravel)
      1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
    100000    0.322    0.000   14.424    0.000 function_base.py:3466(append)
    100000    0.102    0.000    1.623    0.000 numeric.py:216(asarray)
    100000    0.121    0.000    0.298    0.000 numeric.py:286(asanyarray)
      1000    0.002    0.000    0.004    0.000 test.py:12(finalize)
         1    0.146    0.146   16.049   16.049 test.py:50(test_class)
         1    0.000    0.000    0.000    0.000 test.py:6(__init__)
    100000    1.475    0.000   15.899    0.000 test.py:9(update)
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    100000    0.126    0.000    0.126    0.000 {method 'ravel' of 'numpy.ndarray' objects}
      1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
    200001    1.698    0.000    1.698    0.000 {numpy.core.multiarray.array}
    100000   11.915    0.000   11.915    0.000 {numpy.core.multiarray.concatenate}

Class B:

    208004 function calls in 16.885 seconds

    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.001    0.001   16.885   16.885 <string>:1(<module>)
      1000    0.025    0.000   16.508    0.017 fromnumeric.py:107(reshape)
      1000    0.013    0.000   16.483    0.016 fromnumeric.py:32(_wrapit)
      1000    0.007    0.000   16.445    0.016 numeric.py:216(asarray)
         1    0.000    0.000    0.000    0.000 test.py:16(__init__)
    100000    0.068    0.000    0.080    0.000 test.py:19(update)
      1000    0.012    0.000   16.520    0.017 test.py:23(finalize)
         1    0.284    0.284   16.883   16.883 test.py:50(test_class)
      1000    0.005    0.000    0.005    0.000 {getattr}
      1000    0.001    0.000    0.001    0.000 {len}
    100000    0.012    0.000    0.012    0.000 {method 'append' of 'list' objects}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      1000    0.020    0.000    0.020    0.000 {method 'reshape' of 'numpy.ndarray' objects}
      1000   16.438    0.016   16.438    0.016 {numpy.core.multiarray.array}

Class C:

    204010 function calls in 0.244 seconds

    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000    0.244    0.244 <string>:1(<module>)
      1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
         1    0.000    0.000    0.000    0.000 test.py:27(__init__)
    100000    0.082    0.000    0.170    0.000 test.py:32(update)
    100000    0.087    0.000    0.088    0.000 test.py:36(add)
      1000    0.002    0.000    0.005    0.000 test.py:46(finalize)
         1    0.068    0.068    0.243    0.243 test.py:50(test_class)
      1000    0.000    0.000    0.000    0.000 {len}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
         6    0.001    0.000    0.001    0.000 {numpy.core.multiarray.zeros}

Class A's time is dominated by the updates, and class B's by the finalizes. Class C is robust in the face of both.


There is also a big performance difference in the function that you use for finalization. Consider the following code:

    import numpy as np
    from time import time

    N = 100000
    nruns = 5

    a = []
    for i in range(N):
        a.append(np.zeros(1000))

    print "start"

    b = []
    for i in range(nruns):
        s = time()
        c = np.vstack(a)
        b.append(time() - s)
    print "Timing version vstack      ", np.mean(b)

    b = []
    for i in range(nruns):
        s = time()
        c1 = np.reshape(a, (N, 1000))
        b.append(time() - s)
    print "Timing version reshape     ", np.mean(b)

    b = []
    for i in range(nruns):
        s = time()
        c2 = np.concatenate(a, axis=0).reshape(-1, 1000)
        b.append(time() - s)
    print "Timing version concatenate ", np.mean(b)

    print c.shape, c2.shape
    assert (c == c2).all()
    assert (c == c1).all()

Using concatenate turns out to be about twice as fast as vstack and more than ten times faster than reshape, presumably because reshape on a list of arrays must first funnel everything through np.array, while concatenate performs a single optimized copy.

    Timing version vstack       1.5774928093
    Timing version reshape      9.67419199944
    Timing version concatenate  0.669512557983

If you want to improve performance with list operations, have a look at the blist library. It is an optimized implementation of a Python list, along with other structures.

I haven't benchmarked it myself, but the results on their page look promising.
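
For reference, blist is intended as a drop-in replacement for list, so trying it is just a change of constructor. A hypothetical sketch, assuming the package is installed:

    from blist import blist

    data = blist()
    for i in range(100000):
        data.append(i)  # same API as a plain list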



