Python performance: iteration and operations on nested lists

Problem

Hi, I'm looking for Python performance recommendations. Some background on my problem:

Considering:

  • An (x, y) grid of nodes, each holding a value in (0...255), initialized to 0
  • A list of N input coordinates, each specifying a location in the range (0...x, 0...y)
  • A value Z that defines the size, in nodes, of the "neighborhood"

Increment the value of the node at each input coordinate and of that node's neighbors. Neighbors outside the edge of the grid are ignored. (No wrap-around.)

BASE CASE: a 1024x1024 grid of nodes, 400 input coordinates, and a neighborhood range Z of 75 nodes.

Processing is roughly O(N*Z^2), since each input touches a window of up to (2Z)x(2Z) nodes, plus O(x*y) to initialize the grid. I expect x, y, and Z to stay roughly at the base-case values, but the number of input coordinates N can grow to 100,000. My goal is to minimize processing time.
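As a sanity check on that estimate, here is a small back-of-envelope sketch (base-case numbers assumed from above) showing how the update count scales with N:

```python
# Rough work estimate for the neighborhood-update algorithm.
# Each input touches at most a (2Z) x (2Z) window of nodes, so the
# dominant cost is N * (2Z)^2 updates (edge clipping ignored).
x, y, z = 1024, 1024, 75          # base-case grid and neighborhood
window = (2 * z) ** 2             # nodes touched per input coordinate

for n in (400, 100000):
    updates = n * window
    print("N=%d -> ~%d updates" % (n, updates))
```

At N = 400 that is about 9 million element updates; at N = 100,000 it is about 2.25 billion, which is why per-element Python overhead dominates.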

Current Results

Between my initial attempt and the comments and answers below, we now have several implementations.

Timings on my 2.26 GHz Intel Core 2 Duo with Python 2.6.1:

    f1:  2.819s
    f2:  1.567s
    f3:  1.593s
    f:   1.579s
    f3b: 1.526s
    f4:  0.978s

f1 is the initial naive implementation: three nested for loops. f2 replaces the inner for loop with a list comprehension. f3, based on Andrei's suggestion in the comments, replaces the outer for with map(). f is ChristopheD's suggestion from the answers below. f3b is kriss's take on f3. f4 is Alex's contribution.

The code for each is below, for your reading pleasure.

Question

How can I reduce the processing time further? I'd prefer sub-1.0 s for the test parameters.

Please keep recommendations to native Python. I know I could switch to a third-party package such as numpy, but I'm trying to avoid third-party packages. Also, I generate random input coordinates and have simplified the node-value update rule to keep the discussion simple; the real specifics differ a bit and are beyond the scope of my question.

Thanks a lot!


f1 is the initial naive implementation: three nested for loops.
    def f1(x, y, n, z):
        rows = [[0] * y for i in xrange(x)]
        for i in range(n):
            inputX, inputY = (int(x * random.random()), int(y * random.random()))
            topleft = (inputX - z, inputY - z)
            for u in xrange(max(0, topleft[0]), min(topleft[0] + (z * 2), x)):
                for v in xrange(max(0, topleft[1]), min(topleft[1] + (z * 2), y)):
                    if rows[u][v] < 255:
                        rows[u][v] += 1

f2 replaces the inner for loop with list comprehension.

    def f2(x, y, n, z):
        rows = [[0] * y for i in xrange(x)]
        for i in range(n):
            inputX, inputY = (int(x * random.random()), int(y * random.random()))
            topleft = (inputX - z, inputY - z)
            for u in xrange(max(0, topleft[0]), min(topleft[0] + (z * 2), x)):
                l = max(0, topleft[1])
                r = min(topleft[1] + (z * 2), y)
                rows[u][l:r] = [j + (j < 255) for j in rows[u][l:r]]
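For intuition, here is a tiny sketch on toy data showing that the slice-assignment form in f2 computes exactly what the inner loop of f1 did:

```python
# Same saturating update two ways: per-element loop vs. one slice assignment.
row_loop = [254, 255, 10, 0]
for j in range(len(row_loop)):
    if row_loop[j] < 255:          # saturate at 255
        row_loop[j] += 1

row_slice = [254, 255, 10, 0]
# (v < 255) is a bool, which adds as 0 or 1 -- the trick used in f2.
row_slice[:] = [v + (v < 255) for v in row_slice]

print(row_loop, row_slice)         # both [255, 255, 11, 1]
```

The slice form wins because the comprehension runs its loop at interpreter speed once, instead of executing several bytecodes per element in Python-level loop machinery.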

UPDATE: f3, based on Andrei's suggestion in the comments, replaces the outer for with map(). My first hack required several lookups outside the local scope, exactly what Guido recommends against: local variable lookups are much faster than global or builtin lookups. I hard-coded everything except the reference to the main data structure to minimize that overhead.

    rows = [[0] * 1024 for i in xrange(1024)]

    def g(input):
        inputX, inputY = input
        topleft = (inputX - 75, inputY - 75)
        for i in xrange(max(0, topleft[0]), min(topleft[0] + (75 * 2), 1024)):
            l = max(0, topleft[1])
            r = min(topleft[1] + (75 * 2), 1024)
            rows[i][l:r] = [j + (j < 255) for j in rows[i][l:r]]

    def f3(x, y, n, z):
        inputs = [(int(x * random.random()), int(y * random.random())) for i in range(n)]
        map(g, inputs)
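The local-lookup effect Guido describes can be sketched like this (exact timings vary by machine and interpreter version, so treat the numbers as illustrative only):

```python
import timeit

def use_builtin(n):
    s = 0
    for i in range(n):
        s = min(s, i)      # builtin lookup of `min` on every iteration
    return s

def use_local(n):
    m = min                # bind the builtin to a local name once
    s = 0
    for i in range(n):
        s = m(s, i)        # cheap local lookup inside the loop
    return s

t_builtin = timeit.timeit(lambda: use_builtin(10000), number=20)
t_local = timeit.timeit(lambda: use_local(10000), number=20)
print(t_builtin, t_local)  # t_local is usually slightly smaller
```

This is the same idea as the `rn = random.random` and `inc = (1).__add__` bindings in the later variants.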

UPDATE3: ChristopheD also pointed out several improvements.

    def f(x, y, n, z):
        rows = [[0] * y for i in xrange(x)]
        rn = random.random
        for i in xrange(n):
            topleft = (int(x * rn()) - z, int(y * rn()) - z)
            l = max(0, topleft[1])
            r = min(topleft[1] + (z * 2), y)
            for u in xrange(max(0, topleft[0]), min(topleft[0] + (z * 2), x)):
                rows[u][l:r] = [j + (j < 255) for j in rows[u][l:r]]

UPDATE4: kriss added several improvements to f3, replacing min/max with the (then-new) conditional expression syntax, a if cond else b.

    def g1(px, py, z):
        l = py - z if py - z > 0 else 0
        r = py + z if py + z < 1024 else 1024
        for i in xrange(px - z if px - z > 0 else 0, px + z if px + z < 1024 else 1024):
            rows[i][l:r] = [j + (j < 255) for j in rows[i][l:r]]

    def f3b(x, y, n, z):
        rn = random.random
        [g1(px, py, z) for px, py in [(int(x * rn()), int(y * rn())) for i in xrange(n)]]
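The two clamping styles can be checked against each other with a quick sketch (grid size 1024 assumed, matching the constant hard-coded in f3b):

```python
GRID = 1024

def clamp_minmax(c, z):
    # the min/max form used in f and f4
    return max(0, c - z), min(c + z, GRID)

def clamp_ternary(c, z):
    # the conditional-expression form used in f3b
    lo = c - z if c - z > 0 else 0
    hi = c + z if c + z < GRID else GRID
    return lo, hi

# the two forms agree everywhere on the grid
for c in range(0, GRID, 7):
    assert clamp_minmax(c, 75) == clamp_ternary(c, 75)
print(clamp_minmax(0, 75), clamp_minmax(1000, 75))  # (0, 75) (925, 1024)
```

The ternary form avoids a function call per clamp, which is where its small but consistent speedup comes from.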

UPDATE5: f4 is Alex's significant rework, which adds a separate map() pass to cap the values at 255 and removes all non-local lookups. The resulting speedup is nontrivial.

    def f4(x, y, n, z):
        rows = [[0] * y for i in range(x)]
        rr = random.randrange
        inc = (1).__add__
        sat = (0xff).__and__
        for i in range(n):
            inputX, inputY = rr(x), rr(y)
            b = max(0, inputX - z)
            t = min(inputX + z, x)
            l = max(0, inputY - z)
            r = min(inputY + z, y)
            for u in range(b, t):
                rows[u][l:r] = map(inc, rows[u][l:r])
        for i in range(x):
            rows[i] = map(sat, rows[i])
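A note on the two bound-method tricks in f4, sketched on toy data: (0xff).__and__ masks rather than clamps, so a node incremented past 255 wraps around to 0. That is only equivalent to "cap at 255" if no node is ever hit more than 255 times, which may stop holding once N grows toward 100,000.

```python
inc = (1).__add__       # inc(v) == 1 + v, a pre-bound addition
sat = (0xff).__and__    # sat(v) == v & 0xff -- a MASK, not a clamp

row = [0, 10, 254, 255]
row = list(map(inc, row))   # [1, 11, 255, 256]
row = list(map(sat, row))   # [1, 11, 255, 0]  <- 256 wrapped to 0
print(row)
```

The payoff is that map() with a pre-bound C-level callable does the whole row with no Python-level loop body at all.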

Also, since we all seem to be hacking on variants, here is my test harness for comparing speeds (improved by ChristopheD):

    def timing(f, x, y, n, z):
        fn = "%s(%d,%d,%d,%d)" % (f.__name__, x, y, n, z)
        ctx = "from __main__ import %s" % f.__name__
        results = timeit.Timer(fn, ctx).timeit(10)
        return "%4.4s: %.3f" % (f.__name__, results / 10.0)

    if __name__ == "__main__":
        print timing(f, 1024, 1024, 400, 75)
        # add more here.
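For readers on Python 3, an equivalent harness might look like the sketch below (f_demo is a hypothetical stand-in for f1..f4, which all share the (x, y, n, z) signature):

```python
import timeit

def f_demo(x, y, n, z):
    # hypothetical stand-in for the functions under test: just builds the grid
    rows = [[0] * y for _ in range(x)]
    return rows

def timing(func, x, y, n, z, repeat=10):
    # timeit accepts a zero-argument callable, so no string/import context needed
    total = timeit.timeit(lambda: func(x, y, n, z), number=repeat)
    return "%s: %.3f" % (func.__name__, total / repeat)

print(timing(f_demo, 64, 64, 10, 5))
```

Passing a callable instead of a source string avoids the `from __main__ import ...` setup dance entirely.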
5 answers

On my (slow, first-generation) MacBook Air, 1.6 GHz Core 2 Duo, Python 2.5 on Mac OS X 10.5, after saving the code in op.py I see the following timings:

    $ python -mtimeit -s'import op' 'op.f1()'
    10 loops, best of 3: 5.58 sec per loop
    $ python -mtimeit -s'import op' 'op.f2()'
    10 loops, best of 3: 3.15 sec per loop

So my machine is slower than yours by a factor of roughly 2.

The fastest code for this task:

    def f3(x=x, y=y, n=n, z=z):
        rows = [[0] * y for i in range(x)]
        rr = random.randrange
        inc = (1).__add__
        sat = (0xff).__and__
        for i in range(n):
            inputX, inputY = rr(x), rr(y)
            b = max(0, inputX - z)
            t = min(inputX + z, x)
            l = max(0, inputY - z)
            r = min(inputY + z, y)
            for u in range(b, t):
                rows[u][l:r] = map(inc, rows[u][l:r])
        for i in range(x):
            rows[i] = map(sat, rows[i])

which times at:

    $ python -mtimeit -s'import op' 'op.f3()'
    10 loops, best of 3: 3 sec per loop

So, only a very modest speedup, which projects to more than 1.5 seconds on your machine, still well above the 1.0 s you're aiming for :-(.

With a simple C-coded extension, exte.c...:

    #include "Python.h"

    static PyObject*
    dopoint(PyObject* self, PyObject* args)
    {
        int x, y, z, px, py;
        int b, t, l, r;
        int i, j;
        PyObject* rows;
        if (!PyArg_ParseTuple(args, "iiiiiO", &x, &y, &z, &px, &py, &rows))
            return 0;
        b = px - z; if (b < 0) b = 0;
        t = px + z; if (t > x) t = x;
        l = py - z; if (l < 0) l = 0;
        r = py + z; if (r > y) r = y;
        for (i = b; i < t; ++i) {
            PyObject* row = PyList_GetItem(rows, i);
            for (j = l; j < r; ++j) {
                PyObject* pyitem = PyList_GetItem(row, j);
                long item = PyInt_AsLong(pyitem);
                if (item < 255) {
                    PyObject* newitem = PyInt_FromLong(item + 1);
                    PyList_SetItem(row, j, newitem);
                }
            }
        }
        Py_RETURN_NONE;
    }

    static PyMethodDef exteMethods[] = {
        {"dopoint", dopoint, METH_VARARGS, "process a point"},
        {0}
    };

    void
    initexte()
    {
        Py_InitModule("exte", exteMethods);
    }

(Note: I haven't checked it carefully. I believe it doesn't leak memory, thanks to the proper interplay of reference stealing and borrowing, but the code should be reviewed carefully before going into production ;-).) With it, we can do:

    import exte

    def f4(x=x, y=y, n=n, z=z):
        rows = [[0] * y for i in range(x)]
        rr = random.randrange
        for i in range(n):
            inputX, inputY = rr(x), rr(y)
            exte.dopoint(x, y, z, inputX, inputY, rows)

and the timing:

    $ python -mtimeit -s'import op' 'op.f4()'
    10 loops, best of 3: 345 msec per loop

shows a speedup of 8-9 times, which should put you in the ballpark you want. I saw your comment saying you don't want a third-party extension but, hey, this is a tiny extension you could make entirely your own ;-). ((I don't know what licensing terms apply to code on Stack Overflow, but I'd be happy to re-release this under the Apache 2 license or the like, if you need that ;-)).


1. A (smaller) speedup can definitely come from the initialization of your rows...

Replace

    rows = []
    for i in range(x):
        rows.append([0 for i in xrange(y)])

with

 rows = [[0] * y for i in xrange(x)] 
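One caveat worth flagging alongside this initializer (a small sketch, not code from the question): the superficially similar [[0] * y] * x shares a single row object x times, so it must not be used here.

```python
x, y = 3, 4
good = [[0] * y for _ in range(x)]   # x distinct row lists
bad = [[0] * y] * x                  # ONE row list, referenced x times

good[0][0] = 1
bad[0][0] = 1                        # "writes" to every row of bad
print(good[1][0], bad[1][0])         # 0 1
```

The comprehension form is both the fast and the correct choice for a mutable grid.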

2. You can also avoid some lookups by moving random.random out of the loops (saves a bit).

3. EDIT: after the corrections, you can come up with something like this:

    def f(x, y, n, z):
        rows = [[0] * y for i in xrange(x)]
        rn = random.random
        for i in xrange(n):
            topleft = (int(x * rn()) - z, int(y * rn()) - z)
            l = max(0, topleft[1])
            r = min(topleft[1] + (z * 2), y)
            for u in xrange(max(0, topleft[0]), min(topleft[0] + (z * 2), x)):
                rows[u][l:r] = [j + (j < 255) for j in rows[u][l:r]]

EDIT: some new timings with timeit (10 runs); these changes seem to provide only minor speedups:

    import timeit
    print timeit.Timer("f1(1024,1024,400,75)", "from __main__ import f1").timeit(10)
    print timeit.Timer("f2(1024,1024,400,75)", "from __main__ import f2").timeit(10)
    print timeit.Timer("f(1024,1024,400,75)", "from __main__ import f").timeit(10)
    f1: 21.1669280529
    f2: 12.9376120567
    f:  11.1249599457

In your f3 rewrite, g can be simplified. (This may also apply to f4.)

You have the following code inside a for loop.

    l = max(0, topleft[1])
    r = min(topleft[1] + (75 * 2), 1024)

However, these values never change inside the for loop, so compute them once, outside the loop.
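The hoisting can be sketched as follows (window_rows is a hypothetical helper written for illustration, not code from the question):

```python
def window_rows(topleft, z, x, y):
    # l and r depend only on topleft[1], z and y, none of which change
    # across rows, so compute them once before the row loop.
    l = max(0, topleft[1])
    r = min(topleft[1] + 2 * z, y)
    for i in range(max(0, topleft[0]), min(topleft[0] + 2 * z, x)):
        yield i, l, r            # same (l, r) for every row i

win = list(window_rows((-2, -3), 2, 10, 10))
print(win)                       # [(0, 0, 1), (1, 0, 1)]
```

This removes two max/min calls per row of the neighborhood, which adds up over N inputs.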


Based on your f3, I played with the code. Since l and r are constant for a given input, you can avoid computing them inside g1's loop. Using the new conditional expression (a if cond else b) instead of min and max also seems consistently faster, and I simplified the topleft expression away. On my system the code below is about 20% faster.

    def g1(px, py, z):
        l = py - z if py - z > 0 else 0
        r = py + z if py + z < 1024 else 1024
        for i in xrange(px - z if px - z > 0 else 0, px + z if px + z < 1024 else 1024):
            rows[i][l:r] = [j + (j < 255) for j in rows[i][l:r]]

    def f3b(x, y, n, z):
        [g1(px, py, z) for px, py in [(int(x * random.random()), int(y * random.random())) for i in range(n)]]

You can write your own Python extension module in C and tune performance as you see fit: http://docs.python.org/extending/


Source: https://habr.com/ru/post/1304711/
