Python / cython / numpy np.nonzero optimization

Question

Python / cython / numpy np.nonzero optimization

I have a piece of code that I'm trying to optimize. Most of the code execution time is accepted cdef np.ndarray index = np.argwhere(array==1) Where an array is numpy - an array of zeros and ones 512x512,512 points. Any thoughts on speeding this up? Using Python 2.7, Numpy 1.8.1

Sphericity function

 def sphericity(self,array): #Pass an mask array (1 are marked, 0 ignored) cdef np.ndarray index = np.argwhere(array==1) cdef int xSize,ySize,zSize xSize,ySize,zSize=array.shape cdef int sa,vol,voxelIndex,x,y,z,neighbors,xDiff,yDiff,zDiff,x1,y1,z1 cdef float onethird,twothirds,sp sa=vol=0 #keep running tally of volume and surface area #cdef int nonZeroCount = (array != 0).sum() #Replaces np.count_nonzero(array) for speed for voxelIndex in range(np.count_nonzero(array)): #for voxelIndex in range(nonZeroCount): x=index[voxelIndex,0] y=index[voxelIndex,1] z=index[voxelIndex,2] #print x,y,z,array[x,y,z] neighbors=0 vol+=1 for xDiff in [-1,0,1]: for yDiff in [-1,0,1]: for zDiff in [-1,0,1]: if abs(xDiff)+abs(yDiff)+abs(zDiff)==1: x1=x+xDiff y1=y+yDiff z1=z+zDiff if x1>=0 and y1>=0 and z1>=0 and x1<xSize and y1<ySize and z1<zSize: #print '-',x1,y1,z1,array[x1,y1,z1] if array[x1,y1,z1]: #print '-',x1,y1,z1,array[x1,y1,z1] neighbors+=1 #print 'had this many neighbors',neighbors sa+=(6-neighbors) onethird=float(1)/float(3) twothirds=float(2)/float(3) sph = ((np.pi**onethird)*((6*vol)**twothirds)) / sa #print 'sphericity',sphericity return sph

Profiling test

 #Imports import pstats, cProfile import numpy as np import pyximport pyximport.install(setup_args={"script_args":["--compiler=mingw32"], "include_dirs":np.get_include()}, reload_support=True) #Generate cython version #Create fake array to calc sphericity fakeArray=np.zeros((512,512,512)) fakeArray[200:300,200:300,200:300]=1 #Profiling stuff cProfile.runctx("sphericity(fakeArray)", globals(), locals(), "Profile.prof") s = pstats.Stats("Profile.prof") s.strip_dirs().sort_stats("time").print_stats()

Profiling Output

 Mon Oct 06 11:49:57 2014 Profile.prof 12 function calls in 4.373 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1 3.045 3.045 4.373 4.373 <string>:1(<module>) 1 1.025 1.025 1.025 1.025 {method 'nonzero' of 'numpy.ndarray' objects} 2 0.302 0.151 0.302 0.151 {numpy.core.multiarray.array} 1 0.001 0.001 1.328 1.328 numeric.py:731(argwhere) 1 0.000 0.000 0.302 0.302 fromnumeric.py:492(transpose) 1 0.000 0.000 0.302 0.302 fromnumeric.py:38(_wrapit) 1 0.000 0.000 0.000 0.000 {method 'transpose' of 'numpy.ndarray' objects} 1 0.000 0.000 0.302 0.302 numeric.py:392(asarray) 1 0.000 0.000 0.000 0.000 numeric.py:462(asanyarray) 1 0.000 0.000 0.000 0.000 {getattr} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}

+5

python numpy cython

Dbricks Oct 6 '14 at 18:41

source share

2 answers

You can get most of what your code is trying to do from vanilla numpy, without the need for Cython. The main thing is to get an effective way of counting neighbors, what you can do and taken fragments of the mask obtained from your input array. Combining all of this, I think the following does the same as your code, but with much less repetition:

 def sphericity(arr): mask = arr != 0 vol = np.count_nonzero(mask) counts = np.zeros_like(arr, dtype=np.intp) for dim, size in enumerate(arr.shape): slc = (slice(None),) * dim axis_mask = (mask[slc + (slice(None, -1),)] & mask[slc + (slice(1, None),)]) counts[slc + (slice(None, -1),)] += axis_mask counts[slc + (slice(1, None),)] += axis_mask sa = np.sum(6 - counts[counts != 0]) return np.pi**(1./3.)*(6*vol)**(2./3.) / sa

+6

Jaime Oct 6 '14 at 23:19

source share

Blake walsh · Accepted Answer · 2014-10-07T02:20:30+0000

Jaime may have given a good answer, but I will comment on the improvement of the Cython code and the addition of performance comparison.

First, you must use the annotate function, cython -a filename.pyx , this will create an HTML file. Download this in your browser and highlight the “slow” lines in yellow-orange, this indicates where you can improve.

In the annotate, two things are immediately discovered that are easily fixed:

Convert idioms into things cython understands

First, these lines are slow:

  for xDiff in [-1,0,1]: for yDiff in [-1,0,1]: for zDiff in [-1,0,1]:

The reason for this is that Cython does not know how to convert list iteration to pure c-code. It should be turned into equivalent code that Cython can optimize, namely the “in range” form:

  for xDiff in range(-1, 2): for yDiff in range(-1, 2): for zDiff in range(-1, 2):

Array Type for Quick Indexing

The next thing this line is slow:

  if array[x1,y1,z1]:

The reason for this is that array not a type. Because of this, it uses python level indexing rather than level indexing. To fix this, you need to specify an array of type, this can be done as follows:

 def sphericity(np.ndarray[np.uint8_t, ndim=3] array):

It is assumed that the array type is "uint8", replace it with the appropriate type (Note: Cython does not support the type "np.bool", so I use "uint8")

You can also use a memory view, you cannot use numpy functions in a memory view, but you can create a view on an array and then index the view instead of an array:

  cdef np.uint8_t array_view [:, :, :] = array ... if array_view[x1,y1,z1]:

It looks like the kind of memory will be a little faster and make a clear separation between the array (python level calls) and browsing (c level calls). If you do not use numpy functions, you can use memory without problems.

Modify the code to avoid multiple passes through the array

It remains that the calculations of index and nonZeroCount slow, for various reasons, but mainly only refers to the pure size of the data (in fact, iterating over 512 * 512 * 512 elements takes time!) Generally speaking, something that Numpy can do optimized Cython can do faster (usually 2-10 times faster) - numpy just saves you a lot of reinventing the wheel and typing a lot and allows you to think at a higher level (and if you're not also an AC programmer, you might not be able to optimize cython good enough). But in this case it’s easy, you can simply exclude index and nonZeroCount and all related code and just do it:

  for x in range(0, xSize): for y in range(0, ySize): for z in range(0, zSize): if array[x,y,z] == 0: continue ...

This is very fast, since c (which is pure Cython compiles to an impeccable level) has no problem doing billions of operations per second. Excluding the index and nonZeroCount , you essentially save two whole iterations over the entire array, which even at maximum speed require a minimum of about 0.1 seconds each. Even more important is processor caching, the entire array is 128 MB, much larger than the cache processor, so all in one pass uses the processor cache better (multiple passes will not matter much if the arrays fit fully into the CPU cache).

Optimized version

Here is the complete code for my optimized version:

 #cython: boundscheck=False, nonecheck=False, wraparound=False import numpy as np cimport numpy as np def sphericity2(np.uint8_t [:, :, :] array): #Pass an mask array (1 are marked, 0 ignored) cdef int xSize,ySize,zSize xSize=array.shape[0] ySize=array.shape[1] zSize=array.shape[2] cdef int sa,vol,x,y,z,neighbors,xDiff,yDiff,zDiff,x1,y1,z1 cdef float onethird,twothirds,sp sa=vol=0 #keep running tally of volume and surface area for x in range(0, xSize): for y in range(0, ySize): for z in range(0, zSize): if array[x,y,z] == 0: continue neighbors=0 vol+=1 for xDiff in range(-1, 2): for yDiff in range(-1, 2): for zDiff in range(-1, 2): if abs(xDiff)+abs(yDiff)+abs(zDiff)==1: x1=x+xDiff y1=y+yDiff z1=z+zDiff if x1>=0 and y1>=0 and z1>=0 and x1<xSize and y1<ySize and z1<zSize: #print '-',x1,y1,z1,array[x1,y1,z1] if array[x1,y1,z1]: #print '-',x1,y1,z1,array[x1,y1,z1] neighbors+=1 #print 'had this many neighbors',neighbors sa+=(6-neighbors) onethird=float(1)/float(3) twothirds=float(2)/float(3) sph = ((np.pi**onethird)*((6*vol)**twothirds)) / sa #print 'sphericity',sphericity return sph

Sphericity Runtime Comparison:

  Original: 2.123s
 Jaime: 1.819s
 Optimized Cython: 0.136s
 @ moarningsun: 0.090s

Throughout the solution, Cython runs more than 15 times faster, with deployed internal loops (see comment), it runs more than 23 times.

Python / cython / numpy np.nonzero optimization

Convert idioms into things cython understands

Array Type for Quick Indexing

Modify the code to avoid multiple passes through the array

Optimized version

More articles: