I have three lists: old, new and ignored. old and new - line lists. ignore is a list of indexes that should be ignored if they do not match. The goal is to create a list of indexes that are different and not ignored.
old and new may contain a different number of elements. If the difference in size is between the old and the new, the difference should be marked as inappropriate (if not ignored).
My current function is as follows:
def CompareFields( old, new, ignore ): if ( old == None ): if ( new == None ): return []; else: return xrange( len(new) ) elif ( new == None ): return xrange( len(old) ) oldPadded = itertools.chain( old, itertools.repeat(None) ) newPadded = itertools.chain( new, itertools.repeat(None) ) comparisonIterator = itertools.izip( xrange( max( len(old ) , len( new ) ) ), oldPadded, newPadded ) changedItems = [ i for i,lhs,rhs in comparisonIterator if lhs != rhs and i not in ignore ] return changedItems
The timings of the various options I tried give the following timings for 100,000 runs:
[4, 9] CompareFields: 6.083546 set([9, 4]) Set based: 12.594869 [4, 9] Function using yield: 13.063725 [4, 9] Use a (precomputed) ignore bitmap: 7.009405 [4, 9] Use a precomputed ignore bitmap and give a limit to itertools.repeat(): 8.297951 [4, 9] Use precomputed ignore bitmap, limit padding and itertools.starmap()/operator.ne(): 11.868687 [4, 9] Naive implementation: 7.438201
My latest version of python is 2.6 (this is RHEL5.5). I am currently compiling Pypy to try.
Anyone have any ideas how to make this feature work faster? Should I use Cython?
If I cannot get it to work faster, I will look at rewriting the entire tool in C ++ or Java.
Edit:
Ok I have timed various answers:
[4, 9] CompareFields: 5.808944 [4, 9] agf itertools answer: 4.550836 set([9, 4]) agf set based answer, but replaced list expression with a set to avoid duplicates: 9.149389 agf set based answer, as described in answer: about 8 seconds lucho set based answer: 10.682579
So now itertools is the way to go. Surprisingly, the dial-based solution performed so poorly. Although I'm not surprised that using lambda was slower.
Edit: Java test A naive implementation with too many if: 128ms commands