Optimize the comparison between the two lists by specifying indexes that differ

I have three lists: old, new and ignored. old and new - line lists. ignore is a list of indexes that should be ignored if they do not match. The goal is to create a list of indexes that are different and not ignored.

old and new may contain a different number of elements. If the difference in size is between the old and the new, the difference should be marked as inappropriate (if not ignored).

My current function is as follows:

def CompareFields( old, new, ignore ): if ( old == None ): if ( new == None ): return []; else: return xrange( len(new) ) elif ( new == None ): return xrange( len(old) ) oldPadded = itertools.chain( old, itertools.repeat(None) ) newPadded = itertools.chain( new, itertools.repeat(None) ) comparisonIterator = itertools.izip( xrange( max( len(old ) , len( new ) ) ), oldPadded, newPadded ) changedItems = [ i for i,lhs,rhs in comparisonIterator if lhs != rhs and i not in ignore ] return changedItems 

The timings of the various options I tried give the following timings for 100,000 runs:

 [4, 9] CompareFields: 6.083546 set([9, 4]) Set based: 12.594869 [4, 9] Function using yield: 13.063725 [4, 9] Use a (precomputed) ignore bitmap: 7.009405 [4, 9] Use a precomputed ignore bitmap and give a limit to itertools.repeat(): 8.297951 [4, 9] Use precomputed ignore bitmap, limit padding and itertools.starmap()/operator.ne(): 11.868687 [4, 9] Naive implementation: 7.438201 

My latest version of python is 2.6 (this is RHEL5.5). I am currently compiling Pypy to try.

Anyone have any ideas how to make this feature work faster? Should I use Cython?

If I cannot get it to work faster, I will look at rewriting the entire tool in C ++ or Java.

Edit:

Ok I have timed various answers:

 [4, 9] CompareFields: 5.808944 [4, 9] agf itertools answer: 4.550836 set([9, 4]) agf set based answer, but replaced list expression with a set to avoid duplicates: 9.149389 agf set based answer, as described in answer: about 8 seconds lucho set based answer: 10.682579 

So now itertools is the way to go. Surprisingly, the dial-based solution performed so poorly. Although I'm not surprised that using lambda was slower.

Edit: Java test A naive implementation with too many if: 128ms commands

+4
source share
3 answers

For both of these solutions you should:

 ignore = set(ignore) 

which will give you a constant (average) time in .

I think this is the itertools / zip based method you were looking for:

 [i for i, (o, n) in enumerate(izip_longest(old, new)) if o != n and i not in ignore] 

There is no need for chain / repeat for laying - what is izip_longest ? enumerate also more suitable than xrange .

And the more Pythonic (and possibly faster) version of the filter / set difference method in Lucho responds:

 [i for i, v in set(enumerate(new)).symmetric_difference(enumerate(old)) if i not in ignore] 

List enumerations are preferable to filter or map on lambda , and there is no need to convert both lists to set if you use the symmetric_difference method instead of the ^ / xor operator.

+4
source

make ignore as installed.

filter(lambda x: x[0] not in ignore, set(enumerate(new)) ^ set(enumerate(old)))

I bet that it will be faster than your excessive non-pyphonic outs attempts (it would be great if you could measure it - I'm curious).

+1
source
List constructors

- this is certainly a pythonic thing, I would do something similar to this:

 def findDiff(old, new, ignore): ignore = set(ignore) diff = [] (small, big) = (old, new) if len(old) < len(new) else (new, old) diff.extend([i for i in xrange(0,len(small)) if i not in ignore and old[i] != new[i]]) diff.extend([i for i in xrange(len(small), len(big)) if i not in ignore]) return diff 

for a quick function, this assumes that all indexes over the length of the smallest list will be considered different and still be ignored.

+1
source

Source: https://habr.com/ru/post/1379240/


All Articles