The smart way to remove tuples

I have a list of tuples, as described below (this set is sorted in descending order of the second value):

from string import ascii_letters myTup = zip (ascii_letters, range(10)[::-1]) threshold = 5.5 >>> myTup [('a', 9), ('b', 8), ('c', 7), ('d', 6), ('e', 5), ('f', 4), ('g', 3), ('h', 2), \ ('i', 1), ('j', 0)] 

Given the threshold, what is the best way to discard all tuples having a second value less than that threshold.

I have more than 5 million tuples and therefore do not want to map the tuple to the tuple and therefore delete or add to another list of tuples.

+4
source share
5 answers

Since the tuples are sorted, you can simply find the first tuple with a value less than the threshold, and then delete the remaining values ​​using the slice notation:

 index = next(i for i, (t1, t2) in enumerate(myTup) if t2 < threshold) del myTup[index:] 

According to Vaughn Cato, binary search will accelerate even more. bisect.bisect would be useful, except that it would not work with your current data structure unless you create a separate key sequence as described here . But this violates your ban on creating new lists.

However, you can use the source code as the basis for your own binary search. Or you can change your data structure:

 >>> myTup [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), (6, 'g'), (7, 'h'), (8, 'i'), (9, 'j')] >>> index = bisect.bisect(myTup, (threshold, None)) >>> del myTup[:index] >>> myTup [(6, 'g'), (7, 'h'), (8, 'i'), (9, 'j')] 

The disadvantage is that deletion can occur over linear time, since Python will have to move the entire memory block back ... unless Python is smart about deleting fragments starting at 0 . (Somebody knows?)

Finally, if you really want to change the data structure, you can do this:

 [(-9, 'a'), (-8, 'b'), (-7, 'c'), (-6, 'd'), (-5, 'e'), (-4, 'f'), (-3, 'g'), (-2, 'h'), (-1, 'i'), (0, 'j')] >>> index = bisect.bisect(myTup, (-threshold, None)) >>> del myTup[index:] >>> myTup [(-9, 'a'), (-8, 'b'), (-7, 'c'), (-6, 'd')] 

(Note that Python 3 will complain about comparing None , so instead you can use something like (-threshold, chr(0)) .)

My suspicion is that the linear time search, which I suggested at the beginning, is acceptable in most cases.

+6
source

Here's an exotic approach that wraps a list in an object similar to a list before executing bisect.

 import bisect def revkey(items): class Items: def __getitem__(self, index): assert 0 <= index < _len return items[_max-index][1] def __len__(self): return _len def bisect(self, value): return _len - bisect.bisect_left(self, value) _len = len(items) _max = _len-1 return Items() tuples = [('a', 9), ('b', 8), ('c', 7), ('d', 6), ('e', 5), ('f', 4), ('g', 3), ('h', 2), ('i', 1), ('j', 0)] for x in range(-2, 12): assert len(tuples) == 10 t = tuples[:] stop = revkey(t).bisect(x) del t[stop:] assert t == [item for item in tuples if item[1] >= x] 
+2
source

Perhaps a bit faster code than @Curious:

 newTup=[] for tup in myTup: if tup[1]>threshold: newTup.append(tup) else: break 

Since the tuples are ordered, you do not need to go through all of them.

Another possibility is to use the bisector and find the index i last element, which is above the threshold value. Then you will do:

 newTup=myTup[:i] 

I think the last method will be the fastest.

+1
source

Given the number of tuples you are dealing with, you might consider using NumPy .

Define a structured array like

 my_array= np.array(myTup, dtype=[('f0',"|S10"), ('f1',float)]) 

You can access the second elements of your tuples with myarray['f1'] , which gives you a floating-point array. You can use fancy indexing to filter out the desired elements, for example

 my_array[myarray['f1'] < threshold] 

saving only records in which your f1 less than your threshold ..

0
source

You can also use itertools , for example.

 from itertools import ifilter iterable_filtered = ifilter(lambda x : x[1] > threshold, myTup) 

If you need an iterable filtered list or just:

 filtered = filter(lambda x: x[1] > threshold, myTup) 

to go straight to the list.

I am not very good at the relative performance of these methods and should test them (for example, in IPython using% timeit ).

0
source

Source: https://habr.com/ru/post/1433836/


All Articles