Why is dictionary search so much faster than two if tests in Python?

I need to read gigabytes of text, so I'm trying to optimize my code. In the process, I found that a dictionary lookup is faster than if-tests for my problem.

    check = {'R': '-', 'F': '+'}
    seqs = ['R', 'F'] * 100

    def check1():
        for entry in seqs:
            if entry == 'R':
                strand = '-'
            if entry == 'F':
                strand = '+'

    def check2():
        for entry in seqs:
            strand = check[entry]

Using IPython's %timeit, I see that the dictionary lookup is a little more than twice as fast as the two if-tests:

    In [63]: %timeit check1()
    10000 loops, best of 3: 38.8 us per loop

    In [64]: %timeit check2()
    100000 loops, best of 3: 16.2 us per loop

Since if-tests are so basic, I did not expect a performance difference. Is this a known effect? Can someone explain why this is so?
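For anyone without IPython, the same comparison can be reproduced with the standard timeit module (a sketch for Python 3.5+; absolute numbers depend on the machine and interpreter version):

    import timeit

    check = {'R': '-', 'F': '+'}
    seqs = ['R', 'F'] * 100

    def check1():
        for entry in seqs:
            if entry == 'R':
                strand = '-'
            if entry == 'F':
                strand = '+'

    def check2():
        for entry in seqs:
            strand = check[entry]

    # 10000 timed calls of each function, matching the loop counts above.
    print(timeit.timeit('check1()', globals=globals(), number=10000))
    print(timeit.timeit('check2()', globals=globals(), number=10000))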

UPDATE

I checked how the two functions above, as well as check3() from the answer below, affect the execution time of my actual code, and there is no effect on the total time. So either the speedup from the dictionary is not that large in the real world, where the 'R' and 'F' values must constantly be read from a file, or this piece of code is simply not part of my bottleneck.

Anyway, thanks for the answers!

+4
4 answers

As with a lot of VM-based code, it mostly comes down to the number of virtual machine opcodes involved.

You can view the disassembled functions with dis:

    import dis
    dis.dis(func)

In 2.6.4, check1 takes about 15-20 opcodes per iteration (depending on the code path) for the comparisons and branching; check2 takes only 7 (after adding the missing check dictionary, declared as a global).
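As an illustration, here is a minimal sketch (Python 3, reusing the question's definitions; the counts quoted above are from 2.6.4 and will differ on modern interpreters) that prints the number of bytecode instructions in each function:

    import dis

    check = {'R': '-', 'F': '+'}
    seqs = ['R', 'F'] * 100

    def check1():
        for entry in seqs:
            if entry == 'R':
                strand = '-'
            if entry == 'F':
                strand = '+'

    def check2():
        for entry in seqs:
            strand = check[entry]

    # Count the bytecode instructions that make up each function body.
    for func in (check1, check2):
        print(func.__name__, len(list(dis.get_instructions(func))))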

+4

Actually, you have not shown that dictionary lookups are faster than two if-tests in general. You have shown that a lookup in this particular dictionary is faster than these particular two tests.

Normally a dictionary lookup takes several steps: compute a hash from the key to find a candidate entry, then verify the candidate by comparing the keys. It may even take several comparisons if there are hash collisions. With user-defined classes as keys, both of these steps can be slow; for strings they are usually fast, but in one particular case they are very fast indeed, and your code hits that case.
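To make those steps visible, here is a small sketch (the Key wrapper class is hypothetical, purely for illustration) that logs when hashing and key comparison happen during dict operations:

    class Key:
        """Wraps a string and reports hash and equality calls."""
        def __init__(self, s):
            self.s = s
        def __hash__(self):
            print('hash computed for', self.s)
            return hash(self.s)
        def __eq__(self, other):
            print('equality check for', self.s)
            return self.s == other.s

    d = {Key('R'): '-', Key('F'): '+'}  # hashes are computed on insertion
    print(d[Key('R')])                  # lookup: hash again, then compare keys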

Your dictionary keys are short strings that fit the format of identifiers and are known at compile time, so Python interns your strings 'R' and 'F'. Since the strings you use in your tests are also known at compile time, they are exactly the same instances. For the dictionary lookup this means that a specialized routine for string-only dictionaries is used, the hash has always been precomputed, and key comparison is done by comparing addresses (at least when it succeeds, and with your two keys it never fails).
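A tiny sketch of the interning effect (CPython-specific; the exact interning rules are an implementation detail, so treat the identity results as illustrative):

    import sys

    a = 'hello_world'
    b = 'hello_world'
    print(a is b)              # True: identifier-like literals are interned

    c = ''.join(['hello', '_world'])   # built at runtime: not interned
    print(a == c)              # True: equal contents
    print(a is c)              # False: a different object, so comparison
                               # must inspect the characters
    print(a is sys.intern(c))  # True: interning maps it back to the
                               # canonical object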

Your real code, I assume, reads the values from a file, so it will not have interned copies of 'R'. That means it has to compute the hash for every input value, the addresses will not match, and the string comparison function must be called for each lookup. You still get the optimization for string-only dictionaries; at least it never has to fall back to the general comparison machinery for objects that might not be strings.

The if-tests know nothing about the object types, so they perform a general comparison every time.

+6

Dictionaries are highly optimized in Python: an O(1) lookup is just a probe into a hash table, essentially a single 'operation', half the number of operations of your two if-tests, and a chain of if-tests is O(n) in the number of cases.
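A quick sketch of that scaling (a hypothetical eight-way example; timings vary by machine and Python version): the if-chain gets slower as cases are added, while the dict lookup stays flat:

    import timeit

    mapping = {c: i for i, c in enumerate('ABCDEFGH')}

    def chain(entry):
        # Worst case: all eight branches are tested before the last matches.
        if entry == 'A': return 0
        if entry == 'B': return 1
        if entry == 'C': return 2
        if entry == 'D': return 3
        if entry == 'E': return 4
        if entry == 'F': return 5
        if entry == 'G': return 6
        if entry == 'H': return 7

    def lookup(entry):
        # One hash probe, regardless of how many keys the dict holds.
        return mapping[entry]

    print(timeit.timeit("chain('H')", globals=globals(), number=1000000))
    print(timeit.timeit("lookup('H')", globals=globals(), number=1000000))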

+1

Does this reveal anything:

    def check3():
        for entry in seqs:
            if entry == 'R':
                strand = '-'
            else:
                strand = '+'

This is actually faster than check2() on my computer.

+1

Source: https://habr.com/ru/post/1337928/

