The complexity of len() with respect to sets and lists

The complexity of len() is O(1) for both sets and lists. Why does it take longer for sets?

 ~$ python -m timeit "a=[1,2,3,4,5,6,7,8,9,10];len(a)"
 10000000 loops, best of 3: 0.168 usec per loop
 ~$ python -m timeit "a={1,2,3,4,5,6,7,8,9,10};len(a)"
 1000000 loops, best of 3: 0.375 usec per loop

Is this related to the particular benchmark, in the sense that it takes more time to build a set than a list, and the benchmark takes that into account as well?

If the creation of a set takes longer than the creation of a list, what would be the underlying reason?

+44
python time-complexity python-internals
Aug 27 '15 at 12:03
7 answers

Firstly, you did not measure the speed of len(); you measured the speed of creating a list/set together with the speed of len().

Use the --setup argument of timeit:

 $ python -m timeit --setup "a=[1,2,3,4,5,6,7,8,9,10]" "len(a)"
 10000000 loops, best of 3: 0.0369 usec per loop
 $ python -m timeit --setup "a={1,2,3,4,5,6,7,8,9,10}" "len(a)"
 10000000 loops, best of 3: 0.0372 usec per loop

The statements you pass to --setup are executed before measuring the speed of len().

Secondly, note that len(a) is a very fast operation, so the process of measuring its speed may introduce "noise". Note that the code executed (and timed) by timeit is equivalent to the following:

 for i in itertools.repeat(None, number):
     len(a)

Since both len(a) and itertools.repeat(...).__next__() are fast operations with comparable speeds, the speed of itertools.repeat(...).__next__() can influence the timings.

For this reason, you had better measure len(a); len(a); ...; len(a) (repeated 100 times or so), so that the body of the for loop takes significantly more time than the iterator itself:

 $ python -m timeit --setup "a=[1,2,3,4,5,6,7,8,9,10]" "$(for i in {0..1000}; do echo "len(a)"; done)"
 10000 loops, best of 3: 29.2 usec per loop
 $ python -m timeit --setup "a={1,2,3,4,5,6,7,8,9,10}" "$(for i in {0..1000}; do echo "len(a)"; done)"
 10000 loops, best of 3: 29.3 usec per loop

(The results still say that len() has the same characteristics on lists and sets, but now you can be more confident that the result is correct.)
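For reference, here is a minimal sketch of the same idea using the timeit module from Python instead of the shell loop; the repetition count of 1000 and the loop numbers are arbitrary choices, not part of the original measurement:

 import timeit

 # Call len(a) 1000 times per run so that the loop overhead is negligible
 # compared to the work being measured.
 stmt = ";".join(["len(a)"] * 1000)

 list_time = min(timeit.repeat(stmt, setup="a=[1,2,3,4,5,6,7,8,9,10]", number=10000))
 set_time = min(timeit.repeat(stmt, setup="a={1,2,3,4,5,6,7,8,9,10}", number=10000))

 print("list:", list_time)
 print("set: ", set_time)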

Thirdly, it is true that "complexity" and "speed" are related, but I think you are confusing the two. The fact that len() has O(1) complexity for both lists and sets does not mean that it must run at the same speed on lists and sets.

It means that, on average, no matter how long the list a is, len(a) performs the same asymptotic number of steps; and no matter how large the set b is, len(b) performs the same asymptotic number of steps. But the algorithm for computing the size of a list and of a set may differ, leading to different performance (timeit shows that this is not the case here, but it would have been a possibility).

Finally,

If the creation of a set takes longer than the creation of a list, what would be the underlying reason?

A set, as you know, does not allow repeated elements. Sets in CPython are implemented as hash tables (to provide average O(1) lookups): building and maintaining a hash table is much more involved than appending items to a list.

In particular, when building a set, you have to compute hashes, build the hash table, probe it to avoid inserting duplicate elements, and so on. In contrast, lists in CPython are implemented as a simple array of pointers that is malloc()ed and realloc()ed as needed.
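As an illustration (a rough sketch, not part of the original answer; the container size and repeat count are arbitrary and actual numbers depend on the machine and Python version), you can time construction alone to see the difference:

 import timeit

 # Time only the construction of a list vs. a set of the same elements.
 list_build = timeit.timeit("list(range(1000))", number=100000)
 set_build = timeit.timeit("set(range(1000))", number=100000)

 print("build list:", list_build)
 print("build set: ", set_build)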

+105
Aug 27 '15 at 12:12

The relevant lines are at http://svn.python.org/view/python/trunk/Objects/setobject.c?view=markup#l640

 640 static Py_ssize_t
 641 set_len(PyObject *so)
 642 {
 643     return ((PySetObject *)so)->used;
 644 }

and http://svn.python.org/view/python/trunk/Objects/listobject.c?view=markup#l431

 431 static Py_ssize_t
 432 list_length(PyListObject *a)
 433 {
 434     return Py_SIZE(a);
 435 }

Both are just constant-time lookups of a size that is already stored on the object.
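A rough Python-level analogue of the two C functions above (purely illustrative; the class and attribute names are made up): each type already stores its size, so len() just returns that stored value and never walks the elements.

 class ToySet:
     def __init__(self, items=()):
         self._table = {}               # stand-in for the real hash table
         for item in items:
             self._table[item] = None
         self.used = len(self._table)   # stored size, like PySetObject->used

     def __len__(self):                 # like set_len(): return the stored counter
         return self.used


 class ToyList:
     def __init__(self, items=()):
         self._items = list(items)      # stand-in for the array of pointers

     def __len__(self):                 # like list_length()/Py_SIZE(): stored size
         return len(self._items)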

So where does the difference come from, you may ask. You are also measuring the creation of the objects, and creating a set takes a bit longer than creating a list.

+18
Aug 27 '15 at 12:11

Use the -s flag of timeit so that the first statement (the setup) is not included in the timing:

 ~$ python -mtimeit -s "a=range(1000);" "len(a)"
 10000000 loops, best of 3: 0.0424 usec per loop



 ~$ python -mtimeit -s "a={i for i in range(1000)};" "len(a)"
 10000000 loops, best of 3: 0.0423 usec per loop

Now only the len() call is being timed, and the results are almost identical, since we no longer include the time taken to create the list/set.

+6
Aug 27 '15 at 12:13

Yes, you are right: it is mostly due to the different time required to create the set and list objects in Python. As a fairer benchmark, you can use the timeit module and pass the objects in via the setup argument:

 from timeit import timeit

 print '1st: ', timeit(stmt="len(a)", number=1000000, setup="a=set([1,2,3]*1000)")
 print '2nd: ', timeit(stmt="len(a)", number=1000000, setup="a=[1,2,3]*1000")

result:

 1st:  0.04927110672
 2nd:  0.0530669689178

And if you want to know why that is, let's take a look under the hood. A set uses a hash table, and a hash table uses a hash function to compute each element's hash value and map it to a slot; calling that function, computing the hash values, and the other bookkeeping involved take additional time. Creating a Python list, by contrast, simply builds a sequence of objects that you can access by index.
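A minimal sketch of that extra per-element work (heavily simplified and hypothetical; the real CPython code handles collisions, table resizing, and much more):

 def build_list(items):
     out = []
     for x in items:
         out.append(x)          # just append a pointer to the array
     return out

 def build_set_like(items):
     seen = {}
     for x in items:
         h = hash(x)            # extra step: every element gets hashed
         if x not in seen:      # extra step: probe the table to skip duplicates
             seen[x] = h
     return list(seen)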

You can learn more by reading the source code of the set_lookkey function.

Also note that even if two algorithms have the same complexity, it does not follow that they have exactly the same runtime or speed. [1]




[1] Big-O notation describes the limiting behavior of a function and does not give the exact complexity equation. For example, the complexity of both f(x) = 100000x + 1 and f(x) = 4x + 20 is O(x); this means both are linear, but as you can see the first function has a much larger slope, and for the same input they give different results.

+5
Aug 27 '15 at 12:09

To add to the excellent answers already given here: O(1) only tells you about the order of growth with respect to the size of the input.

O(1), in particular, means only constant time relative to the size of the input. One method could always take 0.1 s for any input, and another could take 1000 years for any input, and both of them would be O(1).
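For example (a contrived sketch; the sleep is just a stand-in for a large constant cost), both functions below are O(1) in the size of their input, yet one is far slower:

 import time

 def fast_len(seq):
     return len(seq)

 def slow_len(seq):
     time.sleep(0.1)            # constant overhead, independent of len(seq)
     return len(seq)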

In this case, while the documentation has some degree of ambiguity, it means that the function takes roughly the same time to process a list of size 1 as it takes to process a list of size 1000; similarly, processing a dictionary of size 1 takes the same time as processing a dictionary of size 1000.

No guarantee whatsoever is given across different types of data.

This is unsurprising, since the implementation of len() somewhere down the call stack may differ depending on the type of data.
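For instance (a small illustrative sketch; the class name is made up), len() dispatches to the type's own __len__ (or its C-level equivalent), so different types can run different code while each remains O(1):

 class Answer:
     def __len__(self):
         return 42              # trivially constant time, but a different code path

 print(len([1, 2, 3]))          # list's C-level list_length
 print(len({1, 2, 3}))          # set's C-level set_len
 print(len(Answer()))           # our Python-level __len__ -> 42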

By the way, this ambiguity is conveniently removed in statically typed languages, where ClassA.size() and ClassB.size() are, for all intents and purposes, two different methods.

+3
Aug 27 '15 at 12:35

Remove the len(a) statement and the result is almost the same. A set has to hash its elements in order to keep only distinct ones, so it is slower to create.

+1
Aug 27 '15 at 12:09

Many have noted that O(1) is not about performance across different types of data, but about performance as a function of different input sizes.

If you are trying to test for O(1)-ness, you would be looking at something more like:

 ~$ python -m timeit --setup "a=list(range(1000000))" "len(a)"
 10000000 loops, best of 3: 0.198 usec per loop
 ~$ python -m timeit --setup "a=list(range(1))" "len(a)"
 10000000 loops, best of 3: 0.156 usec per loop

Big input or small input, the time taken is very similar. Like the other posts, this separates the setup time from the measurement time, but it does not reduce the noise of len-time versus loop-time.
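The same check can be scripted with the timeit module (a sketch; the sizes and iteration counts are arbitrary): if len() is O(1), the timings should stay roughly flat as n grows.

 import timeit

 for n in (1, 10, 1000, 100000):
     t_list = timeit.timeit("len(a)", setup="a=list(range(%d))" % n, number=1000000)
     t_set = timeit.timeit("len(a)", setup="a=set(range(%d))" % n, number=1000000)
     print(n, t_list, t_set)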

+1
Aug 27 '15 at 21:14


