Why is converting a list to a set faster than just a list to calculate the difference in lists?

Question

Why is converting a list to a set faster than just a list to calculate the difference in lists?

Let's say I want to calculate the difference of two lists C = A - B :

 A = [1,2,3,4,5,6,7,8,9] B = [1,3,5,8,9] C = [2,4,6,7] #Result

A and B are sorted with unique integers (not sure if there is a way to tell Python about this list property). I need to keep the order of the elements. AFAIK, there are two possible ways to do this.

Method 1 : convert B to set and use list use to create C:

 s = set(B) C = [x for x in A if x not in s]

Method 2 Direct use of the list:

 C = [x for x in A if x not in B]

Why is #1 more effective than #2 ? Is there no overhead to convert to set? What am I missing here?

Some performance tests are provided in this answer.

UPDATE:. I know that the average search time O(1) corresponds to the average of the list O(n) , but if the original list A contains about a million integers, 'Does creating a collection actually take longer?

+6

performance python list set python-2.7

Phd Aug 13 '14 at 19:48

source share

3 answers

The average time complexity of the search (x in S) in the set is O (1), while for the list O (n).

You can check the details at https://wiki.python.org/moin/TimeComplexity

+7

Pankaj sharma Aug 13 '14 at 19:53

source share

According to Python time documentation

Membership in the x in s list is an average of linear time or O(n) .
To establish membership x in s is an average operation with constant time or O(1) .

Building a set is the worst linear time operation, because you would have to scan all the items in the list to build a hash table, therefore O(n) . n is the number of elements in the collection.

The main observation is that in method 1 the construction of the set s = set(B) is just a one-time operation, after which we have only n total number of elements in the set as in x not in B , so in general O(n) + n * O(1) or O(n) time complexity.

While in method 2, for each element in A , a test member of the list x not in B is executed, so the total complexity is n * O(n) = O(n^2) .

+6

user3881791 Aug 13 '14 at 20:07

source share

TheSoundDefense · Accepted Answer · 2014-08-13T19:51:04+0000

There is overhead for converting a list into a set, but the set is much faster than the list for in tags.

You can immediately see if the element x is set in the set y , because a hash table is used below it. No matter how large your set is, the search time is the same (mostly instantaneous) - this is known in Big-O notation as O (1). For a list, you need to individually check each item to see if item x in the list z . As your list grows, checking will take longer - this is O (n), which means that the length of the operation is directly related to how long the list is.

This increased speed can offset the creation overhead, which speeds up your check for installation.

EDIT: To answer this other question, Python has no way to determine if your list is sorted - not if you use a standard list object, anyway. Thus, it cannot achieve O (log n) performance with a list. If you want to write your own binary search method, which involves sorting the list, you can do it, but O (1) will beat O (log n) any day.

Why is converting a list to a set faster than just a list to calculate the difference in lists?

More articles: