Joining strings: generator expression or list comprehension?

Consider the problem of extracting the alphabetic characters from a huge string.

One way to do it

''.join([c for c in hugestring if c.isalpha()]) 

The mechanism is clear: the list comprehension builds a list of characters, and the join method knows how many items it has to combine from the length of that list.

Another way to do it

 ''.join(c for c in hugestring if c.isalpha()) 

Here, the generator expression produces a generator. The join method does not know how many items it is about to join, because a generator has no length (it does not support len()). So joining this way should be slower than the list comprehension version.

But testing in Python shows that it is not slower. Why is this so? Can anyone explain how join works on a generator?

To be clear:

 sum(j for j in range(100)) 

does not need to know about the 100 in advance, because it can keep a running total. It can fetch the next item by calling next on the generator and add it to the total. However, since strings are immutable, concatenating strings cumulatively would create a new string at each iteration, so it would take a long time.
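The mental model in the paragraph above can be sketched as follows. This is only an illustration of the reasoning in the question: the helper names concat_accumulate and join_once are hypothetical, and the small sample string stands in for hugestring. CPython can sometimes avoid the copy when appending to a string, so the real cost of the first function may be lower than the model suggests.

```python
def concat_accumulate(chars):
    """Sum-style accumulation (hypothetical helper): one concatenation per item."""
    result = ''
    for c in chars:
        # Strings are immutable, so conceptually this builds a new string each time.
        result = result + c
    return result


def join_once(chars):
    """Hand the whole iterable to str.join and let it build the result in one call."""
    return ''.join(chars)


sample = "a1b2c3"  # small stand-in for hugestring
accumulated = concat_accumulate(c for c in sample if c.isalpha())
joined = join_once(c for c in sample if c.isalpha())
print(accumulated == joined)
```

Both functions return the same string; the question is only about how much intermediate work each approach implies.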

3 answers

When you call str.join(gen), where gen is a generator, Python first does the equivalent of list(gen) and then proceeds using the length of the resulting sequence.

In particular, if you look at the code that implements str.join in CPython, you will see this call:

  fseq = PySequence_Fast(seq, "can only join an iterable"); 

Calling PySequence_Fast converts the seq argument to a list if it was not already a list or tuple.

So, the two versions of your call are handled almost the same way. With the list comprehension, you build the list yourself and pass it to join. With the generator expression, the generator object you pass is turned into a list at the start of join, and the rest of the code works the same for both versions.
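The behavior described in this answer can be modelled in pure Python. This is a rough sketch, not CPython's actual implementation: join_model is a hypothetical name, the final copy is simply delegated to str.join, and the short sample string stands in for hugestring.

```python
def join_model(sep, iterable):
    """Rough model of the two-step behavior described above (not CPython's code)."""
    # Step 1: anything that is not already a list or tuple is materialized first,
    # mirroring what the answer says PySequence_Fast does.
    items = iterable if isinstance(iterable, (list, tuple)) else list(iterable)

    # Step 2: with a real sequence in hand, the item count and the total size
    # of the result are known before any characters are copied.
    count = len(items)
    total_chars = sum(len(s) for s in items) + len(sep) * max(count - 1, 0)

    # Step 3: the copying itself is delegated to str.join in this sketch.
    result = sep.join(items)
    assert len(result) == total_chars
    return result


sample = "a1b2c3 d4!"  # small stand-in for hugestring
from_list = join_model('', [c for c in sample if c.isalpha()])
from_gen = join_model('', (c for c in sample if c.isalpha()))
print(from_list == from_gen)
```

In this model the only difference between the two calls is who builds the list: the caller, or the first step of the join.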


join() does not need to be implemented as sequentially appending the elements of a sequence to a longer and longer accumulated string (which would indeed be very slow for long sequences); it just needs to produce the same result. So join() probably just appends characters to some internal memory buffer and creates a string from it at the end. The list comprehension, on the other hand, first has to build the list (by iterating over hugestring), and only then can join() begin its work.

In addition, I doubt that join() looks at the length of the list, since it cannot know that each element is a single character (in most cases it will not be); it probably just obtains an iterator from the list.


At least on my machine, the list comprehension is faster for the case I tested, probably because ''.join can optimize memory allocation. This probably depends on the specific example you are testing (for example, if the condition you are testing is true less often, the price CPython pays for not knowing the length ahead of time may be smaller):

 In [18]: s = ''.join(np.random.choice(list(string.printable), 1000000))

 In [19]: %timeit ''.join(c for c in s if c.isalpha())
 10 loops, best of 3: 69.1 ms per loop

 In [20]: %timeit ''.join([c for c in s if c.isalpha()])
 10 loops, best of 3: 61.8 ms per loop
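A similar measurement can be reproduced without IPython or NumPy. The sketch below uses only the standard library; the input size, the seed, and the function names with_generator and with_list are arbitrary choices made for illustration, and the timings will vary by machine and Python version.

```python
import random
import string
import timeit

random.seed(0)  # fixed seed so the test string is the same on every run
s = ''.join(random.choices(string.printable, k=100_000))


def with_generator():
    # Generator expression passed directly to join.
    return ''.join(c for c in s if c.isalpha())


def with_list():
    # List comprehension built first, then passed to join.
    return ''.join([c for c in s if c.isalpha()])


loops = 5
gen_time = min(timeit.repeat(with_generator, number=loops, repeat=3))
list_time = min(timeit.repeat(with_list, number=loops, repeat=3))
print(f"generator: {gen_time / loops * 1000:.2f} ms per loop")
print(f"list:      {list_time / loops * 1000:.2f} ms per loop")
```

Taking the minimum over several repeats follows the same "best of 3" convention as the %timeit output above.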

Source: https://habr.com/ru/post/1240740/