Why is Python3 so much slower than Python2 in my task?

I was surprised to find that Python 3.5.2 is much slower than Python 2.7.12. I wrote a simple command-line one-liner that counts the number of lines in a huge CSV file.

    $ cat huge.csv | python -c "import sys; print(sum(1 for _ in sys.stdin))"
    101253515   # it took 15 seconds
    $ cat huge.csv | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
    101253515   # it took 66 seconds

Python 2.7.12 took 15 seconds, Python 3.5.2 took 66 seconds. I expected some difference, but why is it so huge? What is it about Python 3 that makes it so much slower at tasks like this? Is there a faster way to count the number of rows in Python 3?

My processor is an Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz.

huge.csv is 18.1 GB in size and contains 101253515 lines.

To be clear, I am not trying to find the exact number of lines of a large file at any cost; this is just a minimal case where Python 3 is much slower. I am actually developing a Python 3 script that deals with large CSV files, and some of its operations do not involve the csv library. I know I could write the script in Python 2 and its speed would be acceptable, but I would like to know how to write a comparable script in Python 3. That is why I am interested in what makes Python 3 slower in my example and how it can be improved with "honest" Python approaches.

1 answer

The sys.stdin object is a bit more complicated in Python 3 than in Python 2. For example, reading from sys.stdin in Python 3 decodes the input to Unicode by default, so it fails on bytes that are not valid UTF-8:

    $ echo -e "\xf8" | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "<string>", line 1, in <genexpr>
      File "/usr/lib/python3.5/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte
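As an aside, if you do need text mode but must tolerate arbitrary bytes, one workaround (a minimal sketch using only the standard io module; not benchmarked here) is to rewrap the underlying byte stream with a more forgiving error handler:

    import io
    import sys

    # Rewrap the raw byte stream so undecodable bytes are smuggled through
    # as surrogates instead of raising UnicodeDecodeError.
    stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8',
                             errors='surrogateescape')
    print(sum(1 for _ in stdin))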

Note that Python 2 has no problem with this input. As you can see, Python 3's sys.stdin does more work under the hood. I'm not sure whether this alone accounts for the performance loss, but you can explore it further by trying sys.stdin.buffer under Python 3:

    import sys
    print(sum(1 for _ in sys.stdin.buffer))

Note that .buffer does not exist in Python 2. I ran some tests and saw no real performance difference between Python 2's sys.stdin and Python 3's sys.stdin.buffer, but YMMV.
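To see what Python 3 stacks on top of the raw file descriptor, you can inspect the io layers yourself (all attributes below are standard library names):

    import sys

    # Python 3 wraps stdin in three layers: a decoding TextIOWrapper on top
    # of a BufferedReader on top of the raw FileIO object.
    print(type(sys.stdin))             # <class '_io.TextIOWrapper'>
    print(type(sys.stdin.buffer))      # <class '_io.BufferedReader'>
    print(type(sys.stdin.buffer.raw))  # <class '_io.FileIO'>

Iterating over sys.stdin pays for UTF-8 decoding and str object creation on every line; iterating over sys.stdin.buffer skips the decoding layer entirely.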

EDIT: Here are some rough results on my machine: Ubuntu 16.04, i7 CPU, 8 GiB RAM. First, some C code as a baseline for comparison:

    #include <unistd.h>

    int main(void) {
        char buffer[4096];
        size_t total = 0;
        for (;;) {
            /* Read raw bytes from stdin; read() returns 0 at EOF. */
            ssize_t result = read(STDIN_FILENO, buffer, sizeof(buffer));
            if (result <= 0) {
                break;
            }
            total += (size_t)result;
        }
        return 0;
    }

Now the file size:

    $ ls -s --block-size=M | grep huge2.txt
    10898M huge2.txt

And the tests:

    # a.out is the simple C equivalent above (except for the final print)
    $ time cat huge2.txt | ./a.out

    real    0m20.607s
    user    0m0.236s
    sys     0m10.600s

    $ time cat huge2.txt | python -c "import sys; print(sum(1 for _ in sys.stdin))"
    898773889

    real    1m24.268s
    user    1m20.216s
    sys     0m8.724s

    $ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin.buffer))"
    898773889

    real    1m19.734s
    user    1m14.432s
    sys     0m11.940s

    $ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
    898773889

    real    2m0.326s
    user    1m56.148s
    sys     0m9.876s

So, the file I used was a bit smaller and the times were longer (it seems you have a faster machine, and I did not have the patience for bigger files :D). In any case, Python 2's sys.stdin and Python 3's sys.stdin.buffer are very close in my tests, Python 3's sys.stdin is slower, and all of them are waaaay behind the C code (which has almost zero user time).
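Finally, if the goal is only a line count, iterating line by line is unnecessary in either Python version. Here is a sketch (not part of the timings above) that reads fixed-size binary chunks and counts newline bytes, which avoids creating one object per line:

    import sys

    def count_lines(stream, bufsize=1 << 20):
        # Read 1 MiB binary chunks and count the b'\n' bytes in each.
        total = 0
        while True:
            chunk = stream.read(bufsize)
            if not chunk:
                break
            total += chunk.count(b'\n')
        return total

    print(count_lines(sys.stdin.buffer))

Most of the work then happens inside bytes.count, which is implemented in C, so this should land much closer to the C baseline than any per-line iteration.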
