PyPy is 17x faster than Python. Can Python be sped up?

Question


While solving a recent Advent of Code problem, I found that my default Python code was ~40x slower than PyPy. I was able to get that down to about 17x with this code, by limiting calls to len and by limiting global lookups by running it in a function.
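Why running the loop inside a function helps can be seen in the bytecode. The sketch below is illustrative only: the two helper functions and the opnames helper are hypothetical names, not part of the question's code. It uses the standard dis module to show that a name declared global is accessed with *_GLOBAL instructions (a dictionary lookup), while a local name uses *_FAST instructions (an indexed slot in the frame). Exact instruction names differ between CPython versions.

```python
import dis

counter = 0  # module-level name, looked up in the globals dict


def bump_global():
    # Hypothetical example: updates the module-level counter.
    global counter
    counter += 1


def bump_local():
    # Hypothetical example: the same update on a function-local name.
    counter = 0
    counter += 1
    return counter


def opnames(func):
    """Return the set of bytecode instruction names used by func."""
    return {ins.opname for ins in dis.get_instructions(func)}


global_ops = opnames(bump_global)
local_ops = opnames(bump_local)
print(sorted(global_ops))
print(sorted(local_ops))
```

This only shows which kind of lookup is compiled; how much time it saves on a given machine still has to be measured.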

Right now e.py runs in 5.162 seconds on Python 3.6.3 and 0.297 seconds on PyPy on my machine.

My question is: is this the irreducible speedup of a JIT, or is there some way to speed up the CPython answer? (Short of extreme means: could I go to Cython/Numba or something like that?) How can I convince myself that there is nothing more I can do?

See the gist for the input file, which is a list of numbers.

As described in the problem statement, they represent jump offsets: position += offsets[current], and then the offset you just used is incremented by 1. You are finished when a jump takes you outside the list.

Here's an example (the full input, which takes 5 seconds, is much longer and has larger numbers):

(0) 3  0  1  -3  - before we have taken any steps.
(1) 3  0  1  -3  - jump with offset 0 (that is, don't jump at all). Fortunately, the instruction is then incremented to 1.
 2 (3) 0  1  -3  - step forward because of the instruction we just modified. The first instruction is incremented again, now to 2.
 2  4  0  1 (-3) - jump all the way to the end; leave a 4 behind.
 2 (4) 0  1  -2  - go back to where we just were; increment -3 to -2.
 2  5  0  1  -2  - jump 4 steps forward, escaping the maze.
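The walkthrough above can be reproduced with a few lines of Python. This is a minimal sketch, not the benchmarked code: the function name count_steps and the decrement_large flag are names chosen here for illustration. With the flag off it applies the always-increment rule from the example; with it on it applies the >= 3 rule that the code below uses.

```python
def count_steps(offsets, decrement_large=False):
    """Count jumps until the position leaves the list.

    decrement_large=False: every used offset is incremented by 1.
    decrement_large=True:  offsets >= 3 are decremented instead.
    """
    offsets = list(offsets)  # work on a copy so the caller's list is untouched
    position = 0
    steps = 0
    while 0 <= position < len(offsets):
        jump = offsets[position]
        if decrement_large and jump >= 3:
            offsets[position] -= 1
        else:
            offsets[position] += 1
        position += jump
        steps += 1
    return steps


example = [0, 3, 0, 1, -3]
print(count_steps(example))                        # 5, as in the walkthrough
print(count_steps(example, decrement_large=True))  # 10 with the >= 3 rule
```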

Code:

def run(cmds):
    location = 0
    counter = 0
    while 1:
        try:
            cmd = cmds[location]
            if cmd >= 3:
                cmds[location] -= 1
            else:
                cmds[location] += 1
            location += cmd
            if location < 0:
                print(counter)
                break
            counter += 1
        except:
            print(counter)
            break

if __name__=="__main__":
    text = open("input.txt").read().strip().split("\n")
    cmds = [int(cmd) for cmd in text]
    run(cmds)

edit: I compiled and ran the code with Cython, which reduced the runtime to 2.53s, but it is still almost an order of magnitude slower than PyPy.

edit: Numba gets me within 2x

edit: The best Cython I could write hit 1.32s, just over 4x PyPy

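edit: Declaring the cmd variable with cdef, as suggested by @viraptor, got the Cython version down to 0.157 seconds! Still, I come away very impressed with the PyPy JIT!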


performance python benchmarking pypy

llimllib 06 . '17 4:01


Peter Cordes · Answer 1 · 2017-12-06T16:13:01+0000

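As a baseline for Python, I wrote this in C (and then switched to C++). It compiles efficiently for x86-64 with clang++. It runs about 82x faster than CPython3.6.2 with the code in the question, on a Skylake x86, so even the faster Python versions are still well short of near-optimal machine code. (Yes, I looked at the compiler's asm output to check that it did a good job.)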

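Letting a good JIT or ahead-of-time compiler see the loop logic is the key to performance here. The problem is inherently serial, so there is no way to have Python run already-compiled C over the whole array (as with NumPy): there is no compiled C for this specific problem unless you use Cython or something similar. Going back to the CPython interpreter for every step of the problem kills performance, because nothing else (such as a memory bottleneck) hides its slowness.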

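Update: transforming the array of offsets into an array of pointers speeds it up by another factor of about 1.5 (a simple addressing mode, plus removing an add from the loop-carried dependency chain on the critical path, which brings it down to just the 4-cycle L1D load-use latency instead of 6c = 5c + 1c for an indexed addressing mode + add).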

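But I think we can be generous to Python and not expect it to keep up with algorithm transformations that suit modern CPUs :P (Especially since I used 32-bit pointers even in 64-bit mode, to make sure the 4585-element array was still only about 18 kB and fits in the 32 kB L1D cache, like the Linux x32 ABI or the AArch64 ILP32 ABI do.)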

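Also, an alternate expression for the update gets gcc to compile it branchless, like clang does. (It is commented out in the source below, and the original perf stat output is left in this answer, because the effect of branchless vs. branchy code with mispredicts is interesting.)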

unsigned jumps(int offset[], unsigned size) {
    unsigned location = 0;
    unsigned counter = 0;

    do {
          //location += offset[location]++;            // simple version
          // >=3 conditional version below

        int off = offset[location];

        offset[location] += (off>=3) ? -1 : 1;       // branchy with gcc
        // offset[location] = (off>=3) ? off-1 : off+1;  // branchless with gcc and clang.  

        location += off;

        counter++;
    } while (location < size);

    return counter;
}

#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::ios::sync_with_stdio(false);     // makes cin faster
    std::istream_iterator<int> begin(std::cin), dummy;
    std::vector<int> values(begin, dummy);   // construct a dynamic array from reading stdin

    unsigned count = jumps(values.data(), values.size());
    std::cout << count << '\n';
}

This is the inner loop from clang4.0.1 -O3 -march=skylake, for the >=3 conditional version. (Source + asm on the Godbolt compiler explorer.)

.LBB1_4:                                # =>This Inner Loop Header: Depth=1
    mov     ebx, edi               ; silly compiler: extra work inside the loop to save code outside
    mov     esi, dword ptr [rax + 4*rbx]  ; off = offset[location]
    cmp     esi, 2
    mov     ecx, 1
    cmovg   ecx, r8d               ; ecx = (off>=3) ? -1 : 1;  // r8d = -1 (set outside the loop)
    add     ecx, esi               ; off += -1 or 1
    mov     dword ptr [rax + 4*rbx], ecx  ; store back the updated off
    add     edi, esi               ; location += off  (original value)
    add     edx, 1                 ; counter++
    cmp     edi, r9d
    jb      .LBB1_4                ; unsigned compare against array size

Output of perf stat ./a.out < input.txt (for the clang version), on an i7-6700k Skylake CPU:

21841249        # correct total, matches Python

 Performance counter stats for './a.out':

         36.843436      task-clock (msec)         #    0.997 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               119      page-faults               #    0.003 M/sec                  
       143,680,934      cycles                    #    3.900 GHz                    
       245,059,492      instructions              #    1.71  insn per cycle         
        22,654,670      branches                  #  614.890 M/sec                  
            20,171      branch-misses             #    0.09% of all branches        

       0.036953258 seconds time elapsed

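The average instructions per clock is much lower than 4 because of the data dependency in the loop: the load address for the next iteration depends on the load + add of this iteration, and out-of-order execution cannot hide that. It can overlap all the work of updating the value at the current location, though.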

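Changing from int to short has no performance downside (as expected; movsx has the same latency as mov on Skylake), but movsx loads halve the memory footprint, so a larger array could fit in the L1D cache if needed.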

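I tried compiling the array into the program (as int offsets[] = { file contents with commas added };) so that it does not have to be read and parsed. This also makes the size a compile-time constant. It reduced the run time to ~36.2 +/- 0.1 milliseconds, down from ~36.8, so the version that reads from a file was still spending most of its time on the actual problem, not on parsing the input. (Unlike Python, C++ has negligible startup overhead, and my Skylake CPU ramps up to its maximum clock speed very quickly thanks to hardware P-state management in Skylake.)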

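As described earlier, pointer-chasing with a simple addressing mode like [rdi] instead of [rdi + rdx*4] has 1 cycle lower latency and avoids the add (index += offset becomes current = target). Intel CPUs since IvyBridge handle register-to-register integer mov with zero latency, so that does not lengthen the critical path. (The source with comments and the asm for this hacky optimization are not reproduced here.) A typical run (with text parsing into a std::vector): 23.26 +- 0.05 ms, 90.725 M cycles (3.900 GHz), 288.724 M instructions (3.18 IPC). Interestingly, it executes more instructions in total, but runs much faster because the loop-carried dependency chain has lower latency.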

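gcc uses a branch and is about 2x slower. (14% of branches are mispredicted according to perf stat on the whole program. It only branches as part of updating the values, but branch misses stall the pipeline, so they affect the critical path too, in a way that data dependencies do not here. This looks like a poor decision by the optimizer.)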

Rewriting the conditional as offset[location] = (off>=3) ? off-1 : off+1; convinces gcc to go branchless, with asm that looks good.

gcc7.1.1 -O3 -march=skylake (for the original source, which compiles with a branch for (off>=3) ? -1 : +1):

Performance counter stats for './ec-gcc':

     70.032162      task-clock (msec)         #    0.998 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           118      page-faults               #    0.002 M/sec                  
   273,115,485      cycles                    #    3.900 GHz                    
   255,088,412      instructions              #    0.93  insn per cycle         
    44,382,466      branches                  #  633.744 M/sec                  
     6,230,137      branch-misses             #   14.04% of all branches        

   0.070181924 seconds time elapsed

vs. CPython (Python3.6.2 on Arch Linux):

perf stat python ./orig-v2.e.py
21841249

 Performance counter stats for 'python ./orig-v2.e.py':

       3046.703831      task-clock (msec)         #    1.000 CPUs utilized          
                10      context-switches          #    0.003 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               923      page-faults               #    0.303 K/sec                  
    11,880,130,860      cycles                    #    3.899 GHz                    
    38,731,286,195      instructions              #    3.26  insn per cycle         
     8,489,399,768      branches                  # 2786.421 M/sec                  
        18,666,459      branch-misses             #    0.22% of all branches        

       3.046819579 seconds time elapsed
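To put the three perf stat outputs side by side, the following sketch divides the reported times and cycle counts by the step count printed by the programs (21841249). The numbers are copied from the outputs above; the dictionary layout and variable names are choices made here. The cycle counts cover the whole process, including startup and input parsing, so the per-step figures are approximations.

```python
STEPS = 21_841_249  # answer printed by both the C++ and Python programs

# task-clock (msec) and cycles, copied from the perf stat outputs above
runs = {
    "clang C++": {"ms": 36.843436, "cycles": 143_680_934},
    "gcc C++ (branchy)": {"ms": 70.032162, "cycles": 273_115_485},
    "CPython 3.6.2": {"ms": 3046.703831, "cycles": 11_880_130_860},
}

baseline_ms = runs["clang C++"]["ms"]
results = {}
for name, r in runs.items():
    slowdown = r["ms"] / baseline_ms       # run time relative to clang
    cycles_per_step = r["cycles"] / STEPS  # whole-process cycles per jump
    results[name] = (slowdown, cycles_per_step)
    print(f"{name:18s} {slowdown:6.1f}x  {cycles_per_step:7.1f} cycles/step")
```

The clang build comes out at roughly 6.6 cycles per jump, close to the 6-cycle load + add latency discussed earlier, while CPython spends several hundred cycles on each jump.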

Sorry, I don't have PyPy or other Python implementations installed.

viraptor · Answer 2 · 2017-12-06T04:16:29+0000

You can improve small things, but PyPy will (most likely) always be faster at this task.

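For both CPython and Cython:

Don't read in the whole file at once. You can iterate over the lines as you go, which saves (re)allocation costs. This requires pre-allocating the array, though.
Maybe switch from a list to an array.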
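The two suggestions above could be sketched in plain Python as follows. This is an illustration under assumptions, not a measured improvement: load_offsets is a hypothetical helper name, the standard library array module stands in for "an array", and an in-memory io.StringIO replaces input.txt so the example is self-contained. It does not pre-allocate; it simply lets array grow while iterating over the lines. Whether array beats list under CPython has to be timed, since each element access still creates a Python int object.

```python
import io
from array import array


def load_offsets(lines):
    """Parse one integer per line into a compact array of C ints."""
    return array("i", (int(line) for line in lines if line.strip()))


def run(cmds):
    """Apply the >= 3 jump rule from the question and return the step count."""
    location = 0
    counter = 0
    size = len(cmds)  # call len once, outside the loop
    while 0 <= location < size:
        cmd = cmds[location]
        cmds[location] = cmd - 1 if cmd >= 3 else cmd + 1
        location += cmd
        counter += 1
    return counter


# Stand-in for open("input.txt"); any iterable of lines works.
sample = io.StringIO("0\n3\n0\n1\n-3\n")
offsets = load_offsets(sample)
step_count = run(offsets)
print(offsets.typecode, len(offsets), step_count)
```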

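For Cython:

Mark the variable types, especially the counters as int and the commands as an array of int, to skip most type checks.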
