Approach #1: With np.einsum -
np.einsum('ij,ik,i->jk',p,p,w)
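For reference, this einsum computes the weighted sum of outer products over the rows of p, i.e. the sum over i of w[i] * outer(p[i], p[i]). A naive loop equivalent (the helper name is made up and it is only for illustration; it is far slower):

import numpy as np

def weighted_outer_sum(p, w):
    # Accumulate w[i] * outer(p[i], p[i]) row by row -- what the
    # einsum string 'ij,ik,i->jk' expresses in a single call.
    out = np.zeros((p.shape[1], p.shape[1]))
    for pi, wi in zip(p, w):
        out += wi * np.outer(pi, pi)
    return out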
Approach #2: With broadcasting + np.tensordot -
np.tensordot(p[...,None]*p[:,None], w, axes=((0),(0)))
Approach #3: With np.einsum + np.dot -
np.einsum('ij,i->ji',p,w).dot(p)
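All three approaches produce the same (m, m) array, which is easy to verify on random data before timing them. A minimal check, using the same shapes as the runtime test below:

import numpy as np

p = np.random.rand(50,30)
w = np.random.rand(50)

r1 = np.einsum('ij,ik,i->jk',p,p,w)
r2 = np.tensordot(p[...,None]*p[:,None], w, axes=((0),(0)))
r3 = np.einsum('ij,i->ji',p,w).dot(p)

# Floating-point results may differ in the last bits, hence allclose.
assert np.allclose(r1, r2) and np.allclose(r1, r3)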
Runtime test
Set #1:
In [653]: p = np.random.rand(50,30)
In [654]: w = np.random.rand(50)
In [655]: %timeit np.einsum('ij,ik,i->jk',p,p,w)
10000 loops, best of 3: 101 µs per loop
In [656]: %timeit np.tensordot(p[...,None]*p[:,None], w, axes=((0),(0)))
10000 loops, best of 3: 124 µs per loop
In [657]: %timeit np.einsum('ij,i->ji',p,w).dot(p)
100000 loops, best of 3: 9.07 µs per loop
Set #2:
In [658]: p = np.random.rand(500,300)
In [659]: w = np.random.rand(500)
In [660]: %timeit np.einsum('ij,ik,i->jk',p,p,w)
10 loops, best of 3: 139 ms per loop
In [661]: %timeit np.einsum('ij,i->ji',p,w).dot(p)
1000 loops, best of 3: 1.01 ms per loop
The third approach just blew the other two away!
Why is Approach #3 10x-130x faster than Approach #1?
In the first approach, np.einsum does the sum-reduction in one shot while keeping track of three iterators i, j and k (effectively a triple loop in C), so all of the heavy lifting stays inside einsum.
In the third approach, einsum only deals with two iterators, i and j, performing the cheap scaling of p by w; the expensive sum-reduction is then offloaded to BLAS-based matrix multiplication with np.dot, which is heavily optimized. That offloading is what gives the big speedup.
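To make that division of labor explicit, the same idea can be written with plain broadcasting instead of einsum (equivalent up to floating-point rounding; just a sketch, not another contender):

# Scale each row of p by its weight, then let BLAS do the (m, n) x (n, m) product.
(p * w[:,None]).T.dot(p)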