Approach #1
Use np.einsum -
np.einsum('ijkl,ilm->ijkm',m0,m1)
Steps involved:
Keep the first axis from both inputs aligned.
Lose the last axis of m0 against the second axis of m1 in the sum-reduction.
Let the remaining axes from m0 and m1 spread out / expand with elementwise multiplications in an outer-product fashion. The sketch below verifies the subscripts.
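A quick sanity check of those subscripts against a plain double-loop, on small arrays with arbitrary illustrative shapes (not taken from the original post):

import numpy as np

m0 = np.random.rand(2, 3, 4, 5)   # axes (i, j, k, l)
m1 = np.random.rand(2, 5, 6)      # axes (i, l, m)

out = np.einsum('ijkl,ilm->ijkm', m0, m1)

# Reference: loop over the two leading axes, one 2D matrix product per slice
ref = np.empty((2, 3, 4, 6))
for i in range(2):
    for j in range(3):
        ref[i, j] = m0[i, j].dot(m1[i])

assert np.allclose(out, ref)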
Approach #2
If you are looking for performance and the axis of sum-reduction has a smaller length, you are better off with one loop and matrix-multiplication with np.tensordot, like so -
s0,s1,s2,s3 = m0.shape
s4 = m1.shape[-1]
r = np.empty((s0,s1,s2,s4))
for i in range(s0):
    r[i] = np.tensordot(m0[i],m1[i],axes=([2],[0]))
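On 2D-contractable axes like these, np.tensordot boils down to a reshape followed by a single 2D matrix product, which is also what motivates Approach #3. A per-slice sketch of that equivalence (shapes are illustrative assumptions):

import numpy as np

A = np.random.rand(3, 4, 5)   # one slice m0[i], axes (j, k, l)
B = np.random.rand(5, 6)      # one slice m1[i], axes (l, m)

td = np.tensordot(A, B, axes=([2], [0]))          # contract the l axis
mm = A.reshape(3 * 4, 5).dot(B).reshape(3, 4, 6)  # reshape + 2D dot

assert np.allclose(td, mm)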
Approach #3
Now, np.dot can be used very efficiently on 2D inputs for a further performance boost. So, with it, a modified version, though a bit longer, but hopefully the most performant one -
s0,s1,s2,s3 = m0.shape
s4 = m1.shape[-1]
m0.shape = s0,s1*s2,s3   # Get m0 as 3D for temporary usage
r = np.empty((s0,s1*s2,s4))
for i in range(s0):
    r[i] = m0[i].dot(m1[i])
r.shape = s0,s1,s2,s4
m0.shape = s0,s1,s2,s3   # Put m0 back to 4D
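Note that assigning to m0.shape mutates the caller's array in place (and raises an error if m0 is not contiguous). If that side effect is unwanted, a reshape-based variant does the same work on a view; a sketch, with dot_app_view being a hypothetical name:

import numpy as np

def dot_app_view(m0, m1):
    # Same idea as Approach #3, but reshape() avoids mutating m0.shape;
    # on a contiguous input it returns a view, so no copy is made.
    s0, s1, s2, s3 = m0.shape
    s4 = m1.shape[-1]
    m0_3d = m0.reshape(s0, s1 * s2, s3)
    r = np.empty((s0, s1 * s2, s4))
    for i in range(s0):
        r[i] = m0_3d[i].dot(m1[i])
    return r.reshape(s0, s1, s2, s4)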
Runtime test
Function Definitions -
def original_app(m0, m1):
    s0,s1,s2,s3 = m0.shape
    s4 = m1.shape[-1]
    r = np.empty((s0,s1,s2,s4))
    for i in range(s0):
        for j in range(s1):
            r[i, j] = np.dot(m0[i, j], m1[i])
    return r

def einsum_app(m0, m1):
    return np.einsum('ijkl,ilm->ijkm',m0,m1)

def tensordot_app(m0, m1):
    s0,s1,s2,s3 = m0.shape
    s4 = m1.shape[-1]
    r = np.empty((s0,s1,s2,s4))
    for i in range(s0):
        r[i] = np.tensordot(m0[i],m1[i],axes=([2],[0]))
    return r

def dot_app(m0, m1):
    s0,s1,s2,s3 = m0.shape
    s4 = m1.shape[-1]
    m0.shape = s0,s1*s2,s3
    r = np.empty((s0,s1*s2,s4))
    for i in range(s0):
        r[i] = m0[i].dot(m1[i])
    r.shape = s0,s1,s2,s4
    m0.shape = s0,s1,s2,s3
    return r
Timing and Verification -
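A harness along these lines reproduces the comparison, reusing the function definitions above; the input shapes here are illustrative assumptions, not the ones from the original benchmark:

import numpy as np
from timeit import timeit

m0 = np.random.rand(50, 30, 20, 20)
m1 = np.random.rand(50, 20, 20)

funcs = [original_app, einsum_app, tensordot_app, dot_app]

# Verification: every approach must match the original double-loop
ref = original_app(m0, m1)
for f in funcs[1:]:
    assert np.allclose(ref, f(m0, m1))

# Timing: report the average of 10 runs per approach
for f in funcs:
    t = timeit(lambda: f(m0, m1), number=10)
    print(f.__name__, t / 10)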