NumPy may, in some cases, rely on a library that uses multiple threads and thereby distributes the load across several cores. This, however, depends on the underlying library and has little to do with NumPy's Python code. So yes, NumPy and any other library can overcome these limitations if their heavy lifting is not written in Python. Some libraries even offer GPU-accelerated features.
NumExpr uses the same method to bypass the GIL. From its home page:
In addition, numexpr implements support for multi-threaded computation directly in its internal virtual machine, written in C. This allows it to bypass the GIL in Python.
However, there are some fundamental differences between NumPy and NumExpr. NumPy focuses on providing a good Pythonic interface for array operations, whereas NumExpr has a much narrower scope and its own little language. When NumPy evaluates c = 3*a + 4*b, where the operands are arrays, two full-size temporary arrays (3*a and 4*b) are created in the process. NumExpr can optimize the same calculation so that the multiplications and the addition are performed in blocks, without materializing any intermediate results.
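A minimal sketch of the difference, assuming numexpr is installed (array sizes are arbitrary; the point is only that both paths give the same result while NumExpr avoids the full-size temporaries):

```python
import numpy as np
import numexpr as ne

a = np.random.random(1_000_000)
b = np.random.random(1_000_000)

# NumPy evaluates this in steps: t1 = 3*a, t2 = 4*b, c = t1 + t2,
# allocating two full-size temporary arrays along the way.
c_np = 3*a + 4*b

# NumExpr compiles the whole expression for its virtual machine and
# streams over the operands in blocks, with no full-size temporaries.
c_ne = ne.evaluate("3*a + 4*b")

assert np.allclose(c_np, c_ne)
```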
This leads to some interesting results. The following benchmarks were run on a 4-core/8-thread i7 processor, with timings taken using IPython's %timeit:
import numpy as np
import numexpr as ne

def addtest_np(a, b): a + b
def addtest_ne(a, b): ne.evaluate("a+b")

def addtest_np_inplace(a, b): a += b
def addtest_ne_inplace(a, b): ne.evaluate("a+b", out=a)

def addtest_np_constant(a): a + 3
def addtest_ne_constant(a): ne.evaluate("a+3")

def addtest_np_constant_inplace(a): a += 3
def addtest_ne_constant_inplace(a): ne.evaluate("a+3", out=a)

a_small = np.random.random((100, 10))
b_small = np.random.random((100, 10))

a_large = np.random.random((100000, 1000))
b_large = np.random.random((100000, 1000))
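For readers who want to reproduce the comparison outside IPython, here is a sketch using the standard-library timeit module instead of %timeit (the array shape here is illustrative; absolute numbers will differ from the ones below):

```python
import timeit

import numpy as np
import numexpr as ne

a = np.random.random((1000, 1000))
b = np.random.random((1000, 1000))

# Time plain NumPy addition vs. the NumExpr equivalent.
t_np = timeit.timeit(lambda: a + b, number=20)
t_ne = timeit.timeit(lambda: ne.evaluate("a+b"), number=20)

print(f"numpy:   {t_np:.4f} s for 20 runs")
print(f"numexpr: {t_ne:.4f} s for 20 runs")
```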
Of course, this benchmarking method is not very accurate, but some general trends emerge:
- NumPy uses fewer clock cycles per operation (np < ne1, i.e. NumPy beats single-threaded NumExpr)
- parallelism helps a bit with very large arrays (10-20%)
- NumExpr is much slower with small arrays
- NumPy is very strong with in-place operations
NumPy does not parallelize simple arithmetic operations, but, as can be seen above, that does not really matter: the speed is mostly limited by memory bandwidth, not processing power.
If we do something more complex, everything changes.
np.sin(a_large)              # 19.4 ns/element
ne.evaluate("sin(a_large)")  #  5.5 ns/element
Speed is no longer limited by memory bandwidth. To make sure this is really due to threading (and not to NumExpr sometimes using a fast vector math library):
ne.set_num_threads(1)
ne.evaluate("sin(a_large)")
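The same check as a self-contained sketch (the array shape and thread count are illustrative; set_num_threads caps the pool at the machine's actual core count):

```python
import numpy as np
import numexpr as ne

a = np.random.random((2000, 2000))

# Pin NumExpr to a single thread: roughly NumPy-like speed.
ne.set_num_threads(1)
r1 = ne.evaluate("sin(a)")

# Let NumExpr use several threads: this is where the speedup comes from.
ne.set_num_threads(8)
r8 = ne.evaluate("sin(a)")

# Thread count changes speed, not the result.
assert np.allclose(r1, r8)
assert np.allclose(r1, np.sin(a))
```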
Here parallelism helps a lot.
NumPy can use parallel processing for more complex linear algebra operations, such as matrix inversion. NumExpr does not support these operations, so there is no meaningful comparison. The actual speed depends on the library used (BLAS/ATLAS/LAPACK). The same goes for complex operations such as the FFT, where performance depends on the backing library. (AFAIK, NumPy/SciPy does not support FFTW yet.)
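A small sketch of the kind of operation that NumPy hands off to BLAS/LAPACK (the matrix size is arbitrary; whether it actually runs multi-threaded depends on the BLAS build NumPy was compiled against):

```python
import numpy as np

rng = np.random.default_rng(0)
# A well-conditioned matrix: random entries plus a strong diagonal.
m = rng.random((500, 500)) + 500 * np.eye(500)

# np.linalg.inv delegates to a LAPACK routine under the hood,
# which may use multiple cores depending on the BLAS backend.
m_inv = np.linalg.inv(m)

# m @ m_inv should be (numerically) the identity.
assert np.allclose(m @ m_inv, np.eye(500))
```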
To sum up, there are cases where NumExpr is very fast and useful, and there are cases where NumPy is the fastest. If you have large arrays and element-wise operations, NumExpr is very powerful. However, it is worth noting that some parallelism (or even spreading the computation across machines) is often quite easy to add to the code with multiprocessing or something equivalent.
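A minimal sketch of that last point, using the standard-library multiprocessing module to spread an element-wise computation over worker processes (chunk and worker counts are arbitrary; this is not the author's code):

```python
import numpy as np
from multiprocessing import Pool

def work(chunk):
    # Any element-wise computation; each worker handles one chunk.
    return np.sin(chunk) + np.cos(chunk)

a = np.random.random(1_000_000)
chunks = np.array_split(a, 4)  # one chunk per worker

# On platforms using the "spawn" start method (e.g. Windows), the Pool
# call needs an `if __name__ == "__main__":` guard.
with Pool(4) as pool:
    parts = pool.map(work, chunks)

result = np.concatenate(parts)
assert np.allclose(result, np.sin(a) + np.cos(a))
```

Whether this beats a single process depends on how expensive the per-element work is compared to the cost of shipping the chunks between processes.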
The "multiprocessing" vs. "multithreading" question is somewhat more complicated, as the terminology is a little shaky. In Python, a "thread" is something that runs under the same GIL, but at the operating-system level threads and processes are closely related; in Linux, for example, a thread is essentially a process that shares its memory with its parent.