Why is numpy.random.choice so slow?

While writing a script, I came across the numpy.random.choice function and used it because it was much cleaner than the equivalent if statement. After running the script, however, I realized that it is much slower than the if statement.

The following is the MWE. The first method takes 0.0 s and the second takes 7.2 s. If you increase the number of iterations in the outer loop, you will see how quickly random.choice falls behind.

Can anyone comment on why random.choice is so much slower?

    import numpy as np
    import numpy.random as rand
    import time as tm

    #-------------------------------------------------------------------------------

    tStart = tm.time()
    for i in xrange(100):
        for j in xrange(1000):
            tmp = rand.rand()
            if tmp < 0.25:
                var = 1
            elif tmp < 0.5:
                var = -1
    print('Time: %.1f s' % (tm.time() - tStart))

    #-------------------------------------------------------------------------------

    tStart = tm.time()
    for i in xrange(100):
        for j in xrange(1000):
            var = rand.choice([-1, 0, 1], p=[0.25, 0.5, 0.25])
    print('Time: %.1f s' % (tm.time() - tStart))
4 answers

You are using it wrong. Vectorize the operation, or numpy will do you no good:

    var = numpy.random.choice([-1, 0, 1], size=1000, p=[0.25, 0.5, 0.25])

Timing data:

    >>> timeit.timeit('''numpy.random.choice([-1, 0, 1],
    ...                                      size=1000,
    ...                                      p=[0.25, 0.5, 0.25])''',
    ...               'import numpy', number=10000)
    2.380380242513752
    >>> timeit.timeit('''
    ... var = []
    ... for i in xrange(1000):
    ...     tmp = rand.rand()
    ...     if tmp < 0.25:
    ...         var.append(1)
    ...     elif tmp < 0.5:
    ...         var.append(-1)
    ...     else:
    ...         var.append(0)''',
    ... setup='import numpy.random as rand', number=10000)
    5.673041396894519
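
If the values really are consumed one at a time inside a loop, another pattern (a sketch of mine, not part of this answer) is to draw the whole batch up front and index into it, which amortizes choice's per-call overhead across all iterations:

    import numpy as np

    # Sketch: pre-draw all 100 * 1000 samples in one vectorized call,
    # then consume them one at a time inside the original loops.
    samples = np.random.choice([-1, 0, 1], size=(100, 1000), p=[0.25, 0.5, 0.25])
    for i in range(100):
        for j in range(1000):
            var = samples[i, j]  # same distribution, no per-iteration choice() call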

It took me a long time to realize that my data generator was very slow because it picked random keys with np.random.choice.

In case a non-uniform distribution is NOT necessary, here is the suitable solution I found:

replace

    def get_random_key(a_huge_key_list):
        return np.random.choice(a_huge_key_list)

with

    def get_random_key(a_huge_key_list):
        L = len(a_huge_key_list)
        i = np.random.randint(0, L)
        return a_huge_key_list[i]

which sped things up for me by a factor of about 60.
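
Untimed here, but if only a single uniform pick is needed, the standard library's random.choice is another option worth knowing; a minimal sketch:

    import random

    def get_random_key(a_huge_key_list):
        # The stdlib random.choice does one index lookup and skips the
        # array conversion and argument validation that np.random.choice
        # performs on every call.
        return random.choice(a_huge_key_list)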


I suspect that the generality of np.random.choice is what slows it down, more so for small samples than for large ones.

Rough vectorization of the if version:

    def foo(n):
        x = np.random.rand(n)
        var = np.zeros(n)
        var[x < .25] = -1
        var[x > .75] = 1
        return var

Running in ipython I get:

    timeit np.random.choice([-1, 0, 1], size=1000, p=[.25, .5, .25])
    1000 loops, best of 3: 293 us per loop

    timeit foo(1000)
    10000 loops, best of 3: 83.4 us per loop

    timeit np.random.choice([-1, 0, 1], size=100000, p=[.25, .5, .25])
    100 loops, best of 3: 11 ms per loop

    timeit foo(100000)
    100 loops, best of 3: 8.12 ms per loop

So for a size of 1000, choice is 3-4x slower, but with large vectors the difference begins to disappear.
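
For arbitrary options and probabilities, one way to generalize foo (my sketch, not part of this answer) is a cumulative sum plus a vectorized binary search:

    import numpy as np

    def weighted_draws(options, probs, n):
        # Cumulative probabilities: [0.25, 0.5, 0.25] -> [0.25, 0.75, 1.0].
        cum = np.cumsum(probs)
        # For each uniform draw, find the first bin whose cumulative
        # probability reaches it; this is foo's if-chain done in C.
        idx = np.searchsorted(cum, np.random.rand(n))
        return np.asarray(options)[idx]

    var = weighted_draws([-1, 0, 1], [0.25, 0.5, 0.25], 1000)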


This solution, using a cumulative sum, is about 25 times faster:

    import time
    from collections import Counter

    import numpy as np

    def choice(options, probs):
        x = np.random.rand()
        cum = 0
        for i, p in enumerate(probs):
            cum += p
            if x < cum:
                break
        return options[i]

    options = ['a', 'b', 'c', 'd']
    probs = [0.2, 0.6, 0.15, 0.05]
    runs = 100000

    now = time.time()
    temp = []
    for i in range(runs):
        op = choice(options, probs)
        temp.append(op)
    temp = Counter(temp)
    for op, x in temp.items():
        print(op, x / runs)
    print(time.time() - now)
    print("")

    now = time.time()
    temp = []
    for i in range(runs):
        op = np.random.choice(options, p=probs)
        temp.append(op)
    temp = Counter(temp)
    for op, x in temp.items():
        print(op, x / runs)
    print(time.time() - now)

Running it, I get:

    b 0.59891
    a 0.20121
    c 0.15007
    d 0.04981
    0.16232800483703613

    b 0.5996
    a 0.20138
    c 0.14856
    d 0.05046
    3.8451428413391113
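
A side note not in the answer above: since Python 3.6 the standard library offers random.choices, which implements the same cumulative-weight idea (via bisect) and avoids numpy's per-call overhead for this kind of small weighted pick:

    import random

    options = ['a', 'b', 'c', 'd']
    probs = [0.2, 0.6, 0.15, 0.05]

    # k sets how many values to draw; weights need not sum to 1.
    op = random.choices(options, weights=probs, k=1)[0]
    draws = random.choices(options, weights=probs, k=100000)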
