Effective use of numpy.random.choice with duplicate numbers and alternatives

I need to create a large array with repeating elements, and my code is:

np.repeat(xrange(x,y), data)

However, data is a NumPy array of dtype float64 (the values are whole numbers, though; there is nothing like 2.1), and I get an error:

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

Example:

In [35]: x
Out[35]: 26

In [36]: y
Out[36]: 50

In [37]: data
Out[37]: 
array([ 3269.,   106.,  5533.,   317.,  1512.,   208.,   502.,   919.,
         406.,   421.,  1690.,  2236.,   705.,   505.,   230.,   213.,
         307.,  1628.,  4389.,  1491.,   355.,   103.,   854.,   424.])
In [38]: np.repeat(xrange(x,y), data)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-105860821359> in <module>()
----> 1 np.repeat(xrange(x,y), data)

/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in repeat(a, repeats, axis)
394         repeat = a.repeat
395     except AttributeError:
--> 396         return _wrapit(a, 'repeat', repeats, axis)
397     return repeat(repeats, axis)
398 

/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in _wrapit(obj, method, *args, **kwds)
 46     except AttributeError:
 47         wrap = None
---> 48     result = getattr(asarray(obj), method)(*args, **kwds)
 49     if wrap:
 50         if not isinstance(result, mu.ndarray):

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

I solved it by changing the code to

np.repeat(xrange(x,y), data.astype('int64'))

However, this is now one of the most expensive lines in my code! Is there any other alternative?

By the way, I use this inside

np.random.choice(np.repeat(xrange(x,y), data.astype('int64')), z)

to get a sample without replacement, of size z, of integers between x and y, where each integer can appear at most as many times as its count in data. Do you think this is the best approach for that as well?

+4
3 answers

NumPy does not have a built-in routine for sampling from an urn without replacement, but as @DiogoSantos describes in his answer, you can use the hypergeometric distribution to do exactly that. The approach below turns out to be faster than Divakar's optimized_v1.

Here sample(n, colors) is a function that draws a without-replacement sample of size n from an urn whose per-class counts are given by colors, using one hypergeometric draw per class, and optimized_v1 is Divakar's function from his answer below.

def hypergeom_version(x, y, z, data):
    s = sample(z, data)
    result = np.repeat(np.arange(x, y), s)
    return result

(Note that result is not in a random order. If you need the sample shuffled, add np.random.shuffle(result) before the return statement.)
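
The body of sample(n, colors) is not reproduced in this post. A minimal sketch of such a hypergeometric urn sampler, written only to match the way it is called above (the details are my assumption, not the answerer's exact code), could look like this:

import numpy as np

def sample(n, colors):
    """Draw n items without replacement from an urn whose composition
    is given by `colors` (the count of items in each class).  Returns
    an array holding the number drawn from each class."""
    colors = np.asarray(colors, dtype=np.int64)
    # remaining[i] = number of items in class i plus all later classes
    remaining = np.cumsum(colors[::-1])[::-1]
    result = np.zeros(len(colors), dtype=np.int64)
    for i in range(len(colors) - 1):
        if n < 1:
            break
        # "good" items belong to class i, "bad" items are everything
        # still left in the later classes.
        result[i] = np.random.hypergeometric(colors[i], remaining[i + 1], n)
        n -= result[i]
    result[-1] = n  # whatever is left must come from the last class
    return result

With a definition along these lines, the hypergeom_version benchmark above runs as written.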

Timing comparison:

In [153]: x = 100

In [154]: y = 100100

In [155]: z = 10000

In [156]: data = np.random.randint(1, 125, (y-x)).astype(float)

Divakar's optimized_v1:

In [157]: %timeit optimized_v1(x, y, z, data)
1 loop, best of 3: 520 ms per loop

hypergeom_version:

In [158]: %timeit hypergeom_version(x, y, z, data)
1 loop, best of 3: 244 ms per loop

If the values in data are larger, the difference is even bigger:

In [164]: data = np.random.randint(100, 500, (y-x)).astype(float)

In [165]: %timeit optimized_v1(x, y, z, data)
1 loop, best of 3: 2.91 s per loop

In [166]: %timeit hypergeom_version(x, y, z, data)
1 loop, best of 3: 246 ms per loop
+4

Here's one approach! To get a sense of the problem, let's say we have the elements given by a = np.arange(5), i.e.

a = np.array([0,1,2,3,4])

and suppose each of the 5 elements of a is to be repeated the following number of times:

reps = np.array([2,4,6,2,2])

Then the repeated array would be:

In [32]: rep_nums = np.repeat(a,reps)

In [33]: rep_nums
Out[33]: array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4])

Next, we want to choose z elements out of rep_nums with np.random.choice(), without replacement.

With z = 7, picking 7 elements with np.random.choice() gives us:

In [34]: np.random.choice(rep_nums,7,replace=False)
Out[34]: array([2, 4, 0, 2, 4, 1, 2])

Notice that even though we sample without replacement, the output contains duplicates; that's because those numbers are repeated in rep_nums. What np.random.choice() does guarantee is that no element shows up more often than it occurs in rep_nums: there are only two 4's in rep_nums, so the sample can contain at most two 4's.
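
As a quick sanity check (my own illustration, not part of the original answer), we can verify with np.bincount that no value is drawn more often than it occurs in rep_nums:

import numpy as np

a = np.array([0, 1, 2, 3, 4])
reps = np.array([2, 4, 6, 2, 2])
rep_nums = np.repeat(a, reps)

picked = np.random.choice(rep_nums, 7, replace=False)

# Count how often each value 0..4 was drawn and compare with its quota.
counts = np.bincount(picked, minlength=a.size)
assert np.all(counts <= reps)   # never more copies than rep_nums contains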

So the idea is that we can get the same kind of sample without actually materializing the repeated array with np.repeat, which is the expensive part.

Instead of sampling from rep_nums, with z = 7 we can sample 7 unique positions out of the length of rep_nums:

In [44]: np.random.choice(rep_nums.size,7,replace=False)
Out[44]: array([ 7,  2,  4, 10, 13,  8,  3])

Each of these positions falls into one of the bins (one per element, 5 of them in this toy case) that make up rep_nums, and mapping the positions back to their bins is exactly what np.searchsorted does. Bringing in the x, y offsets from the question, the implementation would be:

# Get the intervals of those bins
intervals = data.astype(int).cumsum()

# Decide length of array if we had repeated with `np.repeat`
max_num = intervals[-1]

# Get unique numbers (indices in this case)
ids = np.random.choice(max_num,z,replace=False)

# Use searchsorted to get bin IDs and add in `x` offset
out = x+np.searchsorted(intervals,ids,'right')
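
To make the mapping concrete, here is the same searchsorted step applied to the toy a/reps example, reusing the positions drawn in Out[44] above (my own illustration):

import numpy as np

a = np.array([0, 1, 2, 3, 4])
reps = np.array([2, 4, 6, 2, 2])
rep_nums = np.repeat(a, reps)

intervals = reps.cumsum()                  # array([ 2,  6, 12, 14, 16])
ids = np.array([7, 2, 4, 10, 13, 8, 3])    # the positions from Out[44]

bins = np.searchsorted(intervals, ids, 'right')
print(bins)                # [2 1 1 2 3 2 1]
print(rep_nums[ids])       # [2 1 1 2 3 2 1], identical, as expected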

Here is the original approach alongside the optimized one:

def org_app(x,y,z,data):
    rep_nums = np.repeat(range(x,y), data.astype('int64'))
    out = np.random.choice(rep_nums, z,replace=False)
    return out

def optimized_v1(x,y,z,data):     
    intervals = data.astype(int).cumsum()
    max_num = intervals[-1]
    ids = np.random.choice(max_num,z,replace=False)
    out = x+np.searchsorted(intervals,ids,'right')
    return out
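
As a quick consistency check (my own addition, not from the original answer), both functions should return z values in [x, y) and never hand out more copies of a value than data allows:

import numpy as np

# Assumes org_app and optimized_v1 from above are defined.
x, y, z = 100, 1100, 500
data = np.random.randint(1, 50, (y - x)).astype(float)

for func in (org_app, optimized_v1):
    out = func(x, y, z, data)
    assert out.size == z
    assert out.min() >= x and out.max() < y
    # No value may appear more often than its count in `data`.
    counts = np.bincount(out - x, minlength=y - x)
    assert np.all(counts <= data.astype(int))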

Runtime test -

In [79]: # Setup inputs
    ...: x = 100
    ...: y = 10010
    ...: z = 1000
    ...: data = np.random.randint(100,5000,(y-x)).astype(float)
    ...: 

In [80]: %timeit org_app(x,y,z,data)
1 loop, best of 3: 7.17 s per loop

In [81]: %timeit optimized_v1(x,y,z,data)
1 loop, best of 3: 6.92 s per loop

Doesn't look like much of an improvement overall. Let's dig deeper and see how much of the original approach's time goes into np.repeat -

In [82]: %timeit np.repeat(range(x,y), data.astype('int64'))
1 loop, best of 3: 227 ms per loop

So np.repeat accounts for only a small slice of the 7.17 s total; the bulk of the time in both approaches is spent inside np.random.choice() itself. The corresponding pre-processing steps of the optimized approach cost -

In [83]: intervals = data.astype(int).cumsum()
    ...: max_num = intervals[-1]
    ...: ids = np.random.choice(max_num,z,replace=False)
    ...: out = x+np.searchsorted(intervals,ids,'right')
    ...: 

In [84]: %timeit data.astype(int).cumsum()
10000 loops, best of 3: 36.6 µs per loop

In [85]: %timeit intervals[-1]
10000000 loops, best of 3: 142 ns per loop

In [86]: %timeit x+np.searchsorted(intervals,ids,'right')
10000 loops, best of 3: 127 µs per loop

which together add up to well under a millisecond, compared with the 227 ms spent in np.repeat!!

So the optimized approach eliminates the np.repeat overhead, but the total runtime is still dominated by np.random.choice() without replacement.

+4

For completeness, here is an alternative implementation. Given the counts we have in data, we can use hypergeometric sampling for each class:

  • compute the reverse cumulative sum of data, so that cumsum[pos] counts the items in class pos and everything after it
  • for each class in turn, draw np.random.hypergeometric(data[pos], cumsum[pos] - data[pos], remain), where remain is the number of draws still to be made

However, when there are many classes with only a few units in each, the per-class loop takes a lot of time.

+1