Pandas randomly replace k percent

I have a simple pandas DataFrame with two columns, for example id and value, where value is either 0 or 1. I would like to randomly replace 10% of the rows where value == 1 with 0.

How can I achieve this behavior with pandas?

3 answers

pandas answer

  • use query to filter df only with value == 1
  • use sample(frac=.1) to take 10% of these
  • use result index to assign zero

 df.loc[ df.query('value == 1').sample(frac=.1).index, 'value' ] = 0 
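
The three steps above can be checked on a small frame. This is a minimal sketch, assuming a hypothetical 1000-row frame and a fixed random_state for reproducibility (neither appears in the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 1000 rows, value is 0 or 1.
rng = np.random.default_rng(0)
df = pd.DataFrame({"id": np.arange(1000), "value": rng.integers(0, 2, size=1000)})

ones_before = int((df["value"] == 1).sum())

# Pick 10% of the rows where value == 1, then set those rows to 0 by index label.
picked = df.query("value == 1").sample(frac=0.1, random_state=0).index
df.loc[picked, "value"] = 0

ones_after = int((df["value"] == 1).sum())
print(ones_before, ones_after, len(picked))
```

The drop in the count of ones equals the number of sampled rows, because sample draws without replacement by default.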

alternative numpy answer

  • get a boolean array that is True where df['value'] is 1
  • assign to those rows a random array of zeros and ones, drawn with probabilities 10% and 90%

 v = df.value.values == 1
 df.loc[v, 'value'] = np.random.choice((0, 1), v.sum(), p=(.1, .9))
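
Because each replacement is drawn independently, this variant flips 10% of the ones on average rather than exactly 10%. A minimal sketch that makes the difference visible, assuming hypothetical data and a seeded generator:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 10,000 rows, value is 0 or 1.
rng = np.random.default_rng(1)
df = pd.DataFrame({"id": np.arange(10_000), "value": rng.integers(0, 2, size=10_000)})

mask = df["value"].to_numpy() == 1
n_ones = int(mask.sum())

# Each row with value == 1 independently becomes 0 with probability 0.1.
df.loc[mask, "value"] = rng.choice([0, 1], size=n_ones, p=[0.1, 0.9])

n_flipped = n_ones - int((df["value"] == 1).sum())
print(n_flipped / n_ones)  # close to 0.1, but usually not exactly 0.1
```

If an exact count is required, sampling a fixed number of rows without replacement is the safer choice.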

You can use numpy.random.choice:

 >>> idx = df.index[df.value == 1]
 >>> df.loc[np.random.choice(idx, size=idx.size // 10, replace=False), 'value'] = 0
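
The same idea can be written with NumPy's seeded Generator API, which makes the selection reproducible. This is a sketch under assumptions: the data and the non-default index labels are hypothetical, chosen to show that .loc receives labels rather than positions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(
    {"id": np.arange(200), "value": rng.integers(0, 2, size=200)},
    index=np.arange(1000, 1200),  # hypothetical non-default labels
)

# Index labels of the rows where value == 1.
idx = df.index[df["value"] == 1].to_numpy()
n_ones = idx.size

# Draw exactly one tenth of those labels (rounded down), without repeats.
chosen = rng.choice(idx, size=n_ones // 10, replace=False)

# Select rows and column in a single .loc call so the assignment hits df itself.
df.loc[chosen, "value"] = 0
```

Passing the column name inside the same .loc call matters: assigning through df.loc[rows].value would write to a temporary copy and leave df unchanged.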

Here's the NumPy approach with np.random.choice -

 a = df.value.values        # get a view into value col
 idx = np.flatnonzero(a)    # get the nonzero indices
 # Finally select unique 10% from those indices and set 0s there
 a[np.random.choice(idx,size=int(0.1*len(idx)),replace=0)] = 0

Run Example -

 In [237]: df = pd.DataFrame(np.random.randint(0,2,(100,2)),columns=['id','value'])
 In [238]: (df.value==1).sum() # Original Count of 1s in df.value column
 Out[238]: 53
 In [239]: a = df.value.values
 In [240]: idx = np.flatnonzero(a)
 In [241]: a[np.random.choice(idx,size=int(0.1*len(idx)),replace=0)] = 0
 In [242]: (df.value==1).sum() # New count of 1s in df.value column
 Out[242]: 48

Alternatively, a more pandas approach is

 idx = np.flatnonzero(df['value'])   # positions of the nonzero rows
 rows = df.index[np.random.choice(idx, size=int(0.1*len(idx)), replace=False)]
 df.loc[rows, 'value'] = 0
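
To cover the "k percent" case from the title, the positional approach can be wrapped in a small helper. This is only a sketch: the function name zero_out_fraction and its parameters are hypothetical, and it returns a modified copy instead of changing the frame in place.

```python
import numpy as np
import pandas as pd


def zero_out_fraction(df, column="value", frac=0.1, seed=None):
    """Return a copy of df with a fraction `frac` of the 1s in `column` set to 0.

    Hypothetical helper for illustration; not part of pandas or NumPy.
    """
    rng = np.random.default_rng(seed)
    # Integer positions (not labels) of the rows where the column equals 1.
    pos = np.flatnonzero(df[column].to_numpy() == 1)
    k = int(frac * len(pos))  # number of rows to flip, rounded down
    chosen = rng.choice(pos, size=k, replace=False)
    out = df.copy()
    # iloc is positional, so this works for any index labels.
    out.iloc[chosen, out.columns.get_loc(column)] = 0
    return out


# Usage on hypothetical data: 50 ones followed by 50 zeros.
df = pd.DataFrame({"id": np.arange(100), "value": [1] * 50 + [0] * 50})
result = zero_out_fraction(df, frac=0.1, seed=0)
print((df["value"] == 1).sum(), (result["value"] == 1).sum())
```

Working on a copy keeps the original frame available for comparison; drop the copy if in-place modification is preferred.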

Runtime test

All approaches posted so far -

 def f1(df): #@piRSquared soln1
     df.loc[df.query('value == 1').sample(frac=.1).index,'value'] = 0

 def f2(df): #@piRSquared soln2
     v = df.value.values == 1
     df.loc[v, 'value'] = np.random.choice((0, 1), v.sum(), p=(.1, .9))

 def f3(df): #@Roman Pekar soln
     idx = df.index[df.value==1]
     # On Python 3, size must be an integer: write idx.size // 10.
     # Assigning through .loc[rows].value writes to a copy, not to df.
     df.loc[np.random.choice(idx, size=idx.size/10, replace=False)].value = 0

 def f4(df): #@Mine soln1
     a = df.value.values
     idx = np.flatnonzero(a)
     a[np.random.choice(idx,size=int(0.1*len(idx)),replace=0)] = 0

 def f5(df): #@Mine soln2
     idx = np.flatnonzero(df['value'])
     # df.ix was removed in pandas 1.0; current versions need .loc or .iloc.
     df.ix[np.random.choice(idx,size=int(0.1*len(idx)),replace=0),'value'] = 0

Timings -

 In [2]: # Setup inputs
    ...: df = pd.DataFrame(np.random.randint(0,2,(10000,2)),columns=['id','value'])
    ...: df1 = df.copy()
    ...: df2 = df.copy()
    ...: df3 = df.copy()
    ...: df4 = df.copy()
    ...: df5 = df.copy()
    ...:

 In [3]: # Timings
    ...: %timeit f1(df1)
    ...: %timeit f2(df2)
    ...: %timeit f3(df3)
    ...: %timeit f4(df4)
    ...: %timeit f5(df5)
    ...:
 100 loops, best of 3: 3.96 ms per loop
 1000 loops, best of 3: 844 µs per loop
 1000 loops, best of 3: 1.62 ms per loop
 10000 loops, best of 3: 163 µs per loop
 1000 loops, best of 3: 663 µs per loop

Source: https://habr.com/ru/post/1014197/

