The most efficient way to randomly null out values in a DataFrame

Consider the DataFrame df:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))
df


How can I randomly set ~20% of these values to NaN?


2 answers

You can use numpy.random.choice to create a boolean mask:

import numpy as np

# each cell is True with probability 0.2 -- those cells get masked out
mask = np.random.choice([True, False], size=df.shape, p=[.2, .8])

df.mask(mask)

In one line:

df.mask(np.random.choice([True, False], size=df.shape, p=[.2,.8]))
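
df.mask replaces the cells where the mask is True with NaN, so on average 20% of the values become null (the exact count varies from run to run, since each cell is drawn independently). If you need a reproducible result, seed a generator first; a minimal sketch using the NumPy Generator API (the seed 42 is arbitrary; np.random.default_rng requires NumPy >= 1.17, and on older versions np.random.seed before np.random.choice gives the same repeatability):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))

rng = np.random.default_rng(42)                 # seeded generator -> repeatable mask
mask = rng.choice([True, False], size=df.shape, p=[.2, .8])

out = df.mask(mask)                             # True cells become NaN
print(out.isna().mean().mean())                 # overall NaN fraction, ~0.2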

Speed tested with timeit at ~770 μs per loop:

$ python -m timeit -n 10000 \
      -s "import pandas as pd;import numpy as np;df=pd.DataFrame(np.ones((10,10))*2)" \
      "df.mask(np.random.choice([True,False], size=df.shape, p=[.2,.8]))"
10000 loops, best of 3: 770 usec per loop
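
The same measurement can be run from inside Python with the timeit module, which avoids the shell quoting; a minimal sketch (absolute timings depend on the machine and pandas version):

import timeit

setup = ("import pandas as pd; import numpy as np; "
         "df = pd.DataFrame(np.ones((10, 10)) * 2)")
stmt = "df.mask(np.random.choice([True, False], size=df.shape, p=[.2, .8]))"

total = timeit.timeit(stmt, setup=setup, number=10000)   # seconds for 10000 runs
print("%.0f usec per loop" % (total / 10000 * 1e6))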

You can use stack and unstack together with sample. The frac argument is the fraction you want to keep, i.e. the share of non-null values in the resulting frame (one minus the proportion of null values).

df = df.stack().sample(frac=0.8).unstack()

Result:

     A    B    C    D    E    F    G    H    I    J
a  2.0  2.0  2.0  2.0  2.0  NaN  NaN  2.0  2.0  2.0
b  2.0  NaN  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0
c  2.0  NaN  NaN  2.0  2.0  2.0  NaN  2.0  2.0  2.0
d  2.0  2.0  2.0  2.0  2.0  NaN  2.0  2.0  2.0  2.0
e  2.0  2.0  2.0  2.0  2.0  NaN  2.0  NaN  2.0  NaN
f  2.0  2.0  NaN  NaN  2.0  NaN  2.0  2.0  2.0  2.0
g  2.0  2.0  NaN  2.0  NaN  2.0  2.0  2.0  2.0  2.0
h  2.0  2.0  2.0  2.0  2.0  2.0  2.0  NaN  NaN  2.0
i  NaN  2.0  2.0  2.0  2.0  2.0  NaN  2.0  2.0  2.0
j  2.0  2.0  2.0  2.0  NaN  2.0  2.0  2.0  2.0  2.0
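
To see why this works: stack flattens the frame into a Series indexed by (row, column) pairs, sample(frac=0.8) keeps a random 80% of those entries, and unstack pivots the survivors back into a frame, leaving NaN wherever an entry was dropped. Unlike the mask approach, sample draws without replacement, so the null count is exact rather than an average; a quick check (this assumes no whole row or column was sampled away, see the edit below):

out = df.stack().sample(frac=0.8).unstack()
print(out.isna().sum().sum())   # exactly 20 of the 100 cells are NaN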

EDIT:

As noted in the comments, if sampling happens to drop every value in a row or column, unstack will omit that row/column entirely rather than fill it with NaN. To guarantee the original shape, chain reindex with the original index and columns:

df = df.stack().sample(frac=0.8).unstack().reindex(index=df.index, columns=df.columns)
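
reindex reinstates any row or column labels that unstack dropped, filling them with NaN, so the result always matches the original shape; a quick check:

out = df.stack().sample(frac=0.8).unstack().reindex(index=df.index, columns=df.columns)
assert out.shape == df.shape   # always 10x10, even if a full row/column was nulled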

Source: https://habr.com/ru/post/1658728/

