The most efficient way to randomly null out values in a DataFrame

Consider the DataFrame df:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))
df


How can I randomly set ~20% of these values to NaN?


2 answers

You can use numpy.random.choice to create a boolean mask:

import numpy as np

# each cell is True with probability 0.2 -- those cells get masked out
mask = np.random.choice([True, False], size=df.shape, p=[.2, .8])

df.mask(mask)

In one line:

df.mask(np.random.choice([True, False], size=df.shape, p=[.2,.8]))
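
df.mask replaces the cells where the mask is True with NaN, so on average 20% of the values become null (the exact count varies from run to run, since each cell is drawn independently). If you need a reproducible result, seed a generator first; a minimal sketch using the NumPy Generator API (the seed 42 is arbitrary; np.random.default_rng requires NumPy >= 1.17, and on older versions np.random.seed before np.random.choice gives the same repeatability):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))

rng = np.random.default_rng(42)                 # seeded generator -> repeatable mask
mask = rng.choice([True, False], size=df.shape, p=[.2, .8])

out = df.mask(mask)                             # True cells become NaN
print(out.isna().mean().mean())                 # overall NaN fraction, ~0.2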

Speed tested with timeit at ~770 μs per loop:

$ python -m timeit -n 10000 \
      -s "import pandas as pd;import numpy as np;df=pd.DataFrame(np.ones((10,10))*2)" \
      "df.mask(np.random.choice([True,False], size=df.shape, p=[.2,.8]))"
10000 loops, best of 3: 770 usec per loop
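
The same measurement can be run from inside Python with the timeit module, which avoids the shell quoting; a minimal sketch (absolute timings depend on the machine and pandas version):

import timeit

setup = ("import pandas as pd; import numpy as np; "
         "df = pd.DataFrame(np.ones((10, 10)) * 2)")
stmt = "df.mask(np.random.choice([True, False], size=df.shape, p=[.2, .8]))"

total = timeit.timeit(stmt, setup=setup, number=10000)   # seconds for 10000 runs
print("%.0f usec per loop" % (total / 10000 * 1e6))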

You can use stack and unstack together with sample. The frac argument is the fraction you want to keep, i.e. the share of non-null values in the resulting frame (one minus the proportion of null values).

df = df.stack().sample(frac=0.8).unstack()

Result:

     A    B    C    D    E    F    G    H    I    J
a  2.0  2.0  2.0  2.0  2.0  NaN  NaN  2.0  2.0  2.0
b  2.0  NaN  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0
c  2.0  NaN  NaN  2.0  2.0  2.0  NaN  2.0  2.0  2.0
d  2.0  2.0  2.0  2.0  2.0  NaN  2.0  2.0  2.0  2.0
e  2.0  2.0  2.0  2.0  2.0  NaN  2.0  NaN  2.0  NaN
f  2.0  2.0  NaN  NaN  2.0  NaN  2.0  2.0  2.0  2.0
g  2.0  2.0  NaN  2.0  NaN  2.0  2.0  2.0  2.0  2.0
h  2.0  2.0  2.0  2.0  2.0  2.0  2.0  NaN  NaN  2.0
i  NaN  2.0  2.0  2.0  2.0  2.0  NaN  2.0  2.0  2.0
j  2.0  2.0  2.0  2.0  NaN  2.0  2.0  2.0  2.0  2.0
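
To see why this works: stack flattens the frame into a Series indexed by (row, column) pairs, sample(frac=0.8) keeps a random 80% of those entries, and unstack pivots the survivors back into a frame, leaving NaN wherever an entry was dropped. Unlike the mask approach, sample draws without replacement, so the null count is exact rather than an average; a quick check (this assumes no whole row or column was sampled away, see the edit below):

out = df.stack().sample(frac=0.8).unstack()
print(out.isna().sum().sum())   # exactly 20 of the 100 cells are NaN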

EDIT:

As noted in the comments, if sampling happens to drop every value in a row or column, unstack will omit that row/column entirely rather than fill it with NaN. To guarantee the original shape, chain reindex with the original index and columns:

df = df.stack().sample(frac=0.8).unstack().reindex(index=df.index, columns=df.columns)
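
reindex reinstates any row or column labels that unstack dropped, filling them with NaN, so the result always matches the original shape; a quick check:

out = df.stack().sample(frac=0.8).unstack().reindex(index=df.index, columns=df.columns)
assert out.shape == df.shape   # always 10x10, even if a full row/column was nulled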

Source: https://habr.com/ru/post/1658728/

