Stratified Sampling in Pandas

Question

I looked at the sklearn stratified sampling docs and the pandas docs, as well as "Stratified samples from Pandas" and "sklearn column-based stratified sampling", but they do not solve this problem.

I'm looking for a fast pandas / sklearn / numpy way to create stratified samples of size n from a dataset. However, for groups with fewer rows than the specified sample size, it should take all of the records.

Specific example:

Thanks!:)

+11

python numpy pandas scikit-learn

Wboy May 22, '17 at 13:41

2 answers

Expanding on the groupby answer, we can make sure that the sample is balanced. When every class has >= n_samples rows, we can just take n_samples from each class (the previous answer). When the minority class contains < n_samples rows, we can take the same number of samples from every class as the minority class has.

    def stratified_sample_df(df, col, n_samples):
        n = min(n_samples, df[col].value_counts().min())
        df_ = df.groupby(col).apply(lambda x: x.sample(n))
        df_.index = df_.index.droplevel(0)
        return df_
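To see what "balanced" means here, the following is a minimal sketch of the same idea on a small frame. The name balanced_sample and the random_state parameter are assumptions added for illustration, and the groups are concatenated by hand instead of going through groupby().apply(), since the handling of the grouping column inside apply differs between pandas versions.

```python
import pandas as pd


def balanced_sample(df, col, n_samples, random_state=None):
    """Hypothetical helper: draw the same number of rows from every class."""
    # Cap the per-class size at the size of the smallest class.
    n = min(n_samples, df[col].value_counts().min())
    # Iterating over a groupby yields (key, sub-frame) pairs.
    parts = [g.sample(n, random_state=random_state) for _, g in df.groupby(col)]
    return pd.concat(parts)


df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 2, 3, 4, 4], "B": range(10)})

# Class 3 has a single row, so every class contributes exactly one row.
balanced = balanced_sample(df, "A", 2, random_state=0)
balanced_counts = balanced["A"].value_counts().to_dict()
```

Note that this balances the classes by shrinking all of them to the smallest class, which is different from what the question asks for (keeping every record of a small group while capping the large ones).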

0

Ilya prokin Dec 04 '18 at 14:58

piRSquared · Accepted Answer · 2017-05-22T14:20:48+0000

Use min when passing the number to sample. Consider the dataframe df

    import pandas as pd

    df = pd.DataFrame(dict(
        A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
        B=range(10)
    ))

    # Sample at most 2 rows per group; smaller groups are kept whole
    df.groupby('A', group_keys=False).apply(lambda x: x.sample(min(len(x), 2)))

       A  B
    1  1  1
    2  1  2
    3  2  3
    6  2  6
    7  3  7
    9  4  9
    8  4  8
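The min(len(x), n) trick above could be wrapped in a reusable function along these lines. This is only a sketch: the name sample_up_to_n and the random_state parameter are assumptions, not part of the answer, and the per-group samples are concatenated explicitly so the grouping column is kept regardless of pandas version.

```python
import pandas as pd


def sample_up_to_n(df, col, n, random_state=None):
    """Hypothetical helper: up to n rows per group, all rows of smaller groups."""
    parts = [
        # min(len(g), n) keeps sample() from asking for more rows than exist.
        g.sample(min(len(g), n), random_state=random_state)
        for _, g in df.groupby(col)
    ]
    return pd.concat(parts)


df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 2, 3, 4, 4], "B": range(10)})

sampled = sample_up_to_n(df, "A", 2, random_state=0)
sampled_counts = sampled["A"].value_counts().to_dict()
```

On this frame the groups with A equal to 1, 2 and 4 each contribute two rows, while the single row with A equal to 3 is kept as-is, which gives seven rows in total, the same shape as the output shown above.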
