Stratified Sampling in Pandas

I looked at the sklearn stratified sampling docs, the pandas docs, and the existing questions on stratified samples in pandas and on column-based stratified sampling in sklearn, but none of them address this problem.

I'm looking for a fast pandas / sklearn / numpy way to generate stratified samples of size n from a dataset. However, for groups (strata) with fewer than n rows, it should take all of the rows in that group.

Specific example:

[image not preserved]

Thanks!:)

2 answers

Use min when passing the number of rows to sample, so that a group smaller than the requested size is taken whole. Consider the DataFrame df:

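    import pandas as pd

    df = pd.DataFrame(dict(
        A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
        B=range(10)
    ))

    # Take 2 rows per group, or the whole group if it has fewer than 2 rows.
    # Sampling is random, so the rows picked will differ between runs.
    df.groupby('A', group_keys=False).apply(lambda x: x.sample(min(len(x), 2)))

       A  B
    1  1  1
    2  1  2
    3  2  3
    6  2  6
    7  3  7
    9  4  9
    8  4  8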
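The same idea can be wrapped in a small reusable function. The following is only a sketch: the name `sample_up_to_n` and the `random_state` argument are illustrative choices, not part of the original answer, and it concatenates per-group samples instead of calling `groupby(...).apply` so that it does not depend on how `apply` treats the grouping column.

```python
import pandas as pd


def sample_up_to_n(df, col, n, random_state=None):
    """Take at most n rows from each group of `col` (hypothetical helper).

    Groups with fewer than n rows are returned whole.
    """
    parts = [
        # min() caps the request at the group size, as in the answer above.
        group.sample(n=min(len(group), n), random_state=random_state)
        for _, group in df.groupby(col)
    ]
    return pd.concat(parts)


df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 2, 3, 4, 4], "B": range(10)})
sampled = sample_up_to_n(df, "A", 2, random_state=0)
print(sampled)
```

With this data, groups 1, 2 and 4 each contribute two rows and group 3 contributes its single row. Passing a fixed `random_state` makes the selection repeatable.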

Building on the groupby answer, we can make sure that the sample is balanced. When every class has >= n_samples rows, we can simply take n_samples from each class (as in the previous answer). When the minority class contains < n_samples rows, we take from every class only as many rows as the minority class has.

    def stratified_sample_df(df, col, n_samples):
        n = min(n_samples, df[col].value_counts().min())
        df_ = df.groupby(col).apply(lambda x: x.sample(n))
        df_.index = df_.index.droplevel(0)
        return df_
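For comparison, here is a sketch of the same balanced sampling written with `GroupBy.sample`, assuming pandas 1.1 or newer where that method is available. The function name `balanced_sample` and the `random_state` argument are illustrative and not part of the answer above.

```python
import pandas as pd


def balanced_sample(df, col, n_samples, random_state=None):
    """Draw the same number of rows from every class in `col` (hypothetical helper)."""
    # Cap the per-class size at the smallest class so all classes contribute equally.
    n = int(min(n_samples, df[col].value_counts().min()))
    # GroupBy.sample keeps the original row index, so no droplevel step is needed.
    return df.groupby(col).sample(n=n, random_state=random_state)


df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 2, 3, 4, 4], "B": range(10)})
balanced = balanced_sample(df, "A", 2, random_state=0)
print(balanced)
```

Here the smallest class (A == 3) has a single row, so each class contributes one row even though two were requested. That is the trade-off of this approach: the result is balanced, but larger classes are cut down to the size of the smallest one.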

Source: https://habr.com/ru/post/1268102/

