Stratified Samples from Pandas

I have a pandas DataFrame that looks something like this:

cli_id | X1 | X2 | X3 | ... | Xn | Y   |
----------------------------------------
123    | 1  | A  | XX | ... | 4  | 0.1 |
456    | 2  | B  | XY | ... | 5  | 0.2 |
789    | 1  | B  | XY | ... | 5  | 0.3 |
101    | 2  | A  | XX | ... | 4  | 0.1 |
...

I have a client identifier, several categorical attributes, and Y, the probability of an event, which takes values from 0 to 1 in steps of 0.1.

I need to take a stratified sample of size 200 within each group of Y (so 10 times).

I often use this to take a stratified sample when splitting into train / test sets:

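    # current scikit-learn API; older releases used sklearn.cross_validation with n_iter
    from sklearn.model_selection import StratifiedShuffleSplit

    def stratifiedSplit(X, y, size):
        # n_splits=1 yields a single stratified train/test split
        sss = StratifiedShuffleSplit(n_splits=1, test_size=size, random_state=0)
        for train_index, test_index in sss.split(X, y):
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        return X_train, X_test, y_train, y_test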

But I do not know how to adapt it to this case.

2 answers

I'm not quite sure if you mean this:

    import numpy as np

    strats = []
    for k in range(11):
        y_val = k * 0.1
        # np.isclose avoids float mismatches such as 3 * 0.1 != 0.3
        dummy_df = your_df[np.isclose(your_df['Y'], y_val)]
        strats.append(dummy_df.sample(200))

This builds a temporary DataFrame containing only the rows with the Y value you want, and then takes a sample of 200 from it.
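To check the loop above end to end, here is a minimal sketch on synthetic data. The frame `your_df`, its size, and the helper `sample_per_y` are made up for illustration. The sketch also assumes that a Y level with fewer than 200 rows should simply be returned whole; the loop above does not handle that case, since `sample(200)` raises an error when the group is too small.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical values).
rng = np.random.default_rng(0)
n_rows = 5000
your_df = pd.DataFrame({
    "cli_id": np.arange(n_rows),
    "X1": rng.integers(1, 3, size=n_rows),
    # Y takes the 11 values 0.0, 0.1, ..., 1.0
    "Y": rng.integers(0, 11, size=n_rows) / 10,
})


def sample_per_y(df, n=200, seed=0):
    """Draw up to n rows for each Y level (hypothetical helper)."""
    parts = []
    for k in range(11):
        # Compare with a tolerance: k * 0.1 is not always exactly k / 10.
        level = df[np.isclose(df["Y"], k * 0.1)]
        # Assumption: a level with fewer than n rows is kept whole.
        parts.append(level.sample(min(n, len(level)), random_state=seed))
    return pd.concat(parts)


sample = sample_per_y(your_df)
# Number of sampled rows per Y level.
print(sample["Y"].value_counts().sort_index())
```

Passing `random_state` makes the draw reproducible; drop it if a fresh sample is wanted on every run.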

OK, so you need the different chunks to have the same structure. I think this is a little trickier; here is how I would do it:

First of all, I would get a histogram of what X1 looks like:

    # nbins + 1 edges give nbins bins
    hist, edges = np.histogram(your_df['X1'], bins=np.linspace(min_x, max_x, nbins + 1))

This gives a histogram with nbins bins.

Now the strategy is to draw a certain number of rows depending on their value of X1. We draw more from the bins with more observations and fewer from the bins with fewer, so that the structure of X1 is preserved.

In particular, the relative contribution of each bin should be:

 rel = [float(i) / sum(hist) for i in hist] 

It will be something like [0.1, 0.2, 0.1, 0.3, 0.3]

If we need 200 samples, we need to draw:

 draws_in_bin = [int(i*200) for i in rel] 
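One caveat with `int(i*200)`: it truncates, so the per-bin counts can add up to fewer than 200. One possible fix is largest-remainder allocation, sketched below. This is a swapped-in technique, not part of the answer above, and the helper name `allocate_draws` and the example shares are made up for illustration.

```python
import math


def allocate_draws(rel, total):
    """Split `total` draws across bins in proportion to `rel` (hypothetical helper).

    Each bin first gets the floor of its exact share; the draws still
    missing go to the bins with the largest fractional remainders.
    """
    exact = [share * total for share in rel]
    draws = [math.floor(x) for x in exact]
    missing = total - sum(draws)
    # Bins ordered by how much the floor cut off, largest first.
    order = sorted(range(len(rel)), key=lambda i: exact[i] - draws[i], reverse=True)
    for i in order[:missing]:
        draws[i] += 1
    return draws


rel = [1 / 3, 1 / 3, 1 / 3]                  # example shares (made up)
truncated = [int(share * 200) for share in rel]
print(sum(truncated))                        # 198: truncation loses two draws
print(allocate_draws(rel, 200))              # [67, 67, 66]
```

When all remainders are equal, as here, the extra draws go to the first bins; any other tie-breaking rule would serve equally well.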

Now we know how many observations to draw from each bin:

    strats = []
    for k in range(11):
        y_val = k * 0.1
        # get a dataframe for every value of Y
        # (np.isclose avoids float mismatches such as 3 * 0.1 != 0.3)
        dummy_df = your_df[np.isclose(your_df['Y'], y_val)]
        bin_strat = []
        for left_edge, right_edge, n_draws in zip(edges[:-1], edges[1:], draws_in_bin):
            # left-closed bins, so rows sitting exactly on a left edge are kept;
            # rows equal to the last edge (max_x) are still left out
            bin_df = dummy_df[(dummy_df['X1'] >= left_edge) & (dummy_df['X1'] < right_edge)]
            # this takes the right number of draws out
            # of the X1 bin where we currently are
            bin_strat.append(bin_df.sample(n_draws))
        # Note that every element of bin_strat is a dataframe
        # with a number of entries that corresponds to the
        # structure of draws_in_bin
        #
        # concatenate the dataframes for every bin and append to the list
        strats.append(pd.concat(bin_strat))

If the number of samples is the same for each group, or if the proportion is constant for each group, you can try something like

 df.groupby('Y').apply(lambda x: x.sample(n=200)) 

or

 df.groupby('Y').apply(lambda x: x.sample(frac=.1)) 

To perform stratified sampling with respect to several variables, just group by more variables. For this, you may need to construct new binned variables.
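As an illustration of grouping by more variables, here is a sketch that stratifies on Y together with a binned version of a numeric column. The column `age`, the bin edges, the labels, and the synthetic frame are all assumptions made for the example. It uses `pd.cut` to build the binned variable and `groupby(...).sample`, available in pandas 1.1 and later, in place of the `apply`/`lambda` form.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_rows = 4000
df = pd.DataFrame({
    "Y": rng.integers(0, 11, size=n_rows) / 10,
    "age": rng.uniform(18, 80, size=n_rows),  # hypothetical numeric column
})

# Build a binned variable so that a continuous column can serve as a stratum.
# The labels are converted to plain strings to keep the grouping simple.
df["age_bin"] = pd.cut(
    df["age"],
    bins=[18, 30, 45, 60, 80],
    labels=["18-30", "30-45", "45-60", "60-80"],
    include_lowest=True,
).astype(str)

# Sample 10% of the rows within every (Y, age_bin) cell.
sample = df.groupby(["Y", "age_bin"]).sample(frac=0.1, random_state=0)

# The age_bin shares in the sample should be close to those in the full data.
print(df["age_bin"].value_counts(normalize=True).sort_index())
print(sample["age_bin"].value_counts(normalize=True).sort_index())
```

The more variables are combined, the smaller each cell gets, so it is worth checking the cell sizes before choosing `frac` or `n`.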

However, if a group is too small relative to the proportion, for example a group of size 1 with a proportion of .25, no item will be returned. This is because the fractional sample size (here 0.25) is converted to an integer, which gives 0, just as int(0.25) = 0 in Python.
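The small-group effect is easy to reproduce. The sketch below uses a made-up frame in which one Y group has a single row. It then shows one possible workaround, which is an assumption rather than part of this answer: take at least one row from every group. The helper `sample_at_least_one` is hypothetical, and `groupby(...).sample` requires pandas 1.1 or later.

```python
import math

import pandas as pd

# Made-up data: Y == 0.1 has 8 rows, Y == 0.2 has a single row.
df = pd.DataFrame({"Y": [0.1] * 8 + [0.2], "X1": range(9)})

# Plain fractional sampling: the one-row group contributes nothing.
plain = df.groupby("Y").sample(frac=0.25, random_state=0)
print(plain["Y"].value_counts())


def sample_at_least_one(group, frac, seed=0):
    """Sample ceil(frac * size) rows, so every non-empty group keeps a row."""
    n = max(1, math.ceil(frac * len(group)))
    return group.sample(n=n, random_state=seed)


# Assumption: keeping at least one row per group is acceptable for the use case.
parts = [sample_at_least_one(g, 0.25) for _, g in df.groupby("Y")]
kept = pd.concat(parts)
print(kept["Y"].value_counts())
```

Note that forcing one row per group over-represents the small groups, so the result is no longer strictly proportional.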


Source: https://habr.com/ru/post/1268104/

