Sampling a dataframe based on a given class-label distribution

How can I sample a pandas or GraphLab dataframe based on a given distribution of class labels? For example, I want to sample a dataframe that has a class-label column so that rows are selected with every class label picked equally often, i.e. the sampled frame has roughly the same frequency for each class label, corresponding to a uniform distribution over class labels. Better still, I would like to sample according to any class distribution I specify.

 +------+-------+-------+
 | col1 | clol2 | class |
 +------+-------+-------+
 |  4   |  45   |   A   |
 |  5   |  66   |   B   |
 |  5   |  6    |   C   |
 |  4   |  6    |   C   |
 |  321 |  1    |   A   |
 |  32  |  432  |   B   |
 |  5   |  3    |   B   |
 +------+-------+-------+

Given a huge dataframe like the one above and a required frequency distribution like the one below:
 +-------+--------------+
 | class | nostoextract |
 +-------+--------------+
 |   A   |      2       |
 |   B   |      2       |
 |   C   |      2       |
 +-------+--------------+


I want to extract rows from the first dataframe according to the frequency distribution given in the second frame, where the counts are given in the nostoextract column, producing a sampled frame in which each class appears at most 2 times. Classes that do not have enough rows to meet the required count should not cause a failure; extraction should simply continue with whatever is available. The resulting dataframe will be used to train a decision-tree-based classifier.

As clarified in the comments: the extracted dataframe should contain nostoextract distinct instances of the corresponding class, and if there are not enough examples of a class, you simply take all that are available.
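For context, here is a minimal sketch of how such a balanced sample might then feed a decision-tree classifier. It assumes scikit-learn, which the question does not mention, and a hypothetical `sampled` dataframe with the column names used above:

 # Minimal sketch, assuming scikit-learn (an assumption, not part of the question)
 # and that `sampled` is the balanced frame produced by one of the approaches below.
 import pandas as pd
 from sklearn.tree import DecisionTreeClassifier

 sampled = pd.DataFrame({'col1': [4, 321, 5, 32, 5, 4],
                         'clol2': [45, 1, 66, 432, 6, 6],
                         'class': ['A', 'A', 'B', 'B', 'C', 'C']})

 X = sampled[['col1', 'clol2']]   # feature columns
 y = sampled['class']             # class labels
 clf = DecisionTreeClassifier().fit(X, y)
 print(clf.predict(X.iloc[[0]]))  # predict the class of the first row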

+5

3 answers

Can you split the first dataframe into per-class subframes and then sample from each as you wish?

i.e.

 dfa = df[df['class'] == 'A']
 dfb = df[df['class'] == 'B']
 dfc = df[df['class'] == 'C']
 ....

Then, once you have split/created/filtered dfa, dfb, dfc, take the desired number of rows from the top (assuming the data has no particular sorting pattern):

  dfasamplefive = dfa[:5] 

Or use the sample method, as mentioned in the comments, to take a random sample directly:

 dfasamplefive = dfa.sample(n=5) 

If this suits your needs, all that remains is to automate the process by feeding in the number of rows to take for each class from your control dataframe, i.e. the second dataframe that holds the required counts.
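A minimal sketch of that automation, assuming the question's dataframes are named df and freq with columns class and nostoextract (the helper function itself is illustrative, not part of the answer):

 import pandas as pd

 def sample_by_freq(df, freq):
     """Take at most `nostoextract` random rows per class from `df`."""
     pieces = []
     for _, row in freq.iterrows():
         subset = df[df['class'] == row['class']]
         n = min(len(subset), int(row['nostoextract']))  # take all if too few
         pieces.append(subset.sample(n))
     return pd.concat(pieces)

 # e.g. sampled = sample_by_freq(df, freq)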

+4

I think this will solve your problem:

 import pandas as pd

 data = pd.DataFrame({'cols1': [4, 5, 5, 4, 321, 32, 5],
                      'clol2': [45, 66, 6, 6, 1, 432, 3],
                      'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

 freq = pd.DataFrame({'class': ['A', 'B', 'C'],
                      'nostoextract': [2, 2, 2]})

 def bootstrap(data, freq):
     freq = freq.set_index('class')

     # This function will be applied on each group of instances of the same
     # class in `data`.
     def sampleClass(classgroup):
         cls = classgroup['class'].iloc[0]
         nDesired = freq.nostoextract[cls]
         nRows = len(classgroup)

         nSamples = min(nRows, nDesired)
         return classgroup.sample(nSamples)

     samples = data.groupby('class').apply(sampleClass)

     # If you want a new index with ascending values:
     # samples.index = range(len(samples))

     # If you want an index which is equal to the row in `data` where the
     # sample came from:
     samples.index = samples.index.get_level_values(1)

     # If you don't change it, you'll have a multiindex with level 0 being the
     # class and level 1 being the row in `data` where the sample came from.

     return samples

 print(bootstrap(data, freq))

Output:

   class  clol2  cols1
 0     A     45      4
 4     A      1    321
 1     B     66      5
 5     B    432     32
 3     C      6      4
 2     C      6      5

If you do not want the result to be sorted by class, you can shuffle it at the end.
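For example, the shuffle could be done with a standard pandas idiom (not part of the original answer):

 samples = samples.sample(frac=1)  # randomly reorder all rows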

+3

Here is a solution for GraphLab SFrames. It is not exactly what you want, because it samples points probabilistically, so the result does not necessarily have exactly the number of rows you specified. An exact method would probably shuffle the data randomly and then take the first k rows for each class, but this gets you reasonably close.

 import random
 import graphlab as gl

 ## Construct data.
 sf = gl.SFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                 'col2': [45, 66, 6, 6, 1, 432, 3],
                 'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

 freq = gl.SFrame({'class': ['A', 'B', 'C'],
                   'number': [3, 1, 0]})

 ## Count how many instances of each class and compute a sampling
 ## probability.
 grp = sf.groupby('class', gl.aggregate.COUNT)
 freq = freq.join(grp, on='class', how='left')
 freq['prob'] = freq.apply(lambda x: float(x['number']) / x['Count'])

 ## Join the sampling probability back to the original data.
 sf = sf.join(freq[['class', 'prob']], on='class', how='left')

 ## Sample the original data, then subset.
 sf['sample_mask'] = sf.apply(lambda x: 1 if random.random() <= x['prob'] else 0)
 sf2 = sf[sf['sample_mask'] == 1]

In my run, I managed to get the exact number of samples that I specified, but again, this is not guaranteed by this solution.

 >>> sf2
 +-------+------+------+
 | class | col1 | col2 |
 +-------+------+------+
 |   A   |  4   |  45  |
 |   A   | 321  |  1   |
 |   B   |  32  | 432  |
 +-------+------+------+
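The exact shuffle-then-take-k idea mentioned above could look roughly like the sketch below. Since SFrame shuffling calls vary across GraphLab versions, the sketch uses pandas instead (an assumption on my part; the column names and desired counts mirror this answer's example):

 import pandas as pd

 df = pd.DataFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                    'col2': [45, 66, 6, 6, 1, 432, 3],
                    'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})
 wanted = {'A': 3, 'B': 1, 'C': 0}  # desired rows per class

 shuffled = df.sample(frac=1)  # put the rows in random order
 exact = shuffled.groupby('class', group_keys=False).apply(
     lambda g: g.head(wanted[g.name]))  # first k rows of each shuffled class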
+1

Source: https://habr.com/ru/post/1233587/

