Pandas qcut does not put an equal number of observations in each bin

I have a data frame from which I can select a column (a series) as follows:

DF:

    value_rank
    275488    90
    275490    35
    275491    60
    275492    23
    275493    23
    275494    34
    275495    75
    275496    40
    275497    69
    275498    14
    275499    83
    ...      ...

value_rank is a percentile rank previously created from a larger dataset. What I'm trying to do is create quintile bins for this dataset, like so:

    pd.qcut(df.value_rank, 5, labels=False)
    275488    4
    275490    1
    275491    3
    275492    1
    275493    1
    275494    1
    275495    3
    275496    2
    ...      ...

This looks fine, just as expected, but it isn't.

I actually have 1,569 rows. The nearest multiple of 5 is 1,565, which would give 1565 / 5 = 313 observations in each bin. Since there are 4 extra entries, I expect to have 4 bins with 314 observations and one with 313. Instead, I get the following:

    obs = pd.qcut(df.value_rank, 5, labels=False)
    obs.value_counts()
    0    329
    3    314
    1    313
    4    311
    2    302

There are no NaNs in df, and I can't think of any reason why this is happening. I've literally started tearing my hair out!
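For reference, here is a minimal sanity check (assuming the same qcut call as above) showing that on 1,569 distinct, tie-free values qcut does produce the expected counts, so the arithmetic itself is fine:

    import pandas as pd

    # 1,569 distinct values, no ties: qcut yields four bins of 314
    # and one of 313, exactly as expected.
    s = pd.Series(range(1569))
    print(pd.qcut(s, 5, labels=False).value_counts().sort_index())
    # 0    314
    # 1    314
    # 2    313
    # 3    314
    # 4    314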

Here is a small example:

DF:

    value_rank
    286742    11
    286835    53
    286865    40
    286930    31
    286936    45
    286955    27
    287031    30
    287111    36
    287269    30
    287310    18

pd.qcut gives the following:

    pd.qcut(df.value_rank, 5, labels=False).value_counts()

    bin  count
    1    3
    4    2
    3    2
    0    2
    2    1

Each bin should have 2 observations, not 3 in bin 1 and 1 in bin 2!

3 Answers

qcut tries to compensate for repeated values. This is easier to visualize if you return the bin limits along with your qcut results:

    In [42]: test_list = [11, 18, 27, 30, 30, 31, 36, 40, 45, 53]

    In [43]: test_series = pd.Series(test_list, name='value_rank')

    In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
    Out[49]:
    (array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
     array([ 11. ,  25.2,  30. ,  33. ,  41. ,  53. ]))

You can see that there was no choice but to set a bin limit at 30, so qcut had to "steal" one of the values expected in the third bin and place it in the second. I think this is happening on a larger scale with your percentiles, since you've essentially collapsed your series onto a 1-to-100 scale. Any reason not to just run qcut directly on the data rather than on the percentiles, or to return percentiles with more precision?
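To make this concrete, here is a minimal sketch (assuming numpy's default linear interpolation) showing that qcut's edges are simply the 0/20/40/60/80/100th percentiles, and that the tie at 30 lands exactly on the 40th-percentile edge:

    import numpy as np
    import pandas as pd

    test_list = [11, 18, 27, 30, 30, 31, 36, 40, 45, 53]

    # qcut places its edges at evenly spaced percentiles; the tie at 30
    # sits exactly on the 40th-percentile edge.
    print(np.percentile(test_list, [0, 20, 40, 60, 80, 100]))
    # [11.   25.2  30.   33.   41.   53. ]  -- matches qcut's retbins

    # Both 30s fall into the interval ending at 30, so bin 1 gets three
    # members and bin 2 only one.
    print(pd.qcut(pd.Series(test_list), 5, labels=False).value_counts().sort_index())
    # 0    2
    # 1    3
    # 2    1
    # 3    2
    # 4    2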


If you must have equal (or nearly equal) bins, here is a trick you can use with qcut. Using the same data as the accepted answer, we can force the values into equal bins by adding some random noise to the original test_list and binning according to those noisy values.

    import numpy as np
    import pandas as pd

    test_list = [11, 18, 27, 30, 30, 31, 36, 40, 45, 53]

    np.random.seed(42)  # set this for reproducible results
    test_list_rnd = np.array(test_list) + np.random.random(len(test_list))  # add noise to data
    test_series = pd.Series(test_list_rnd, name='value_rank')
    pd.qcut(test_series, 5, retbins=True, labels=False)

Output:

    (0    0
     1    0
     2    1
     3    2
     4    1
     5    2
     6    3
     7    3
     8    4
     9    4
     Name: value_rank, dtype: int64,
     array([ 11.37454012,  25.97573801,  30.42160255,  33.11683016,
             41.81316392,  53.70807258]))

So now we have two 0s, two 1s, two 2s, two 3s, and two 4s!

Disclaimer

Obviously, use this at your discretion, because results will vary depending on your data; for instance, on how big your dataset is and/or how the values are spaced. The above "trick" works well for integers because even though we are "salting" test_list, it will still rank-order correctly, in the sense that a value in group 0 will never exceed a value in group 1 (possibly equal, but never greater). If, however, you have floats, this can be tricky, and you may have to reduce the size of your noise accordingly. For example, if you had floats such as 2.1, 5.3, 5.3, 5.4, etc., you should reduce the noise by dividing it by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, though, you probably wouldn't have this problem in the first place, given the noise already present in "real" data.
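One way to choose a safe noise scale is to tie it to the smallest gap between distinct values. Here is a minimal sketch; jitter_for is a hypothetical helper (not from the original answer) written under that assumption:

    import numpy as np
    import pandas as pd

    # Hypothetical helper: scale the jitter to a fraction of the smallest
    # gap between distinct values, so ties break without reordering data.
    def jitter_for(values, frac=0.1, seed=42):
        arr = np.asarray(values, dtype=float)
        gaps = np.diff(np.unique(arr))
        scale = gaps.min() * frac if len(gaps) else 1.0
        rng = np.random.default_rng(seed)
        return arr + rng.random(len(arr)) * scale

    float_list = [2.1, 5.3, 5.3, 5.4, 7.8, 7.8, 9.0, 9.1]
    binned = pd.qcut(pd.Series(jitter_for(float_list)), 4, labels=False)
    print(binned.value_counts().sort_index())  # two observations per bin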


Just try the following code:

    pd.qcut(df.rank(method='first'), nbins)
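Applied to the test data from the accepted answer, this gives exactly equal bins: rank(method='first') breaks ties by order of appearance, so every rank is unique and qcut gets clean quantile edges. A quick sketch:

    import pandas as pd

    test_series = pd.Series([11, 18, 27, 30, 30, 31, 36, 40, 45, 53],
                            name='value_rank')

    # method='first' assigns distinct ranks 1..10 even to the tied 30s,
    # so qcut can cut into exactly equal-sized bins.
    binned = pd.qcut(test_series.rank(method='first'), 5, labels=False)
    print(binned.value_counts().sort_index())
    # 0    2
    # 1    2
    # 2    2
    # 3    2
    # 4    2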
