Why does pandas.cut() behave differently when counting unique values in two similar cases?

In the first case, I use a very simple DataFrame and pandas.cut() to count the number of unique values in one column within ranges of another column. The code (Case 1 below) works as expected.

However, in the following code (Case 2 below), pandas.cut() counts the number of unique values incorrectly. I expect the first bin (1462320000, 1462406400] to have 5 unique values, and the other bins, including the last bin (1462752000, 1462838400], to have 0 unique values.

Instead, the code returns 5 unique values in the last bin (1462752000, 1462838400], even though the 2 highlighted values (the rows with time 1462838401 in the Case 2 data) should not be counted because they are out of range.

So can anyone explain why pandas.cut() behaves so differently in these two cases? In addition, I would be very grateful if you could tell me how to fix the code so that it correctly counts the number of unique values in one column within ranges of another column.


MORE INFO: (import pandas and numpy to run the code; my pandas version is 0.19.2 and I am using Python 2.7)

For completeness, here are my DataFrame and the code needed to reproduce the problem:

Case 1:

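    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'No': [1, 1.5, 2, 1, 3, 5, 10],
                       'useragent': ['a', 'c', 'b', 'c', 'b', 'a', 'z']})
    print(type(df))
    print(df)
    # Count distinct user agents in each bin (0, 1], (1, 2], (2, 3] of 'No'
    df.groupby(pd.cut(df['No'], bins=np.arange(0, 4, 1))).useragent.nunique()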

Case 2:

    # df holds the Case 2 data listed below (columns 'time' and 'hash')
    print(type(df))
    print(len(df))
    print(df.time.nunique())
    print(df.hash.nunique())
    print(df[['time', 'hash']])
    df.groupby(pd.cut(df['time'], bins=np.arange(1462320000, 1462924800, 86400))).hash.nunique()

Case 2 data:

    time        hash
    1462328401  qo
    1462328401  qQ
    1462838401  q1
    1462328401  q1
    1462328401  qU
    1462328401  qU
    1462328401  qU
    1462328401  qU
    1462328401  qX
    1462838401  qX
1 answer

This seems to be a bug.

Here is a simpler example:

    In [50]: df=pd.DataFrame({'atime': [28]*8+[38]*2, 'hash':randint(0,3,10)} ).sort_values('hash')
    Out[50]:
       atime  hash
    1     28     0
    3     28     0
    4     28     0
    5     28     0
    8     38     0
    2     28     1
    6     28     1
    0     28     2
    7     28     2
    9     38     2

    In [50bis;)]: df.groupby(pd.cut(df.atime,bins=arange(27,40,2))).hash.unique()
    Out[50bis]:
    atime
    (27, 29]    [0, 1, 2]   # ok
    (29, 31]           []
    (31, 33]           []
    (33, 35]           []
    (35, 37]           []
    (37, 39]       [0, 2]
    Name: hash, dtype: object

    In [51]: df.groupby(pd.cut(df.atime,bins=arange(27,40,2))).hash.nunique()
    Out[51]:
    atime
    (27, 29]    2   # bug
    (29, 31]    0
    (31, 33]    0
    (33, 35]    0
    (35, 37]    0
    (37, 39]    2
    Name: hash, dtype: int64

An effective workaround seems to be converting the result of the cut to a list:

    In [52]: df.groupby(pd.cut(df.atime,bins=arange(27,40,2)).tolist() ).hash.nunique()
    Out[52]:
    atime
    (27, 29]    3
    (37, 39]    2
    Name: hash, dtype: int64
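
To cross-check what the counts should be for the Case 2 data in the question, independently of pandas, here is a minimal pure-Python sketch. It assumes the right-closed bins (lo, hi] that pd.cut produces by default and ignores values outside the outermost edges. The helper name nunique_per_bin is hypothetical and not part of pandas.

```python
from collections import OrderedDict


def nunique_per_bin(rows, edges):
    """Count distinct labels per right-closed bin (lo, hi].

    rows  -- iterable of (value, label) pairs
    edges -- sorted list of bin edges, like the bins= argument of pd.cut
    Values outside (edges[0], edges[-1]] fall in no bin and are ignored.
    """
    bins = OrderedDict(((lo, hi), set()) for lo, hi in zip(edges[:-1], edges[1:]))
    for value, label in rows:
        for (lo, hi), seen in bins.items():
            if lo < value <= hi:
                seen.add(label)
                break
    return OrderedDict((interval, len(seen)) for interval, seen in bins.items())


# Case 2 data from the question, as (time, hash) pairs
rows = [
    (1462328401, 'qo'), (1462328401, 'qQ'), (1462838401, 'q1'),
    (1462328401, 'q1'), (1462328401, 'qU'), (1462328401, 'qU'),
    (1462328401, 'qU'), (1462328401, 'qU'), (1462328401, 'qX'),
    (1462838401, 'qX'),
]
# Same edges as np.arange(1462320000, 1462924800, 86400)
edges = list(range(1462320000, 1462924800, 86400))

counts = nunique_per_bin(rows, edges)
for (lo, hi), n in counts.items():
    print("(%d, %d]  %d" % (lo, hi, n))
```

With this data the first bin gets 5 distinct hashes and every other bin gets 0, because 1462838401 lies just above the last edge 1462838400. This matches the expectation stated in the question and can be compared against whatever the groupby returns.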

Source: https://habr.com/ru/post/1015125/