Why does pandas.cut() behave differently in a unique count in two similar cases?

Question

In the first case, I use pandas.cut() on a very simple DataFrame to count the number of unique values in one column within ranges of another column. The code works as expected:

However, in the following code, pandas.cut() seems to count the number of unique values incorrectly. I expect the first bin (1462320000, 1462406400] to have 5 unique values, and the other bins, including the last bin (1462752000, 1462838400], to have 0 unique values.

Instead, as shown in the result, the code returns 5 unique values in the last bin (1462752000, 1462838400], even though the 2 highlighted values (the rows with time 1462838401) should not be counted because they are out of range.

So can anyone explain why pandas.cut() behaves so differently in these two cases? In addition, I would be very grateful if you could tell me how to correct the code so that it correctly counts the number of unique values in one column within ranges of another column.

MORE INFO: Please import pandas and numpy to run the code. My pandas version is 0.19.2, and I am using Python 2.7.

For completeness, here are my DataFrame and the code needed to reproduce the issue:

Case 1:

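    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'No': [1, 1.5, 2, 1, 3, 5, 10], 'useragent': ['a', 'c', 'b', 'c', 'b', 'a', 'z']})
    print(type(df))
    print(df)
    # Count distinct user agents per bin of 'No': (0, 1], (1, 2], (2, 3]
    df.groupby(pd.cut(df['No'], bins=np.arange(0, 4, 1))).useragent.nunique()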

Case 2:

    # df holds the Case 2 data listed below
    print(type(df))
    print(len(df))
    print(df.time.nunique())
    print(df.hash.nunique())
    print(df[['time', 'hash']])
    # Count distinct hashes per daily bin of 'time'
    df.groupby(pd.cut(df['time'], bins=np.arange(1462320000, 1462924800, 86400))).hash.nunique()

Case 2 data:

    time        hash
    1462328401  qo
    1462328401  qQ
    1462838401  q1
    1462328401  q1
    1462328401  qU
    1462328401  qU
    1462328401  qU
    1462328401  qU
    1462328401  qX
    1462838401  qX
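The Case 2 snippet assumes `df` already exists. As a minimal sketch, the listing above can be turned into a DataFrame, and the expected per-bin counts can be checked without `groupby` at all. The names `edges` and `expected` are illustrative and do not appear in the original question.

```python
import numpy as np
import pandas as pd

# Rows copied from the "Case 2 data" listing.
df = pd.DataFrame({
    'time': [1462328401, 1462328401, 1462838401, 1462328401, 1462328401,
             1462328401, 1462328401, 1462328401, 1462328401, 1462838401],
    'hash': ['qo', 'qQ', 'q1', 'q1', 'qU', 'qU', 'qU', 'qU', 'qX', 'qX'],
})

# Same bin edges as in the question: 7 edges, hence 6 daily bins.
edges = np.arange(1462320000, 1462924800, 86400)

# Reference count: for each (left, right] bin, count the distinct hashes
# among the rows whose time falls inside it.
expected = {}
for left, right in zip(edges[:-1], edges[1:]):
    in_bin = df[(df['time'] > left) & (df['time'] <= right)]
    expected[(int(left), int(right))] = len(set(in_bin['hash']))

print(expected)
```

This manual count gives 5 for the first bin and 0 for every other bin, which is the result the question expects. The two rows with time 1462838401 fall just past the last edge, 1462838400, so they land in no bin.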

python pandas dataframe

weefwefwqg3 Feb 20 '17 at 14:29

1 answer

BM · Accepted Answer · 2017-02-20T15:53:01+0000

This seems to be a bug: for the same groups, nunique() disagrees with unique().

On a simple example (randint and arange are used unqualified here, presumably NumPy's from a pylab-style session):

    In [50]: df=pd.DataFrame({'atime': [28]*8+[38]*2, 'hash':randint(0,3,10)} ).sort_values('hash')
    Out[50]:
       atime  hash
    1     28     0
    3     28     0
    4     28     0
    5     28     0
    8     38     0
    2     28     1
    6     28     1
    0     28     2
    7     28     2
    9     38     2

    In [50bis]: df.groupby(pd.cut(df.atime,bins=arange(27,40,2))).hash.unique()
    Out[50bis]:
    atime
    (27, 29]    [0, 1, 2]   # ok
    (29, 31]           []
    (31, 33]           []
    (33, 35]           []
    (35, 37]           []
    (37, 39]       [0, 2]
    Name: hash, dtype: object

    In [51]: df.groupby(pd.cut(df.atime,bins=arange(27,40,2))).hash.nunique()
    Out[51]:
    atime
    (27, 29]    2   # bug
    (29, 31]    0
    (31, 33]    0
    (33, 35]    0
    (35, 37]    0
    (37, 39]    2
    Name: hash, dtype: int64

An effective workaround seems to be converting the result of the cut to a list:

    In [52]: df.groupby(pd.cut(df.atime,bins=arange(27,40,2)).tolist() ).hash.nunique()
    Out[52]:
    atime
    (27, 29]    3
    (37, 39]    2
    Name: hash, dtype: int64
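Another way to sidestep the categorical grouper, sketched here as an alternative rather than as the answer's own method, is to ask `pd.cut` for integer bin positions with `labels=False` and group on those. The data below is a fixed copy of the example above (the original `hash` column came from `randint`), and the names `edges`, `codes`, `in_range` and `counts` are illustrative.

```python
import numpy as np
import pandas as pd

# Deterministic copy of the answer's example data, in original index order.
df = pd.DataFrame({
    'atime': [28] * 8 + [38] * 2,
    'hash': [2, 0, 1, 0, 0, 0, 1, 2, 0, 2],
})
edges = np.arange(27, 40, 2)  # 27, 29, ..., 39 -> 6 bins

# labels=False returns the integer position of each row's bin,
# or NaN when the value falls outside every bin.
codes = pd.cut(df['atime'], bins=edges, labels=False)
in_range = codes.notna()

# Group by plain integer codes, so no categorical grouper is involved.
counts = (df.loc[in_range, 'hash']
            .groupby(codes[in_range].astype(int))
            .nunique())

# Put the empty bins back with a count of 0.
counts = counts.reindex(range(len(edges) - 1), fill_value=0)
print(counts.tolist())
```

For this data the counts are `[3, 0, 0, 0, 0, 2]`, which agrees with the `unique()` output in `Out[50bis]`. Position `i` corresponds to the bin `(edges[i], edges[i + 1]]`, and the `notna()` filter drops out-of-range rows before grouping, so values such as the question's 1462838401 are not counted.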
