Pandas groupby hour of the day into a dictionary

I have a pandas data frame as follows:

 date                | Item  | count
 ------------------------------------
 2016-12-06 10:45:08 | Item1 | 60
 2016-12-06 10:45:08 | Item2 | 145
 2016-12-06 09:45:00 | Item1 | 60
 2016-12-06 09:44:54 | Item3 | 600
 2016-12-06 09:44:48 | Item4 | 15
 2016-12-06 11:45:08 | Item1 | 60
 2016-12-06 10:45:08 | Item2 | 14
 2016-11-06 09:45:00 | Item1 | 62
 2016-11-06 09:44:54 | Item3 | 6
 2016-11-06 09:44:48 | Item4 | 15

I am trying to group items by, say, hour of the day (or, later, by whole day) in order to find the following statistics: a list of the items sold in each period, for example:

  • On 2016-12-06, from 09:00:00 to 10:00:00, Item1, Item3 and Item4 were sold; etc.
  • On 2016-12-06, Item1, Item2, Item3, Item4 (unique items) were sold.

I am still far from collecting these statistics, because I am stuck on grouping by time. Initially, print df.dtypes showed:

 date     object
 Item     object
 count     int64
 dtype: object

So I used the following line of code to convert the date column to a pandas datetime type.

 df['date'] = pd.to_datetime(df['date']) 

and now print df.dtypes gives:

 date     datetime64[ns]
 Item             object
 count             int64
 dtype: object

However, when I try to group the date column using TimeGrouper by executing the following lines of code

 from pandas.tseries.resample import TimeGrouper
 print df.groupby([df['date'],pd.TimeGrouper(freq='Min')])

I get the following TypeError. As suggested here or here, converting the column with pd.to_datetime was supposed to solve this problem.

 TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex' 

I do not know how to solve this problem so that I can get to the statistics I am looking for. Any advice on resolving this error and using TimeGrouper to compute the statistics, preferably in a dictionary format (or anything that makes more sense), will be greatly appreciated.

2 answers

You can group by a NumPy array of the datetimes with the minutes and seconds removed (truncated to the hour):

 print (df['date'].values.astype('<M8[h]'))
 ['2016-12-06T10' '2016-12-06T10' '2016-12-06T09' '2016-12-06T09'
  '2016-12-06T09' '2016-12-06T11' '2016-12-06T10' '2016-11-06T09'
  '2016-11-06T09' '2016-11-06T09']

 print (df.groupby(df['date'].values.astype('<M8[h]')).Item.unique())
 2016-11-06 09:00:00    [Item1, Item3, Item4]
 2016-12-06 09:00:00    [Item1, Item3, Item4]
 2016-12-06 10:00:00           [Item1, Item2]
 2016-12-06 11:00:00                  [Item1]
 Name: Item, dtype: object

 print (df.groupby(df['date'].values.astype('<M8[h]')).Item
          .apply(lambda x: x.unique().tolist()).to_dict())
 {Timestamp('2016-11-06 09:00:00'): ['Item1', 'Item3', 'Item4'],
  Timestamp('2016-12-06 09:00:00'): ['Item1', 'Item3', 'Item4'],
  Timestamp('2016-12-06 10:00:00'): ['Item1', 'Item2'],
  Timestamp('2016-12-06 11:00:00'): ['Item1']}

 print (df.groupby(df['date'].values.astype('<M8[D]')).Item
          .apply(lambda x: x.unique().tolist()).to_dict())
 {Timestamp('2016-11-06 00:00:00'): ['Item1', 'Item3', 'Item4'],
  Timestamp('2016-12-06 00:00:00'): ['Item1', 'Item2', 'Item3', 'Item4']}

Thanks to Jeff for suggesting round. Note that round snaps to the nearest hour, so 09:45 is assigned to 10:00 rather than 09:00:

 print (df.groupby(df['date'].dt.round('h')).Item
          .apply(lambda x: x.unique().tolist()).to_dict())
 {Timestamp('2016-11-06 10:00:00'): ['Item1', 'Item3', 'Item4'],
  Timestamp('2016-12-06 12:00:00'): ['Item1'],
  Timestamp('2016-12-06 10:00:00'): ['Item1', 'Item3', 'Item4'],
  Timestamp('2016-12-06 11:00:00'): ['Item1', 'Item2']}

 print (df.groupby(df['date'].dt.round('d')).Item
          .apply(lambda x: x.unique().tolist()).to_dict())
 {Timestamp('2016-11-06 00:00:00'): ['Item1', 'Item3', 'Item4'],
  Timestamp('2016-12-06 00:00:00'): ['Item1', 'Item2', 'Item3', 'Item4']}
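
If the buckets should start on the hour (09:00:00 to 10:00:00, as in the question), dt.floor is probably a closer fit than dt.round, because it truncates instead of rounding to the nearest hour. The following is only a minimal sketch: it rebuilds the question's data frame by hand, and the names hourly and daily, as well as the sorting of each item list, are illustrative choices rather than part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question's data frame.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-12-06 10:45:08", "2016-12-06 10:45:08", "2016-12-06 09:45:00",
        "2016-12-06 09:44:54", "2016-12-06 09:44:48", "2016-12-06 11:45:08",
        "2016-12-06 10:45:08", "2016-11-06 09:45:00", "2016-11-06 09:44:54",
        "2016-11-06 09:44:48",
    ]),
    "Item": ["Item1", "Item2", "Item1", "Item3", "Item4",
             "Item1", "Item2", "Item1", "Item3", "Item4"],
    "count": [60, 145, 60, 600, 15, 60, 14, 62, 6, 15],
})

# floor("h") truncates each timestamp to the start of its hour,
# so 09:45:00 stays in the 09:00:00 bucket.
hourly = (
    df.groupby(df["date"].dt.floor("h"))["Item"]
      .apply(lambda s: sorted(s.unique()))
      .to_dict()
)

# floor("D") truncates to midnight, giving one bucket per calendar day.
daily = (
    df.groupby(df["date"].dt.floor("D"))["Item"]
      .apply(lambda s: sorted(s.unique()))
      .to_dict()
)

print(hourly)
print(daily)
```

Both dictionaries are keyed by Timestamp, as in the outputs above; only buckets that actually contain rows appear.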

Alternatively, set the date column as the index and resample by hour:

 sold = df.set_index('date').Item.resample('H').agg({'Sold': 'unique'})
 sold[sold.Sold.str.len() > 0]

                                       Sold
 date
 2016-11-06 09:00:00  [Item4, Item3, Item1]
 2016-12-06 09:00:00  [Item4, Item3, Item1]
 2016-12-06 10:00:00         [Item1, Item2]
 2016-12-06 11:00:00                [Item1]
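
The TypeError in the question comes from the grouper having no key, so it tries to bin the frame's RangeIndex instead of the date column. One way around it without changing the index is pd.Grouper with key='date' (pd.TimeGrouper has since been removed from pandas in favour of pd.Grouper). The sketch below rebuilds the question's data by hand; summing the count column per hour and the names per_hour and totals are illustrative assumptions, not something the original answers show.

```python
import pandas as pd

# Sample rows copied from the question's data frame.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-12-06 10:45:08", "2016-12-06 10:45:08", "2016-12-06 09:45:00",
        "2016-12-06 09:44:54", "2016-12-06 09:44:48", "2016-12-06 11:45:08",
        "2016-12-06 10:45:08", "2016-11-06 09:45:00", "2016-11-06 09:44:54",
        "2016-11-06 09:44:48",
    ]),
    "Item": ["Item1", "Item2", "Item1", "Item3", "Item4",
             "Item1", "Item2", "Item1", "Item3", "Item4"],
    "count": [60, 145, 60, 600, 15, 60, 14, 62, 6, 15],
})

# key="date" makes the grouper bin on that column rather than on the index,
# which is what avoids the "Only valid with DatetimeIndex" TypeError.
per_hour = df.groupby(pd.Grouper(key="date", freq="h"))["count"].sum()

# The grouper emits every hourly bin between the first and last timestamp.
# Empty bins sum to 0, so this filter keeps only hours with sales
# (assuming, as in the sample data, that every count is positive).
per_hour = per_hour[per_hour > 0]

# Plain dictionary with string keys, e.g. for printing or JSON output.
totals = {str(ts): int(n) for ts, n in per_hour.items()}
print(totals)
```

Unlike grouping by a truncated column, a frequency-based grouper produces the empty hours in between as well, which is why both this sketch and the resample answer filter them out afterwards.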

Source: https://habr.com/ru/post/1263719/

