How to get the mode for a string variable when resampling with pandas

Question


I am trying to resample a pandas DataFrame with a timestamp index to hourly frequency. I'm interested in getting the most common value for a column with string values. However, the built-in time-series resampling functions do not include the mode among their default methods (unlike, for example, 'mean' and 'count'). I tried to define my own function and pass it in, but that does not work. I also tried the np.bincount function, but that does not work either, since I am working with strings.

Here's what my data looks like:

                         station_arrived action     lat1      lon1
    date_removed
    2012-01-01 13:12:00               56      A  19.4171 -99.16561
    2012-01-01 13:12:00               56      A  19.4271 -99.16361
    2012-01-01 15:41:00               56      A  19.4171 -99.16561
    2012-01-02 08:41:00               56      C  19.4271 -99.16561
    2012-01-02 11:36:00               56      C  19.2171 -99.16561

This is my code:

    def mode1(algo):
        common = [ite for ite, it in Counter(algo).most_common(1)]  # Returns all unique items and their counts
        return common

    hourlycount2 = travels2012.resample('H', how={'station_arrived': 'count',
                                                  'action': mode(travels2012['action']),
                                                  'lat1': 'count',
                                                  'lon1': 'count'})
    hourlycount2.head()

I see the following error:

    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
      File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\generic.py", line 2836, in resample
        return sampler.resample(self).__finalize__(self)
      File "C:\Program Files\Anaconda\lib\site-packages\pandas\tseries\resample.py", line 83, in resample
        rs = self._resample_timestamps()
      File "C:\Program Files\Anaconda\lib\site-packages\pandas\tseries\resample.py", line 277, in _resample_timestamps
        result = grouped.aggregate(self._agg_method)
      File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2404, in aggregate
        result[col] = colg.aggregate(agg_how)
      File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2076, in aggregate
        ret = self._aggregate_multiple_funcs(func_or_funcs)
      File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2125, in _aggregate_multiple_funcs
        results[name] = self.aggregate(func)
      File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2073, in aggregate
        return getattr(self, func_or_funcs)(*args, **kwargs)
      File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 486, in __getattr__
        (type(self).__name__, attr))
    AttributeError: 'SeriesGroupBy' object has no attribute 'A '


python pandas time-series

asado23 Oct 2 '14 at 21:57


1 answer

Andy hayden · Accepted Answer · 2014-10-02T22:47:52+0000

The values in the dict must be either strings naming functions (e.g. 'count' / 'sum' / 'max') or functions that are applied to each group. What you are passing is the result (value) of mode(travels2012['action']).

So, you need to make this a function that applies to each group:

    In [11]: df.resample('H', how={'station_arrived': 'count',
                                   'action': lambda x: mode(df['action']),
                                   'lat1': 'count',
                                   'lon1': 'count'})
    Out[11]:
                        action  station_arrived  lon1  lat1
    date_removed
    2012-01-01 13:00:00    [A]                2     2     2
    2012-01-01 14:00:00    [A]                0     0     0
    2012-01-01 15:00:00    [A]                1     1     1
    2012-01-01 16:00:00    [A]                0     0     0
    ...

I'm not sure this is what you want (it computes the mode of the whole column for every group); maybe you want to take the mode of each group instead:

    In [12]: df.resample('H', how={'station_arrived': 'count',
                                   'action': mode,
                                   'lat1': 'count',
                                   'lon1': 'count'})
    Out[12]:
                        action  station_arrived  lon1  lat1
    date_removed
    2012-01-01 13:00:00    [A]                2     2     2
    2012-01-01 14:00:00     []                0     0     0
    2012-01-01 15:00:00    [A]                1     1     1
    2012-01-01 16:00:00     []                0     0     0
    ...

I would rather see the actual value (A) than a one-element list, and NaN instead of [].
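One way to get that from the Counter approach in the question is to return the most common item itself rather than a list containing it. The following is only a minimal sketch; mode_scalar is a hypothetical name chosen for illustration and is not part of pandas:

```python
from collections import Counter


def mode_scalar(values):
    """Return the most common item in values, or NaN if values is empty."""
    # most_common(1) gives a list with at most one (item, count) pair.
    top = Counter(values).most_common(1)
    # Unpack the item itself instead of returning it wrapped in a list.
    return top[0][0] if top else float("nan")


print(mode_scalar(["A", "A", "C"]))  # A
print(mode_scalar([]))               # nan
```

A function like this could be supplied as the value for 'action' in the dict, in place of mode.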

I think it's worth mentioning the Series mode method, with the caveat that it always returns a Series (since there may be a tie) and that the result is empty if no value occurs more than once.
You can wrap it as follows (and you can wrap the mode function in the same way):

    def mode_(s):
        try:
            return s.mode()[0]
        except IndexError:
            return np.nan

    In [22]: df.resample('H', how={'station_arrived': 'count',
                                   'action': mode_,
                                   'lat1': 'count',
                                   'lon1': 'count'})
    Out[22]:
                        action  station_arrived  lon1  lat1
    date_removed
    2012-01-01 13:00:00      A                2     2     2
    2012-01-01 14:00:00    NaN                0     0     0
    2012-01-01 15:00:00    NaN                1     1     1
    2012-01-01 16:00:00    NaN                0     0     0
    ...
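Note that the how= argument used above was deprecated and later removed from resample, so on current pandas the same idea is written as resample(...).agg(...). Below is a rough sketch under a few assumptions: the frame is rebuilt by hand from the sample rows in the question, mode_or_nan is a hypothetical helper name, and the bin size is spelled "60min" only because that alias is accepted across a wide range of pandas versions. On recent pandas, Series.mode also returns a value that occurs only once, so the 15:00 bin comes out as A here rather than NaN.

```python
import numpy as np
import pandas as pd


def mode_or_nan(s):
    """Return the first mode of a Series, or NaN for an empty group."""
    modes = s.mode()
    # An empty hourly bin yields an empty result, so fall back to NaN.
    return modes.iloc[0] if len(modes) else np.nan


# Sample rows copied from the question (lat1/lon1 omitted for brevity).
idx = pd.to_datetime([
    "2012-01-01 13:12:00",
    "2012-01-01 13:12:00",
    "2012-01-01 15:41:00",
    "2012-01-02 08:41:00",
    "2012-01-02 11:36:00",
])
df = pd.DataFrame(
    {
        "station_arrived": [56, 56, 56, 56, 56],
        "action": ["A", "A", "A", "C", "C"],
    },
    index=idx,
)
df.index.name = "date_removed"

# One aggregation per column: a string name or a callable applied per bin.
hourly = df.resample("60min").agg(
    {"station_arrived": "count", "action": mode_or_nan}
)
print(hourly.head())
```

Passing the function object itself (mode_or_nan, not mode_or_nan(...)) is the point of the answer: pandas calls it once for each hourly group.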
