I'm new to pandas. I have a (trivial) problem with data on hosts, operations, and execution times. I want to group by host and compute the standard deviation of the execution time for each host, and then do the same grouped by host + operation. Seems simple?
Grouping by a single column works:
    df
    Out[360]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 132564 entries, 0 to 132563
    Data columns (total 9 columns):
    datespecial    132564 non-null values
    host           132564 non-null values
    idnum          132564 non-null values
    operation      132564 non-null values
    time           132564 non-null values
    ...
    dtypes: float32(1), int64(2), object(6)

    byhost = df.groupby('host')
    byhost.std()
    Out[362]:
                  datespecial         idnum      time
    host
    ahost1.test  11946.961952  40367.033852  0.003699
    host1.test   15484.975077  38206.578115  0.008800
    host10.test           NaN  37644.137631  0.018001
    ...
Nice. Now:
    byhostandop = df.groupby(['host', 'operation'])
    byhostandop.std()
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-364-2c2566b866c4> in <module>()
    ----> 1 byhostandop.std()

    /home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)
        386         # todo, implement at cython level?
        387         if ddof == 1:
    --> 388             return self._cython_agg_general('std')
        389         else:
        390             f = lambda x: x.std(ddof=ddof)

    /home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
       1615
       1616     def _cython_agg_general(self, how, numeric_only=True):
    -> 1617         new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
       1618         return self._wrap_agged_blocks(new_blocks)
       1619

    /home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)
       1653             values = com.ensure_float(values)
       1654
    -> 1655             result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
       1656
       1657             # see if we can cast the block back to the original dtype

    /home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)
        838             if is_numeric:
        839                 result = lib.row_bool_subset(result,
    --> 840                                              (counts > 0).view(np.uint8))
        841             else:
        842                 result = lib.row_bool_subset_object(result,

    /home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()

    ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'
Huh? Why am I getting this exception?
Other questions:
How do I calculate the standard deviation on dataframe.groupby([several columns])?
How can I limit the calculation to a selected column? Obviously, it makes no sense to compute the standard deviation of dates / timestamps here, for instance.
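To make the two questions above concrete, here is a minimal sketch of what I'm trying to end up with. The data is made up (only the column names match my frame), and the cast to float64 is just my guess at a workaround, based on the error message complaining about 'float64_t' while my frame has one float32 column. I have not confirmed that this is the actual cause.

```python
import numpy as np
import pandas as pd

# Made-up sample data; only the column names mirror the real frame.
df = pd.DataFrame({
    "host": ["host1.test", "host1.test", "host1.test", "host1.test",
             "host10.test", "host10.test"],
    "operation": ["read", "read", "write", "write", "read", "read"],
    "idnum": [1, 2, 3, 4, 5, 6],
    "time": np.array([1.0, 3.0, 2.0, 6.0, 5.0, 5.0], dtype="float32"),
})

# Assumption: widening the float32 column to float64 avoids the
# "Buffer dtype mismatch" error on older pandas versions.
df["time"] = df["time"].astype("float64")

# Group by several columns, then select only the column of interest
# so that dates, ids, etc. are left out of the calculation.
std_by_host_op = df.groupby(["host", "operation"])["time"].std()

# The same calculation grouped by host alone.
std_by_host = df.groupby("host")["time"].std()

print(std_by_host_op)
print(std_by_host)
```

Selecting `["time"]` after the `groupby` returns a Series indexed by (host, operation), which is the shape of result I'm after.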