Pandas, dataframe, groupby, std

New to pandas here. A (trivial) problem: I have hosts, operations, and runtimes. I want to group by host and calculate the standard deviation of the runtime for each host, and then do the same for each host + operation combination. Seems simple?

It works to group by one column:

    df
    Out[360]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 132564 entries, 0 to 132563
    Data columns (total 9 columns):
    datespecial    132564  non-null values
    host           132564  non-null values
    idnum          132564  non-null values
    operation      132564  non-null values
    time           132564  non-null values
    ...
    dtypes: float32(1), int64(2), object(6)

    byhost = df.groupby('host')
    byhost.std()
    Out[362]:
                  datespecial         idnum      time
    host
    ahost1.test  11946.961952  40367.033852  0.003699
    host1.test   15484.975077  38206.578115  0.008800
    host10.test           NaN  37644.137631  0.018001
    ...

Nice. Now:

    byhostandop = df.groupby(['host', 'operation'])
    byhostandop.std()
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-364-2c2566b866c4> in <module>()
    ----> 1 byhostandop.std()

    /home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)
        386         # todo, implement at cython level?
        387         if ddof == 1:
    --> 388             return self._cython_agg_general('std')
        389         else:
        390             f = lambda x: x.std(ddof=ddof)

    /home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
       1615
       1616     def _cython_agg_general(self, how, numeric_only=True):
    -> 1617         new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
       1618         return self._wrap_agged_blocks(new_blocks)
       1619

    /home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)
       1653                 values = com.ensure_float(values)
       1654
    -> 1655             result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
       1656
       1657             # see if we can cast the block back to the original dtype

    /home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)
        838                 if is_numeric:
        839                     result = lib.row_bool_subset(result,
    --> 840                                                  (counts > 0).view(np.uint8))
        841                 else:
        842                     result = lib.row_bool_subset_object(result,

    /home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()

    ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'

Huh?? Why am I getting this exception?

Other questions:

  • How do I calculate the std deviation on dataframe.groupby([several columns])?

  • How can I limit the calculation to a selected column? For instance, it obviously makes no sense to compute the std dev for dates / timestamps here.

1 answer

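It is important to know your version of Pandas / Python. It looks like this exception can occur in Pandas version 0.10 (see "ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'"). To avoid it, you can cast the float32 column to float64. Note that astype() returns a new object, so the result has to be assigned back, and that casting the whole frame would fail on the string columns. The line below assumes time is the float32 column, which is what the dtypes listing in the question suggests:

    df['time'] = df['time'].astype('float64')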
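If it is not known in advance which columns are float32, one option is to look them up by dtype and widen only those. This is a minimal sketch on a recent pandas version, not the asker's setup: the small frame and its values are made up for illustration, and select_dtypes() is not available in the very old pandas release from the traceback.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's frame: 'time' is stored as float32.
df = pd.DataFrame({
    "host": ["host1.test", "host1.test", "host10.test", "host10.test"],
    "operation": ["read", "read", "write", "write"],
    "time": np.array([0.5, 1.5, 2.0, 4.0], dtype=np.float32),
})

# Find only the float32 columns; string columns such as 'host' and
# 'operation' are not selected and therefore stay untouched.
float32_cols = df.select_dtypes(include=["float32"]).columns

# astype() returns a new frame, so the result is assigned back to df.
df = df.astype({col: "float64" for col in float32_cols})

# The two-column groupby now runs on float64 data.
time_std = df.groupby(["host", "operation"])["time"].std()
print(df.dtypes)
print(time_std)
```

Passing a column-to-dtype mapping to astype() keeps the cast explicit and leaves every other column as it was.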

To compute std() on selected columns only, just select the columns :)

    >>> import pandas as pd
    >>> df = pd.DataFrame({'a':range(10), 'b':range(10,20), 'c':list('abcdefghij'), 'g':[1]*3 + [2]*3 + [3]*4})
    >>> df
       a   b  c  g
    0  0  10  a  1
    1  1  11  b  1
    2  2  12  c  1
    3  3  13  d  2
    4  4  14  e  2
    5  5  15  f  2
    6  6  16  g  3
    7  7  17  h  3
    8  8  18  i  3
    9  9  19  j  3
    >>> df.groupby('g')[['a', 'b']].std()
              a         b
    g
    1  1.000000  1.000000
    2  1.000000  1.000000
    3  1.290994  1.290994
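The same column selection works when grouping by several columns, which covers both of the question's bullet points at once. A small sketch with made-up data (the host names, operations, and times are hypothetical, not taken from the asker's frame):

```python
import pandas as pd

# Hypothetical data shaped like the question: host, operation, runtime.
df = pd.DataFrame({
    "host": ["host1.test"] * 4 + ["host10.test"] * 2,
    "operation": ["read", "read", "write", "write", "read", "read"],
    "idnum": [1, 2, 3, 4, 5, 6],
    "time": [1.0, 3.0, 2.0, 2.0, 10.0, 14.0],
})

# Group by both keys, then select only 'time' before aggregating,
# so that 'idnum' (or any date column) is never included.
time_std = df.groupby(["host", "operation"])["time"].std()
print(time_std)
```

The result is a Series indexed by (host, operation) pairs, so a single value can be read with time_std.loc[("host1.test", "read")].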

Update

As far as I can tell, std() calls aggregate() on the groupby result, and that code path has a subtle bug (see here - Python Pandas: Using Aggregate vs Apply to Define New Columns). To avoid it, you can use apply():

 byhostandop['time'].apply(lambda x: x.std()) 
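For comparison, here is a hedged sketch of the apply() workaround next to the built-in std() on a recent pandas version, where I would expect the two to agree. The data is made up; it also shows that a group with a single row yields NaN, because the default sample standard deviation (ddof=1) is undefined for one observation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups contain only one row each.
df = pd.DataFrame({
    "host": ["a.test", "a.test", "a.test", "b.test"],
    "operation": ["read", "read", "write", "read"],
    "time": [1.0, 2.0, 5.0, 7.0],
})

byhostandop = df.groupby(["host", "operation"])

# Workaround from the answer: compute std per group in Python.
via_apply = byhostandop["time"].apply(lambda x: x.std())

# Built-in aggregation on the selected column.
via_std = byhostandop["time"].std()

print(via_apply)
print(np.allclose(via_apply.values, via_std.values, equal_nan=True))
```

apply() calls the lambda once per group, so it is slower on large frames than the built-in std(), but it avoids the aggregation code path shown in the traceback.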

Source: https://habr.com/ru/post/1201955/

