How to apply a function to rolling subsets of a pandas DataFrame

I would like to get subsets of a DataFrame in a rolling fashion. I tried several things without success; here is an example of what I would like to do. Consider the following data.

    df
       var1  var2
    0    43    74
    1    44    74
    2    45    66
    3    46   268
    4    47    66

I would like to create a new column with the following function that executes a conditional sum:

    def func(x):
        tmp = (x["var1"] * (x["var2"] == 74)).sum()
        return tmp

and call it as follows:

    df["newvar"] = df.rolling(2, min_periods=1).apply(func)

That is, the function would be applied to each rolling window of the DataFrame (a sub-frame containing all columns), not to each row or column separately.

It would return:

       var1  var2  newvar
    0    43    74      43  # 43
    1    44    74      87  # 43 * 1 + 44 * 1
    2    45    66      44  # 44 * 1 + 45 * 0
    3    46   268       0  # 45 * 0 + 46 * 0
    4    47    66       0  # 46 * 0 + 47 * 0

Is there a pythonic way to do this? This is just an example; the real condition (always based on the values in the sub-frame) depends on more than two columns.

2 answers

Update

@unutbu wrote a great answer to a very similar question, but it seems his answer relies on pd.rolling_apply, which passes the index to the function. I'm not sure how to replicate this using the current DataFrame.rolling.apply method.

Original answer

It looks like the argument that apply passes to the function is an array holding the values of a single column (one column at a time), not a DataFrame, so unfortunately you don't have access to any of the other columns.

But you can use some logic to temporarily create a new column based on whether var2 is 74 or not, and then use the rolling method on it.

    df['new_var'] = df.var2.eq(74).mul(df.var1).rolling(2, min_periods=1).sum()

       var1  var2  new_var
    0    43    74     43.0
    1    44    74     87.0
    2    45    66     44.0
    3    46   268      0.0
    4    47    66      0.0

The temporary column comes from the first part of the code above.

    df.var2.eq(74).mul(df.var1)
    # or equivalently with operators
    # (df['var2'] == 74) * df['var1']

    0    43
    1    44
    2     0
    3     0
    4     0
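Since the question mentions that the real condition involves more than two columns, here is a minimal sketch of the same mask-then-roll idea with a compound condition. The second condition (var1 < 47) is purely hypothetical and only illustrates combining columns; it is not part of the original problem.

```python
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

# Build the condition on whole columns first; it can involve any number of columns.
# The "var1 < 47" part is a hypothetical extra condition for illustration.
mask = (df["var2"] == 74) & (df["var1"] < 47)

# Replace values in rows that fail the condition with 0, then take the rolling sum.
masked = df["var1"].where(mask, 0)
df["new_var"] = masked.rolling(2, min_periods=1).sum()

print(df)
```

This only works when the per-window function can be split into a row-wise step followed by a built-in rolling aggregation such as sum.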

Finding the type of the variable passed to apply

It is very important to know what is actually passed to the apply function, and I can’t always remember what is passed, so if I'm not sure, I will print the variable along with its type so that it is clear to me what kind of object I'm dealing with. See this example with the original DataFrame.

    def foo(x):
        print(x)
        print(type(x))
        return x.sum()

    df.rolling(2, min_periods=1).apply(foo)

Output

    [ 43.]
    <class 'numpy.ndarray'>
    [ 43.  44.]
    <class 'numpy.ndarray'>
    [ 44.  45.]
    <class 'numpy.ndarray'>
    [ 45.  46.]
    <class 'numpy.ndarray'>
    [ 46.  47.]
    <class 'numpy.ndarray'>
    [ 74.]
    <class 'numpy.ndarray'>
    [ 74.  74.]
    <class 'numpy.ndarray'>
    [ 74.  66.]
    <class 'numpy.ndarray'>
    [  66.  268.]
    <class 'numpy.ndarray'>
    [ 268.   66.]
    <class 'numpy.ndarray'>
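What gets passed depends on the raw argument of Rolling.apply. The sketch below assumes a pandas version that has this argument (0.23 or later): with raw=True each window arrives as a NumPy array, as in the output above, while with raw=False it arrives as a Series that keeps its index. The helper name record and the seen_types dictionary are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

# Collect the types of the objects handed to the function in each mode.
seen_types = {"raw": set(), "not_raw": set()}

def record(key):
    def inner(x):
        seen_types[key].add(type(x))
        return x.sum()
    return inner

df.rolling(2, min_periods=1).apply(record("raw"), raw=True)
df.rolling(2, min_periods=1).apply(record("not_raw"), raw=False)

print(seen_types)
```

In both modes the function still sees one column at a time, never the whole sub-frame.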

The trick is to define a function that has access to your entire DataFrame. Then you roll over any single column and call apply(), passing this function. The function receives the window data, which is a subset of that one column. From this subset you can extract the index range you need to look at. (This assumes that your index is strictly increasing, so a regular integer index will work, as will most time series.) You can then use that index to access the entire DataFrame with all of its columns.

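    def dataframe_roll(df):
        def my_fn(window_series):
            # Select the rows of the full frame that fall inside this window
            window_df = df[(df.index >= window_series.index[0]) & (df.index <= window_series.index[-1])]
            # apply() needs a single number per window, so reduce to a scalar
            # (sum is used here only as an example reduction)
            return (window_df["col1"] + window_df["col2"]).sum()
        return my_fn

    df["result"] = df["any_col"].rolling(24).apply(dataframe_roll(df), raw=False)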
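Applied to the data from the question, the same closure idea might look like the sketch below. The names make_window_func and window_func are invented for this example, and it looks rows up with .loc on the window's index labels instead of comparing index bounds, which assumes the index labels are unique.

```python
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

def make_window_func(frame):
    def window_func(window_series):
        # With raw=False the window is a Series; its index labels identify
        # the rows of the full frame that belong to this window.
        window_df = frame.loc[window_series.index]
        # The conditional sum from the question, computed on the sub-frame.
        return (window_df["var1"] * (window_df["var2"] == 74)).sum()
    return window_func

# Roll over any one column; it only serves to carry the window's index.
df["newvar"] = (
    df["var1"].rolling(2, min_periods=1).apply(make_window_func(df), raw=False)
)

print(df)
```

Because the function runs in Python once per window, this is likely to be much slower than the vectorized approach on large frames.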

Source: https://habr.com/ru/post/981662/

