Python pandas: how to run multiple univariate regressions by group

Suppose I have a DataFrame with one column holding the y variable and many columns holding x variables. I would like to run multiple univariate regressions (y vs x1, y vs x2, and so on) and save the predictions back to the DataFrame. I also need to do this separately for each level of a grouping variable.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.DataFrame({
        'y': np.random.randn(20),
        'x1': np.random.randn(20),
        'x2': np.random.randn(20),
        'grp': ['a', 'b'] * 10})

    def ols_res(x, y):
        return sm.OLS(y, x).fit().predict()

    # This does not work: apply passes only the group DataFrame, so y is missing
    df.groupby('grp').apply(ols_res)

The code above obviously does not work. I don't understand how to pass the fixed y to the function while having apply iterate over the x columns (x1, x2, ...). I suspect there is a clever one-line solution for this. Any ideas?

1 answer

The function you pass to apply must take a pandas.DataFrame as its first argument. You can pass additional keyword or positional arguments to apply, and they are forwarded to the applied function. So your example works with a small modification. Change ols_res to

    def ols_res(df, xcols, ycol):
        return sm.OLS(df[ycol], df[xcols]).fit().predict()

Then you can use groupby and apply like this:

    df.groupby('grp').apply(ols_res, xcols=['x1', 'x2'], ycol='y')

or, with positional arguments:

    df.groupby('grp').apply(ols_res, ['x1', 'x2'], 'y')
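To make the argument forwarding concrete, here is a minimal sketch on tiny hand-made data. The helper `slopes_through_origin` is hypothetical and not part of the answer: it swaps statsmodels for the closed-form slope of a least-squares fit without an intercept (which is also what `sm.OLS(y, x)` fits, since it adds no constant by default), so the numbers are easy to check by hand. Selecting the columns before `apply` is an assumption made to keep the grouping column out of the function's input.

```python
import pandas as pd

def slopes_through_origin(group, xcols, ycol):
    # `group` is the sub-DataFrame for one group; `xcols` and `ycol`
    # are the extra arguments that apply forwards unchanged.
    y = group[ycol]
    # Closed-form no-intercept slope for each x column: sum(x*y) / sum(x*x)
    return pd.Series({x: (group[x] * y).sum() / (group[x] ** 2).sum()
                      for x in xcols})

df = pd.DataFrame({
    'y':   [2.0, 3.0, 4.0, 9.0],
    'x1':  [1.0, 1.0, 2.0, 3.0],
    'x2':  [1.0, 2.0, 1.0, 0.0],
    'grp': ['a', 'b', 'a', 'b'],
})

grouped = df.groupby('grp')[['y', 'x1', 'x2']]

# Extra arguments as keywords ...
by_keyword = grouped.apply(slopes_through_origin, xcols=['x1', 'x2'], ycol='y')
# ... or positionally, right after the function.
by_position = grouped.apply(slopes_through_origin, ['x1', 'x2'], 'y')

print(by_keyword)
```

Both calls give one row per group and one column per x: slopes 2.0 and 3.0 for group a, 3.0 and 1.5 for group b.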

EDIT

The code above does not run several univariate regressions. Instead, it runs one multiple regression (y on all x columns together) for each group. With another small modification, it does run them separately:

    def ols_res(df, xcols, ycol):
        # One separate regression of y on each x column; one output column per x
        return pd.DataFrame({xcol: sm.OLS(df[ycol], df[xcol]).fit().predict()
                             for xcol in xcols})
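The difference between one joint fit and several separate fits is easy to see on a small example. The sketch below is only an illustration with made-up numbers; it uses NumPy least squares without an intercept as a stand-in for `sm.OLS`, and the variable names are assumptions.

```python
import numpy as np

# Hand-made data where y is exactly 1*x1 + 2*x2
x1 = np.array([1.0, 0.0, 1.0, 2.0])
x2 = np.array([0.0, 1.0, 1.0, 1.0])
y = 1.0 * x1 + 2.0 * x2

# One multiple regression: y on x1 and x2 together (no intercept)
X = np.column_stack([x1, x2])
beta_joint, *_ = np.linalg.lstsq(X, y, rcond=None)

# Two univariate regressions: y on x1 alone, then y on x2 alone (no intercept)
beta_x1 = (x1 @ y) / (x1 @ x1)
beta_x2 = (x2 @ y) / (x2 @ x2)

print(beta_joint)        # coefficients of the joint fit
print(beta_x1, beta_x2)  # slopes of the separate fits
```

The joint fit recovers the coefficients 1 and 2, while the separate fits give slopes 2 and 3, because each x also absorbs part of the other's effect. The joint fit therefore produces a single prediction column, and the separate fits produce one prediction column per x.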

EDIT 2

Although the solution above works, I think the following is a bit more pandas-y:

    import statsmodels.api as sm
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'y': np.random.randn(20),
        'x1': np.random.randn(20),
        'x2': np.random.randn(20),
        'grp': ['a', 'b'] * 10})

    def ols_res(x, y):
        return pd.Series(sm.OLS(y, x).fit().predict())

    # Outer apply runs once per group; inner apply runs ols_res on each x column
    df.groupby('grp').apply(lambda x: x[['x1', 'x2']].apply(ols_res, y=x['y']))

For some reason, if I define ols_res() as in the original question (returning the raw predictions without wrapping them in a pd.Series), the resulting DataFrame has no group label in its index.
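The question also asks for the predictions to be saved back to the DataFrame. One possible way, sketched here under assumptions, is to use groupby with transform, which returns values aligned to the original rows. The helper `fitted_through_origin` and the `pred_` column names are hypothetical, and the closed-form no-intercept fit is a stand-in for `sm.OLS(y, x).fit().predict()`.

```python
import numpy as np
import pandas as pd

def fitted_through_origin(x, y):
    # Fitted values of a no-intercept least-squares fit of y on a single x
    slope = (x * y).sum() / (x * x).sum()
    return slope * x

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'y': rng.normal(size=20),
    'x1': rng.normal(size=20),
    'x2': rng.normal(size=20),
    'grp': ['a', 'b'] * 10,
})

for xcol in ['x1', 'x2']:
    # transform calls the lambda once per group with that group's x values;
    # the matching y values are looked up through the group's index.
    df['pred_' + xcol] = df.groupby('grp')[xcol].transform(
        lambda x: fitted_through_origin(x, df.loc[x.index, 'y'])
    )

print(df.head())
```

Because transform keeps the original row order and index, the new columns line up with y, x1 and x2 without any reindexing.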


Source: https://habr.com/ru/post/971692/

