Python pandas: how to run multiple one-dimensional group regression

Question

Suppose I have a DataFrame with one y column and many x columns. I would like to run multiple one-dimensional regressions (y vs x1, y vs x2, and so on) and save the predictions back to the DataFrame. I also need to do this separately for each level of a group variable.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.DataFrame({
        'y': np.random.randn(20),
        'x1': np.random.randn(20),
        'x2': np.random.randn(20),
        'grp': ['a', 'b'] * 10})

    def ols_res(x, y):
        return sm.OLS(y, x).fit().predict()

    df.groupby('grp').apply(ols_res)  # This does not work

The code above clearly does not work. I don't understand how to pass the fixed y to the function while having apply iterate over the x columns (x1, x2, ...). I suspect there is a neat one-line solution for this. Any ideas?

+6

python pandas

ezbentley Jul 03 '14 at 4:12

1 answer

JaminSore · Accepted Answer · 2014-07-03T04:30:50+0000

The function you pass to apply must take a pandas.DataFrame as its first argument. You can pass additional keyword or positional arguments to apply, and they are forwarded to the applied function. So your example works with a small modification. Change ols_res to

    def ols_res(df, xcols, ycol):
        return sm.OLS(df[ycol], df[xcols]).fit().predict()

Then you can use groupby and apply, like this

 df.groupby('grp').apply(ols_res, xcols=['x1', 'x2'], ycol='y')

or

 df.groupby('grp').apply(ols_res, ['x1', 'x2'], 'y')
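
The argument forwarding described above can be checked with a small self-contained sketch. It is an illustration, not the code from the answer: the helper name `ols_coefs` is hypothetical, `np.linalg.lstsq` is swapped in for `sm.OLS` so that only numpy and pandas are needed, and the function returns the fitted coefficients instead of predictions. Like `sm.OLS(y, X)` without an added constant, the fit has no intercept. The columns are selected explicitly before `apply`, on the assumption that this avoids version-dependent handling of the grouping column.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable
df = pd.DataFrame({
    'y': rng.randn(20),
    'x1': rng.randn(20),
    'x2': rng.randn(20),
    'grp': ['a', 'b'] * 10,
})

def ols_coefs(group, xcols, ycol):
    # Hypothetical helper: least-squares fit of ycol on xcols, no intercept.
    X = group[xcols].to_numpy()
    y = group[ycol].to_numpy()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(beta, index=xcols)

cols = ['y', 'x1', 'x2']
# Extra arguments after the function are forwarded to it, as keywords ...
by_kw = df.groupby('grp')[cols].apply(ols_coefs, xcols=['x1', 'x2'], ycol='y')
# ... or positionally, in the order the function declares them.
by_pos = df.groupby('grp')[cols].apply(ols_coefs, ['x1', 'x2'], 'y')
```

Both calls should give the same table, with one row per group and one coefficient column per regressor.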

EDIT

The code above does not perform several one-dimensional regressions. Instead, it fits one multivariate regression per group. With another (small) modification, however, it will.

    def ols_res(df, xcols, ycol):
        return pd.DataFrame({xcol: sm.OLS(df[ycol], df[xcol]).fit().predict()
                             for xcol in xcols})

EDIT 2

Although the above solution works, I think the following is a bit more pandas-y

    import statsmodels.api as sm
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'y': np.random.randn(20),
        'x1': np.random.randn(20),
        'x2': np.random.randn(20),
        'grp': ['a', 'b'] * 10})

    def ols_res(x, y):
        return pd.Series(sm.OLS(y, x).fit().predict())

    df.groupby('grp').apply(lambda x: x[['x1', 'x2']].apply(ols_res, y=x['y']))

For some reason, if I define ols_res() as it was originally, the resulting DataFrame does not have a group label in the index.
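
The question also asks for the predictions to be saved back to the DataFrame. One possible sketch, under stated assumptions, uses `groupby(...).transform`, whose result is aligned with the original index and so can be assigned directly as a new column. The helper `fitted_values` and the `pred_` column prefix are hypothetical names. The closed-form slope sum(x*y) / sum(x*x) is swapped in for `sm.OLS`; it is the least-squares slope of a single-regressor fit with no intercept, which is the model `sm.OLS(y, x)` fits when no constant is added.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable
df = pd.DataFrame({
    'y': rng.randn(20),
    'x1': rng.randn(20),
    'x2': rng.randn(20),
    'grp': ['a', 'b'] * 10,
})

def fitted_values(x, y):
    # x: one group's values of a single regressor; y: the full response column.
    y_group = y.loc[x.index]                     # rows of y belonging to this group
    beta = (x * y_group).sum() / (x * x).sum()   # no-intercept least-squares slope
    return x * beta

for xcol in ['x1', 'x2']:
    # Keyword arguments given to transform are forwarded to fitted_values.
    df['pred_' + xcol] = df.groupby('grp')[xcol].transform(fitted_values, y=df['y'])
```

After the loop, `df` carries one prediction column per regressor, each fitted separately within every group. If an intercept or the full statsmodels results object is needed, the body of the helper could call statsmodels instead.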
