I am working with PySpark and would like to find a way to perform linear regressions on groups of data. In particular, given this data frame:
import pandas as pd
pdf = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                    'x': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
                    'y': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})
df = sqlContext.createDataFrame(pdf)
df.show()
Now I would like to fit a separate model y ~ a*x + b for each group_id, and output a new data frame with columns a and b and one row per group.
For example, for group 1 I could do:
from sklearn import linear_model
# Regression on group_id = 1
data = df.where(df.group_id == 1).toPandas()
regr = linear_model.LinearRegression()
regr.fit(data.x.values.reshape(len(data), 1), data.y.values.reshape(len(data), 1))
a = regr.coef_[0][0]
b = regr.intercept_[0]
print('For group 1, y = {0}*x + {1}'.format(a, b))
# Repeat for group_id=2, group_id=3
But to do this I have to bring each group's data back to the driver, which does not take advantage of any Spark parallelism.
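For reference, the per-group coefficients can also be computed with the closed-form simple-regression formulas in plain pandas. This is a minimal driver-side sketch for cross-checking the expected output (the hypothetical `ols` helper is mine, not part of any library), not a parallel Spark solution:

```python
import pandas as pd

pdf = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                    'x': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
                    'y': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})

def ols(g):
    # Closed-form simple linear regression:
    # a = cov(x, y) / var(x), b = mean(y) - a * mean(x)
    mx, my = g.x.mean(), g.y.mean()
    a = ((g.x - mx) * (g.y - my)).sum() / ((g.x - mx) ** 2).sum()
    b = my - a * mx
    return pd.Series({'a': a, 'b': b})

# One row per group, with columns a and b
coefs = pdf.groupby('group_id').apply(ols)
print(coefs)
```

This gives the shape of the result I am after (a row per group_id with a and b), but it still runs entirely on one machine.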