First, the "model" you mean from Mahout is not a model but a pre-computed list of recommendations. You can do this with Spark too: compute recommendations for users in batch and save them wherever you like. That has nothing to do with serializing a model. If you don't need real-time updates or scoring, you can stop there and just use Spark for the batch computation, exactly as you do with Mahout.
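As a sketch of that batch approach, the following computes a top-N recommendation list per user with Spark MLlib's ALS and writes it out. The input path, output path, CSV format, and the hyperparameter values are all illustrative assumptions, not details from the answer:

```scala
// Sketch only: pre-compute recommendations in batch, the "Mahout sense" of a model.
// Paths, input format, and ALS hyperparameters are assumptions for illustration.
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object BatchRecommend {
  // Parse one "user,product,rating" CSV line into a Rating.
  def parseRating(line: String): Rating = {
    val Array(user, product, rating) = line.split(',')
    Rating(user.toInt, product.toInt, rating.toDouble)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "batch-recommend")
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map(parseRating)
    // rank = 10, iterations = 10, lambda = 0.01: placeholder values.
    val model = ALS.train(ratings, 10, 10, 0.01)
    // This RDD of (user, top-10 recommendations) is the "model" in the Mahout sense.
    model.recommendProductsForUsers(10)
      .saveAsTextFile("hdfs:///output/recommendations")
    sc.stop()
  }
}
```

`recommendProductsForUsers` is available from Spark 1.4 onward; on older versions you would score candidate products per user with `predict` instead.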
But I agree that in many cases you want to ship the model somewhere else and serve it there. As you can see, other models in Spark are Serializable, but not MatrixFactorizationModel. (Yes, even though it is marked as such, it will not actually serialize.) Similarly, there is a standard serialization format for predictive models, PMML, but it has no vocabulary for a factored matrix model.
The reason is actually the same in both cases. Whereas many predictive models, such as an SVM or a logistic regression model, are just a small set of coefficients, a factored matrix model is huge: it contains two matrices with potentially billions of elements. That is why I think PMML has no reasonable encoding for it.
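Some back-of-the-envelope arithmetic makes the size gap concrete. The user, item, and feature counts below are illustrative assumptions, not numbers from the answer:

```scala
// Illustrative arithmetic (hypothetical sizes): why a factored matrix model
// dwarfs a coefficient-vector model like logistic regression.
object ModelSize {
  val numUsers = 10000000L      // 10M users (hypothetical)
  val numItems = 1000000L       // 1M items (hypothetical)
  val rank = 50                 // latent factors per user/item row
  val bytesPerDouble = 8L

  // Two factor matrices: (users x rank) and (items x rank).
  val factorBytes: Long = (numUsers + numItems) * rank * bytesPerDouble

  // A generously sized logistic regression: one coefficient per feature.
  val numFeatures = 10000L
  val lrBytes: Long = numFeatures * bytesPerDouble

  def main(args: Array[String]): Unit = {
    println(f"factored matrix model: ${factorBytes / 1e9}%.1f GB")   // ~4.4 GB
    println(s"logistic regression:   ${lrBytes / 1024} KB")          // ~78 KB
  }
}
```

Even at these modest sizes the factor matrices run to gigabytes, which is why the factored model lives in RDDs rather than in a small serializable object.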
Similarly, in Spark, this means the model wraps actual RDDs of factor matrices, which cannot be serialized directly. You can save these RDDs to storage, re-read them elsewhere with Spark, and recreate a MatrixFactorizationModel by hand that way.
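A minimal sketch of that manual round trip, assuming hypothetical HDFS paths: persist the two factor RDDs with `saveAsObjectFile`, then read them back and pass them to the public `MatrixFactorizationModel(rank, userFeatures, productFeatures)` constructor:

```scala
// Sketch: persist the factor RDDs, then rebuild the model elsewhere.
// The HDFS paths here are assumptions for illustration.
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

object FactorStore {
  def save(model: MatrixFactorizationModel): Unit = {
    model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
    model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")
  }

  def load(sc: SparkContext, rank: Int): MatrixFactorizationModel = {
    val userFeatures: RDD[(Int, Array[Double])] =
      sc.objectFile("hdfs:///models/als/userFeatures")
    val productFeatures: RDD[(Int, Array[Double])] =
      sc.objectFile("hdfs:///models/als/productFeatures")
    // The model is just (rank, userFeatures, productFeatures) glued together.
    new MatrixFactorizationModel(rank, userFeatures, productFeatures)
  }
}
```

Note that you must carry the `rank` along yourself, since it is not recoverable from the object files without reading a row.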
You can't serve or update the model with Spark alone, though. For that, you're really looking at writing code to perform updates and compute recommendations on the fly.
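One common form such on-the-fly update code takes (an illustration I'm adding, not code from Spark or Oryx) is the ALS "fold-in": holding the item-factor matrix Y fixed, a new or changed user's vector is the ridge least-squares solution x = (YᵀY + λI)⁻¹ Yᵀ r over the items that user rated. Since the system is only rank × rank, it can be solved cheaply at serving time:

```scala
// Sketch of an on-the-fly user update ("fold-in") against fixed item factors.
// Pure Scala, no Spark; all numbers and names are illustrative.
object FoldIn {
  // Solve A x = b by Gaussian elimination with partial pivoting.
  // A is tiny (rank x rank), so a dense direct solve is fine.
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val n = b.length
    val m = a.map(_.clone)
    val y = b.clone
    for (col <- 0 until n) {
      val p = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmpRow = m(col); m(col) = m(p); m(p) = tmpRow
      val tmpVal = y(col); y(col) = y(p); y(p) = tmpVal
      for (r <- col + 1 until n) {
        val f = m(r)(col) / m(col)(col)
        for (c <- col until n) m(r)(c) -= f * m(col)(c)
        y(r) -= f * y(col)
      }
    }
    val x = new Array[Double](n)
    for (r <- n - 1 to 0 by -1) {
      var s = y(r)
      for (c <- r + 1 until n) s -= m(r)(c) * x(c)
      x(r) = s / m(r)(r)
    }
    x
  }

  // Fold in one user: ratedItemFactors are the rows of Y for items the
  // user rated, ratings is r, lambda is the ridge regularizer.
  def userVector(ratedItemFactors: Array[Array[Double]],
                 ratings: Array[Double],
                 lambda: Double): Array[Double] = {
    val k = ratedItemFactors(0).length
    val ata = Array.ofDim[Double](k, k)   // accumulates Y'Y
    val atb = new Array[Double](k)        // accumulates Y'r
    for ((row, i) <- ratedItemFactors.zipWithIndex; c1 <- 0 until k) {
      atb(c1) += row(c1) * ratings(i)
      for (c2 <- 0 until k) ata(c1)(c2) += row(c1) * row(c2)
    }
    for (c <- 0 until k) ata(c)(c) += lambda
    solve(ata, atb)
  }
}
```

The resulting vector is then dotted against item factors to score recommendations, without re-running a full ALS job.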
I don't mind suggesting the Oryx project here, since its point is to manage exactly this aspect, particularly for ALS recommendation. In fact, the project