Spark - How to use a trained recommendation model in production?

I am using Spark to build a prototype recommendation system. After following a few tutorials, I was able to train a MatrixFactorizationModel on my data.

However, the model trained by Spark MLlib is just Serializable. How can I use this model to make recommendations for real users? I mean, how can I save the model to some database, or update it when new user data comes in?

For example, a model trained by the Mahout recommendation library can be stored in a database such as Redis, and then we can query for the list of recommended items later. But how can we do similar things in Spark? Any suggestions?

+5
4 answers

First, the "model" you mean from Mahout is not a model, but a pre-computed list of recommendations. You can also do this with Spark: compute recommendations for users in batch and save them anywhere you like. This has nothing to do with serializing the model. If you don't need real-time updates or scoring, you can stop there and just use Spark for batch, just as you do with Mahout.
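For illustration, here is a minimal spark-shell-style sketch of that batch approach. It is not from the original answer: the input/output paths and ALS parameters are placeholders, and recommendProductsForUsers is only available in more recent Spark releases.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val sc = new SparkContext(new SparkConf().setAppName("batch-recs"))

// (user, item, rating) triples from wherever your interaction data lives.
val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
  val Array(user, item, rating) = line.split(',')
  Rating(user.toInt, item.toInt, rating.toDouble)
}

// rank = 10, iterations = 10, lambda = 0.01 -- illustrative values only.
val model = ALS.train(ratings, 10, 10, 0.01)

// Pre-compute the top 10 items per user in one batch pass
// (recommendProductsForUsers is available from Spark 1.4 on).
val topN = model.recommendProductsForUsers(10)

// Persist the recommendation lists anywhere you like -- plain files here,
// but you could just as well push them into Redis or another key-value store.
topN
  .mapValues(_.map(r => s"${r.product}:${r.rating}").mkString(","))
  .saveAsTextFile("hdfs:///output/user-recommendations")
```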

But I agree that in many cases you want to ship the model somewhere else and serve it there. As you note, other models in Spark are Serializable, but MatrixFactorizationModel is not. (Yes, even though it is marked as such, it will not serialize.) Similarly, there is a standard serialization format for predictive models called PMML, but it has no vocabulary for a factored matrix model.

The reason is actually the same. Whereas many predictive models, like an SVM or a logistic regression model, are just a small set of coefficients, a factored matrix model is huge, containing two matrices with potentially billions of elements. That is why I think PMML has no reasonable encoding for it.

Similarly, in Spark this means the actual matrices are RDDs, which cannot be serialized directly. You can persist these RDDs to storage, re-read them elsewhere using Spark, and recreate a MatrixFactorizationModel by hand that way.
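To make that concrete, a minimal sketch of that approach might look like the following, assuming model is the trained MatrixFactorizationModel, sc is a SparkContext, the rank used at training time was 10, and the HDFS paths are placeholders:

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Persist the two factor matrices; the paths are placeholders.
model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

// ... later, in a different Spark application ...
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")

// Rebuild the model by hand from the rank and the two factor RDDs
// (the constructor is public in recent Spark versions).
val restored = new MatrixFactorizationModel(10, userFeatures, productFeatures)
```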

You can't serve or update the model using Spark, though. For that, you are really looking at writing code yourself to perform updates and compute recommendations on the fly.
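To illustrate what "computing recommendations on the fly" amounts to, here is a small hypothetical sketch, not any particular library's API: once the factor vectors have been exported (for example to a key-value store), a predicted rating is just the dot product of a user vector and an item vector, so no Spark is needed at request time.

```scala
// Dot product of a user factor vector and an item factor vector.
def score(userVec: Array[Double], itemVec: Array[Double]): Double =
  userVec.zip(itemVec).map { case (u, i) => u * i }.sum

// Rank candidate items for one user by predicted rating and keep the best n.
def topN(userVec: Array[Double],
         itemVecs: Map[Int, Array[Double]],
         n: Int): Seq[(Int, Double)] =
  itemVecs.toSeq
    .map { case (itemId, itemVec) => (itemId, score(userVec, itemVec)) }
    .sortBy { case (_, s) => -s }
    .take(n)
```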

I don't mind suggesting the Oryx project here, since its point is to manage exactly this aspect, particularly for ALS recommendation. In fact, the project…

+8

Another way to build recommendations with Spark is the search-engine method. This is basically a cooccurrence recommender served by Solr or Elasticsearch. Comparing factorization with cooccurrence is beyond the scope of this question, so I will simply describe the latter.

You feed the interactions (user-id, item-id) into Mahout's spark-itemsimilarity. This produces a list of similar items for every item seen in the interaction data. It comes out as CSV by default, so it can be stored anywhere, but it then needs to be indexed by the search engine.

In any case, when you want to fetch recommendations, you use the user's history as the query and get back an ordered list of items as recommendations.

One benefit of this method is that indicators can be calculated for as many user actions as you want. Any action the user takes that correlates with what you want to recommend can be used. For instance, you may want to recommend purchases, but you also record product views. If you treat product views exactly like purchases, you are likely to get worse recommendations (I have tried it). However, if you calculate one indicator for purchases and another (actually a cross-cooccurrence) indicator for product views, they are equally predictive of purchases. This has the effect of increasing the amount of data used for recommendations. The same kind of thing can be done with user locations to blend location information into purchase recommendations.

You can also bias your recommendations based on context. If the user is in the "Electronics" section of the catalog, you may want the recommendations to be skewed towards electronics. Add "electronics" to the query against the item's category metadata field and give it a boost in the query, and you have biased recommendations.

Since all of the biasing and mixing of indicators happens in the query, the recommendation engine can easily be tuned to multiple contexts while maintaining only one multi-field query made through a search engine. We get scalability from Solr or Elasticsearch.

One other benefit of either factorization or the search method is that entirely new users and new history can be used to create recommendations, whereas the older Mahout recommenders could only recommend to users and interactions that were known when the job was run.

Descriptions here:

+2

You should run model.predictAll() on a reduced RDD of (user, product) pairs, just like in the Mahout Hadoop job, and store the results for online use...

https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
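A rough Scala equivalent of that suggestion is sketched below: predictAll in the Python API corresponds to predict on an RDD[(Int, Int)] in Scala. The names ratings and model, the candidate-pair construction, and the output path are assumptions for illustration, not part of the original answer.

```scala
// `ratings` and `model` are assumed to come from a prior ALS training step.
val users = ratings.map(_.user).distinct()
val products = ratings.map(_.product).distinct()

// In practice, restrict this to a reduced candidate set -- a full
// users x products cross join is usually far too large.
val userProductPairs = users.cartesian(products)

// Score every candidate pair in batch; returns an RDD[Rating].
val predictions = model.predict(userProductPairs)

predictions
  .map(r => s"${r.user},${r.product},${r.rating}")
  .saveAsTextFile("hdfs:///output/offline-predictions")
```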

0

You can use the model's save(sparkContext, outputFolder) function to save the model to a folder of your choice. While serving real-time recommendations, you just have to use the MatrixFactorizationModel.load(sparkContext, modelFolder) function to load it back as a MatrixFactorizationModel object.
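In Scala that might look like the following minimal sketch; the path and the user id are placeholders, and save/load require Spark 1.3 or later.

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Save the trained model; the path is a placeholder.
model.save(sc, "hdfs:///models/als-model")

// ... in the serving application, with its own SparkContext ...
val loaded = MatrixFactorizationModel.load(sc, "hdfs:///models/als-model")

// Recommend the top 5 products for a hypothetical user id 42.
loaded.recommendProducts(42, 5)
  .foreach(r => println(s"item=${r.product} score=${r.rating}"))
```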

Question to @Sean Owen: doesn't the MatrixFactorizationModel object contain the user and item feature matrices, rather than recommendations / predicted ratings?

0

Source: https://habr.com/ru/post/1210531/

