Continuous Collaborative Filtering Using Mahout

I am currently evaluating Mahout for collaborative filtering. So far it looks great. We have almost 20M boolean preferences from 12M distinct users. According to the Mahout wiki and several answers by Sean Owen, one machine should be enough in this case, so I decided to go with MySQL as the data model and skip the overhead of Hadoop for now.

One thing eludes me, though: what are the best practices for continuously updating the recommendations without recomputing everything from scratch? We receive tens of thousands of new preferences every day. I do not expect real-time processing, but I would like new data to be incorporated every 15 minutes or so.

Please describe deployment approaches for both the MySQL and the Hadoop setups. Thanks!

1 answer

Any database is too slow for real-time queries, so every approach involves caching the data set in memory, which I assume you are already doing with ReloadFromJDBCDataModel. Just call refresh() to reload it at any time; it should do so in the background. The catch is that loading the new model requires a lot of memory while the old one is still in use. You could roll your own solution that, say, reloads one user at a time.
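The swap the answer describes can be sketched without Mahout at all (class and method names below are illustrative, not the Mahout API): queries keep reading the current in-memory model while a replacement is built in the background, and an atomic reference swap publishes the new one. The memory cost comes from both snapshots being resident during the reload.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Minimal double-buffered reload sketch (hypothetical names, not Mahout
// code): readers always see a complete, consistent model; refresh()
// installs a freshly built snapshot atomically.
public class SwappableModel {
    // user id -> recommended item, as a stand-in for a real data model
    private final AtomicReference<Map<Long, String>> current;

    public SwappableModel(Map<Long, String> initial) {
        this.current = new AtomicReference<>(initial);
    }

    public String recommend(long userId) {
        return current.get().getOrDefault(userId, "none");
    }

    // Called from a background thread every ~15 minutes; while the new
    // snapshot is being built, both copies occupy memory simultaneously.
    public void refresh(Map<Long, String> freshSnapshot) {
        current.set(freshSnapshot); // old model becomes garbage once readers finish
    }
}
```

A ScheduledExecutorService can drive the periodic refresh; Mahout's ReloadFromJDBCDataModel follows the same pattern when its refresh() is invoked.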

There is no such thing as real-time updates on Hadoop. Hadoop is best used to compute the results fully and correctly in batch, and then to adjust them at runtime (imperfectly) based on new data, inside the application that holds and serves the recommendations.
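The batch-plus-runtime-adjustment idea can be sketched as follows (all names hypothetical, not Mahout or Hadoop API): the Hadoop job periodically replaces a precomputed recommendation list per user, and at request time the serving application applies a cheap, imperfect correction, such as dropping items the user has interacted with since the last batch run.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of serving batch results with a runtime correction layer
// (illustrative only): batchRecs is overwritten after each Hadoop run;
// seenSinceBatch tracks preferences that arrived since then, so we stop
// recommending items the user already picked up.
public class BatchPlusDelta {
    private final Map<Long, List<String>> batchRecs = new HashMap<>();
    private final Map<Long, Set<String>> seenSinceBatch = new HashMap<>();

    // Load (or replace) the precomputed recommendations for one user.
    public void loadBatch(long userId, List<String> recs) {
        batchRecs.put(userId, recs);
        seenSinceBatch.remove(userId); // new batch already reflects old deltas
    }

    // Record a preference that arrived after the last batch run.
    public void recordNewPreference(long userId, String item) {
        seenSinceBatch.computeIfAbsent(userId, k -> new HashSet<>()).add(item);
    }

    // Serve batch results, imperfectly adjusted for post-batch activity.
    public List<String> recommend(long userId) {
        Set<String> seen = seenSinceBatch.getOrDefault(userId, Collections.emptySet());
        List<String> out = new ArrayList<>();
        for (String item : batchRecs.getOrDefault(userId, Collections.emptyList())) {
            if (!seen.contains(item)) {
                out.add(item);
            }
        }
        return out;
    }
}
```

The adjustment here is deliberately crude; the full, correct ranking only ever comes from the next batch computation.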


Source: https://habr.com/ru/post/1382230/
