Recommendations using R with SimpleDB or BigQuery or using PHP with SimpleDB

Question


I am currently working on a system that generates product recommendations, similar to Amazon's "People who bought this also bought this."

Current scenario:

Extract the customer's Google Analytics data and load it into the database.
When a product page is loaded on the customer's website, an API call is made to fetch recommendations for the product being viewed.
When the API receives the product identifier in a request, it searches the database, extracts the recommended product identifiers (using association rules), and sends them back in the response.
The client side then uses this list of product identifiers to look up product information (image, price, ...) and display it on the website.
I am currently using PHP and MySQL with the gapi package and a REST storage API on Amazon EC2.
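
For illustration, the association-rule lookup in the third step could be sketched roughly as below. This is a minimal in-memory sketch in Python, not the PHP/MySQL implementation described above; the function names (`build_rules`, `recommend`), the `min_support` threshold, and the basket data are all assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations


def build_rules(baskets, min_support=2):
    """Return {antecedent: [(consequent, confidence), ...]} for item pairs.

    baskets is an iterable of sets of product IDs (one set per order).
    Confidence of A -> B is count(A and B) / count(A).
    """
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), n in pair_counts.items():
        if n < min_support:
            continue  # pair seen too rarely to trust
        rules.setdefault(a, []).append((b, n / item_counts[a]))
        rules.setdefault(b, []).append((a, n / item_counts[b]))
    # Strongest rules first; ties broken by product ID for stable output
    for consequents in rules.values():
        consequents.sort(key=lambda rc: (-rc[1], rc[0]))
    return rules


def recommend(rules, product_id, limit=3):
    """Look up the top recommended product IDs for one viewed product."""
    return [pid for pid, _ in rules.get(product_id, [])[:limit]]


# Hypothetical order history; the product IDs are made up.
baskets = [
    {"p1", "p2", "p3"},
    {"p1", "p2"},
    {"p1", "p3"},
    {"p2", "p4"},
    {"p1", "p2", "p4"},
]
rules = build_rules(baskets)
print(recommend(rules, "p1"))
```

In a real deployment the rules would be computed ahead of time and stored in the database, so the API call only performs the lookup.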

My question is: if I have to choose one of the following options, which would be the best choice for implementing the above?

PHP with SimpleDB or BigQuery.
R with BigQuery.
RHIPE (R and Hadoop) with SimpleDB.
Apache Mahout.

Please help!

+6

r amazon-simpledb hadoop mahout google-bigquery

samridhi Aug 19 '11 at 12:33


2 answers

If you want a real-time API for recommendations based on data in a database, Apache Mahout does this directly. You want to use ReloadFromJDBCDataModel, put a GenericItemBasedRecommender on top, and use the servlet-based wrapper in the examples module. It will probably take a day or two to get familiar with the code and customize it for your needs, but it's pretty simple.
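
To give a feel for what an item-based recommender computes, here is a rough conceptual sketch in plain Python. It is not Mahout's API and does not reproduce its exact scoring; the function names, the choice of Tanimoto similarity over yes/no purchase data, and the sample data are all assumptions for illustration.

```python
def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) similarity between two sets of user IDs."""
    union = len(users_a | users_b)
    return len(users_a & users_b) / union if union else 0.0


def recommend_for_user(purchases, user_id, limit=2):
    """Score unseen items by summed similarity to the user's own items.

    purchases maps user ID -> set of item IDs (boolean preferences).
    """
    # Invert to item -> set of users who bought it
    buyers = {}
    for user, items in purchases.items():
        for item in items:
            buyers.setdefault(item, set()).add(user)

    owned = purchases.get(user_id, set())
    scores = {}
    for candidate, candidate_buyers in buyers.items():
        if candidate in owned:
            continue  # never recommend something already bought
        score = sum(tanimoto(candidate_buyers, buyers[item]) for item in owned)
        if score > 0:
            scores[candidate] = score
    # Highest score first; ties broken by item ID for stable output
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:limit]


# Hypothetical purchase data; user and product IDs are made up.
purchases = {
    "u1": {"p1", "p2"},
    "u2": {"p1", "p2", "p3"},
    "u3": {"p2", "p3"},
    "u4": {"p1", "p4"},
}
print(recommend_for_user(purchases, "u1"))
```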

Once you get past about 100M data points, you will need to look at distributing the computation with Hadoop. That is a little trickier. Mahout has a distributed recommender that you can configure.

+1

Sean Owen Aug 20 '11 at 6:16


Iterator · Accepted Answer · 2011-08-19T23:10:18+0000

This is not easy to answer, because the constraints are quite specialized.

A few considerations:

BigQuery is not open to the public yet. With such a small user base, even if you are in the preview program, it will be harder to get advice on improvements.
Each of your options pairs a modeling system with a storage system. Apache Mahout is not a storage engine, so it will not necessarily work on its own. I used to believe that its machine learning implementations were a hodgepodge of several Google Summer of Code projects, but I have revised that opinion at a commenter's suggestion. It still seems to have rather uneven, spotty coverage of various algorithms, and it is not particularly clear how the components are maintained or supported. I encourage a Mahout evangelist to address this.

Taken together, this rules out the 1st, 2nd, and 4th options.

What I do not quite understand is why the real-time server would need Hadoop and RHIPE. That work belongs in your batch processing, where the recommendation models are developed, not in real time. I suppose you could use RHIPE as a simple, general-purpose interface for running queries.

I would recommend using RApache instead of RHIPE, because you can preload your packages and models. I don't see the benefit of using Hadoop on the front end, but it would be a very natural back-end system for fitting the model.
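
The split between batch fitting and real-time serving can be sketched roughly as follows. This is an illustrative single-machine Python sketch rather than Hadoop or R code; the function names, the JSON file format, and the basket data are assumptions, and a real batch job would run the fitting step on the cluster.

```python
import json
import os
import tempfile
from collections import Counter
from itertools import combinations


def fit_top_n(baskets, n=2):
    """Batch step: count co-purchases and keep the n strongest per product."""
    pair_counts = Counter()
    for basket in baskets:
        pair_counts.update(combinations(sorted(set(basket)), 2))

    neighbours = {}
    for (a, b), count in pair_counts.items():
        neighbours.setdefault(a, []).append((count, b))
        neighbours.setdefault(b, []).append((count, a))
    # Highest count first; ties broken by product ID for stable output
    return {
        product: [other for _, other in sorted(pairs, key=lambda p: (-p[0], p[1]))[:n]]
        for product, pairs in neighbours.items()
    }


def save_model(model, path):
    """Write the fitted table where the serving process can load it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle, sort_keys=True)


def load_model(path):
    """Serving side: a single cheap read at startup, no model fitting."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical nightly run over made-up baskets
baskets = [{"p1", "p2"}, {"p1", "p2", "p3"}, {"p2", "p3"}, {"p1", "p4"}]
model_path = os.path.join(tempfile.mkdtemp(), "recommendations.json")
save_model(fit_top_n(baskets), model_path)
print(load_model(model_path)["p1"])
```

The point of the design is that the expensive counting happens offline, while the front end only ever reads a small precomputed table.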

(Update 1) Other interface options include Rserve (http://www.rforge.net/Rserve/) and possibly RStudio in server mode. There are R/PHP interfaces (see the comments below), but I suspect it would be better to access R via HTTP or TCP/IP.

(Update 2) Looking at the whole process, the main idea I see is that you can query the data from PHP and pass it to R. If you would rather query from R, look at the link in the comments (to the Omegahat tools) or ask a new question about R and SimpleDB; I'm sure someone else on SO can give better insight into that particular connection. RApache lets you start many R processes that are already prepared, with packages and data loaded in RAM, so you only need to pass in the data required for prediction. If your new data is a small vector, then RApache should be fine, and that seems likely for real-time data.
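
The "preloaded model, small request" pattern can be illustrated with a tiny HTTP handler. The sketch below uses Python's WSGI interface as a stand-in for RApache, so it shows only the shape of the idea, not RApache's configuration; the `MODEL` table, the `product` query parameter, and the `call` helper are assumptions made up for the example.

```python
import json
from urllib.parse import parse_qs

# Loaded once at import time, so every request reuses the in-memory model.
# The table contents are made up; in practice they would come from the batch fit.
MODEL = {
    "p1": ["p2", "p3"],
    "p2": ["p1"],
}


def application(environ, start_response):
    """WSGI entry point: ?product=p1 -> JSON list of recommended IDs."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    product_id = query.get("product", [""])[0]
    body = json.dumps(
        {"product": product_id, "recommendations": MODEL.get(product_id, [])}
    ).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]


def call(app, query_string):
    """Invoke the app in-process, without a network socket (for a quick check)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app({"QUERY_STRING": query_string}, start_response)
    return captured["status"], json.loads(b"".join(chunks))


print(call(application, "product=p1"))
```

To try it over real HTTP, the same `application` callable can be passed to the standard library's `wsgiref.simple_server.make_server`. Only the product identifier travels with each request; the model itself stays in memory.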
