I am having performance issues with precomputed item-item similarities in Mahout.
I have 4 million users and roughly the same number of items, with about 100 MB of user-item preferences. I want to build a content-based recommender based on the cosine similarity of the documents' TF-IDF vectors. Since computing this on the fly is slow, I precomputed the pairwise similarities for the 50 most similar documents of each document as follows:
- I used seq2sparse to create TF-IDF vectors.
- I used mahout rowId to create the Mahout matrix.
- I used mahout rowSimilarity -i INPUT/matrix -o OUTPUT -r 4587604 --similarityClassname SIMILARITY_COSINE -m 50 -ess to create the 50 most similar documents.
I used Hadoop to precompute all of this. For 4 million items, the output was only 2.5 GB.
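To make clear what I mean by "the 50 most similar documents", here is a toy, single-machine sketch of the computation in plain Java. It is not Mahout's implementation: the class name CosineTopN, the sparse-vector representation (term ID to weight) and the brute-force loop are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CosineTopN {

    // Euclidean norm of a sparse vector stored as termId -> weight.
    static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity of two sparse vectors; 0.0 if either vector is empty.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        // Iterate over the smaller map and look terms up in the larger one.
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Indices of up to n documents most similar to docs.get(query), excluding the query itself.
    static List<Integer> topN(List<Map<Integer, Double>> docs, int query, int n) {
        final double[] sims = new double[docs.size()];
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (i == query) {
                continue; // comparable to excluding self-similarity
            }
            sims[i] = cosine(docs.get(query), docs.get(i));
            candidates.add(i);
        }
        // Sort candidates by descending similarity.
        candidates.sort((x, y) -> Double.compare(sims[y], sims[x]));
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }
}
```

This brute-force version is quadratic in the number of documents, which is why I ran the real computation as a Hadoop job.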
I then loaded the contents of the files produced by the reducers into Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = ..., using docIndex to decode the document IDs. They were already integers, but rowId re-encoded them starting from 1, so I have to map them back.
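The decoding step looks roughly like the sketch below. This is a simplified stand-in, not my actual loader: the in-memory maps replace the reducer output and the docIndex file, and the names RowIdDecoder and SimilarPair are hypothetical. In the real code each decoded pair becomes one GenericItemSimilarity.ItemItemSimilarity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RowIdDecoder {

    // One decoded pair: original item IDs plus their similarity (hypothetical holder type).
    static final class SimilarPair {
        final long itemA;
        final long itemB;
        final double similarity;

        SimilarPair(long itemA, long itemB, double similarity) {
            this.itemA = itemA;
            this.itemB = itemB;
            this.similarity = similarity;
        }
    }

    // rows:     row index -> (column index -> similarity), one entry per output row
    // docIndex: row index -> original document ID as text
    static List<SimilarPair> decode(Map<Integer, Map<Integer, Double>> rows,
                                    Map<Integer, String> docIndex) {
        List<SimilarPair> out = new ArrayList<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
            // Map the row index back to the original integer ID.
            long itemA = Long.parseLong(docIndex.get(row.getKey()));
            for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
                long itemB = Long.parseLong(docIndex.get(cell.getKey()));
                out.add(new SimilarPair(itemA, itemB, cell.getValue()));
            }
        }
        return out;
    }
}
```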
For recommendation, I use the following code:
// corrMatrix is the collection of precomputed pairs loaded above
ItemSimilarity similarity = new GenericItemSimilarity(corrMatrix);
CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems());
MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems());
Recommender recommender = new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);
I am testing with a limited data model (1.6M items), but I loaded all the item-item pairwise similarities into memory. I manage to fit everything in main memory using 40 GB.
When I want to make a recommendation for one user:
Recommender cachingRecommender = new CachingRecommender(recommender);
List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);
The recommendation call took 554.938583083 [...]. [...] CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy [...].
[...]
Also, why does 2.5 GB of files take 40 GB once loaded into the Collection<GenericItemSimilarity.ItemItemSimilarity> in Mahout? [...] IntWritable, VectorWritable hashMap [...] ItemItemSimilarity [...]?
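For context on the 2.5 GB versus 40 GB gap, here is a back-of-envelope estimate in Java. Everything in it is an assumption rather than a measurement: roughly 40 bytes per ItemItemSimilarity object (two long IDs and one double plus a JVM object header, rounded up to 8-byte alignment), 8 bytes per reference held by the collection, and at most 50 similarities for each of 4 million items. The class and method names are made up for illustration, and the estimate ignores any index structures the similarity implementation builds on top of the collection.

```java
public class SimilarityMemoryEstimate {

    // Upper bound on stored pairs: every row keeps at most perRow similarities.
    static long entries(long rows, int perRow) {
        return rows * perRow;
    }

    // Bytes needed if each pair is a separate heap object referenced from a collection.
    static long objectBytes(long entries, int bytesPerObject, int bytesPerReference) {
        return entries * (bytesPerObject + bytesPerReference);
    }

    // Bytes needed if pairs are stored in three parallel primitive arrays
    // (long[], long[], double[]), i.e. 8 + 8 + 8 bytes per pair.
    static long primitiveBytes(long entries) {
        return entries * (8 + 8 + 8);
    }

    public static void main(String[] args) {
        long n = entries(4_000_000L, 50);
        // 40 and 8 are assumed sizes for a typical 64-bit JVM, not measured values.
        System.out.println("pairs:            " + n);
        System.out.println("as objects:       " + objectBytes(n, 40, 8) + " bytes");
        System.out.println("as primitives:    " + primitiveBytes(n) + " bytes");
    }
}
```

Under these assumptions the plain objects alone would account for a large part of the growth, but not all of the 40 GB I observe.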