Convert Python collaborative filtering code to use MapReduce

Using Python, I calculate the cosine similarity between items.

Given event data representing purchases (user, item), I have a list of all the items bought by my users.

Given this input

    (user, item)
    X,1
    X,2
    Y,1
    Y,2
    Z,2
    Z,3

I am creating a python dictionary

    {1: ['X', 'Y'], 2: ['X', 'Y', 'Z'], 3: ['Z']}
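Building that dictionary from the raw events is a single pass; a minimal sketch, assuming the events arrive as (user, item) tuples:

    from collections import defaultdict

    events = [('X', 1), ('X', 2), ('Y', 1), ('Y', 2), ('Z', 2), ('Z', 3)]

    item_users = defaultdict(list)
    for user, item in events:
        item_users[item].append(user)
    # {1: ['X', 'Y'], 2: ['X', 'Y', 'Z'], 3: ['Z']}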

From this dictionary, I generate a bought / not-bought matrix, stored as another dictionary (bnb):

    {1: [1, 1, 0], 2: [1, 1, 1], 3: [0, 0, 1]}
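Concretely, each item's vector is a membership test over the full user list; a sketch continuing from the dictionary above (item_users is the hypothetical name from the previous snippet):

    users = sorted({u for us in item_users.values() for u in us})  # ['X', 'Y', 'Z']

    bnb = {item: [1 if u in us else 0 for u in users]
           for item, us in item_users.items()}
    # {1: [1, 1, 0], 2: [1, 1, 1], 3: [0, 0, 1]}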

From there, I calculate the similarity between items (1, 2) by taking the cosine between (1,1,0) and (1,1,1), which gives 0.816496.
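The cosine itself is just the dot product over the product of the magnitudes; a minimal sketch of the coSim helper used in the loop below, assuming plain numpy:

    import numpy as np

    def coSim(u, v):
        """Cosine similarity between two equal-length 0/1 purchase vectors."""
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    print(coSim([1, 1, 0], [1, 1, 1]))  # 0.8164965809277259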

I am doing it like this:

    items = [1, 2, 3]
    for item in items:
        for sub in items:
            if sub >= item:  # so as not to recompute the similarity for the reverse pair
                sim = coSim(bnb[item], bnb[sub])

I think the brute-force approach is killing me, and it only gets slower as the data grows. On my trusty laptop, this calculation runs for several hours with 8500 users and 3500 items.

I am trying to calculate the similarity for all items in my dict, and it takes longer than I would like. I think this is a good candidate for MapReduce, but I am having trouble thinking in terms of key/value pairs.

Or is this a problem with my approach, and not really a candidate for MapReduce at all?

1 answer

This is not really a MapReduce answer, but it should give you a significant speedup without much trouble.

I would use numpy to "vectorize" the operation and make your life easier. From there, you just need to loop over the dictionary and apply the vectorized function, comparing each item against all the others:

    import numpy as np

    def cosSim(User, OUsers):
        """Determine the cosine similarity between one row of bnb and all others.

        User is a single 0/1 array (one item's purchase vector).
        OUsers is a 2-D array holding the remaining items' vectors, one per row.
        Returns an array the size of OUsers with the similarity measures.
        """
        # dot product between this vector and every other row
        num = OUsers @ User
        # product of the vector magnitudes for each pair
        denom = np.linalg.norm(OUsers, axis=1) * np.linalg.norm(User)
        return num / denom

    bnb_items = np.array(list(bnb.values()), dtype=float)
    for num in range(len(bnb_items) - 1):
        sims = cosSim(bnb_items[num], bnb_items[num + 1:])

I have not tested this code, so there might be some dumb mistakes, but it should get you 90% of the way there.
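As a side note, once the rows are stacked into a single matrix, the entire item-item similarity matrix can be computed in one shot; a sketch under the same assumption that bnb holds equal-length 0/1 vectors:

    import numpy as np

    B = np.array(list(bnb.values()), dtype=float)    # one row per item
    norms = np.linalg.norm(B, axis=1)                # magnitude of each row
    sim_matrix = (B @ B.T) / np.outer(norms, norms)  # sim_matrix[i, j] = cosine(item i, item j)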

This alone should be a substantial speedup. If you still need more speed, there is a wonderful blog post that implements the Slope One recommendation system here.
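As for the key/value framing in the question: if you do eventually move to MapReduce, one common formulation for binary purchase data is to key on item pairs. The map step emits a count for every co-purchased pair (plus a per-item count for the norms), and the reduce step simply sums. A rough sketch with an in-memory driver (the function names are illustrative, not any particular framework's API):

    from collections import defaultdict
    from itertools import combinations
    from math import sqrt

    def map_basket(user, items):
        """Emit ((a, b), 1) per co-purchased item pair, plus ((a, a), 1) per item."""
        for a, b in combinations(sorted(items), 2):
            yield (a, b), 1
        for item in items:
            yield (item, item), 1

    def reduce_counts(pairs):
        """Sum the emitted counts for each key."""
        totals = defaultdict(int)
        for key, count in pairs:
            totals[key] += count
        return totals

    # in-memory driver over per-user baskets
    baskets = {'X': [1, 2], 'Y': [1, 2], 'Z': [2, 3]}
    counts = reduce_counts(kv for user, items in baskets.items()
                           for kv in map_basket(user, items))
    sim_1_2 = counts[(1, 2)] / sqrt(counts[(1, 1)] * counts[(2, 2)])  # 0.8164965...

This works because, for 0/1 vectors, the dot product is just the co-purchase count and each magnitude is the square root of an item's purchase count.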

Hope this helps, Will


Source: https://habr.com/ru/post/1310484/

