Using Python, I am computing the cosine similarity between items. Given event data that represents a purchase as a (user, item) pair, I have a list of all the items purchased by my users.
Given this input:

```
(user, item)
X,1
X,2
Y,1
Y,2
Z,2
Z,3
```
I build a Python dictionary mapping each item to the list of users who bought it:

```python
{1: ['X', 'Y'], 2: ['X', 'Y', 'Z'], 3: ['Z']}
```
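For completeness, this is roughly how I build it (a minimal sketch; the `events` list is a stand-in for my real event stream):

```python
from collections import defaultdict

# stand-in for the real purchase event stream
events = [('X', 1), ('X', 2), ('Y', 1), ('Y', 2), ('Z', 2), ('Z', 3)]

item_users = defaultdict(list)
for user, item in events:
    item_users[item].append(user)

# item_users == {1: ['X', 'Y'], 2: ['X', 'Y', 'Z'], 3: ['Z']}
```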
From this dictionary, I generate a bought/not-bought matrix, kept as another dictionary (bnb). Each item maps to a binary vector over the users (X, Y, Z), where 1 means that user bought the item:

```python
{1: [1, 1, 0], 2: [1, 1, 1], 3: [0, 0, 1]}
```
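Roughly like this (a sketch built on the `item_users` dictionary above; fixing a single user ordering, so every vector lines up, is the one assumption that matters):

```python
# fix one user ordering so every item's vector lines up
users = sorted({u for buyers in item_users.values() for u in buyers})  # ['X', 'Y', 'Z']

bnb = {}
for item, buyers in item_users.items():
    bought = set(buyers)
    bnb[item] = [1 if u in bought else 0 for u in users]

# bnb == {1: [1, 1, 0], 2: [1, 1, 1], 3: [0, 0, 1]}
```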
From there, I calculate the similarity between items (1, 2) as the cosine between [1, 1, 0] and [1, 1, 1], which gives 0.816496.
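The cosine itself is the standard formula (a minimal sketch; my real helper, called `cosine` below, does the same thing):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cosine([1, 1, 0], [1, 1, 1]))  # 0.8164965809277259
```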
This is how I am doing it:

```python
items = [1, 2, 3]
for item in items:
    for sub in items:
        if sub >= item:  # similarity is symmetric, so skip the mirrored pairs
            sim = cosine(bnb[item], bnb[sub])
```
I think the brute-force approach is killing me, and it can only get slower as the data grows: 3500 items means roughly 3500 × 3501 / 2 ≈ 6.1 million pairs, each one a dot product over 8500 users. On my trusty laptop, this calculation currently runs for several hours.
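One alternative I have been eyeing is computing the whole pair matrix in a single shot with NumPy instead of pair by pair (a sketch, assuming the dense items × users matrix fits in memory; `bnb` is the dictionary above):

```python
import numpy as np

# stack the bnb vectors into an items x users matrix
item_ids = sorted(bnb)
M = np.array([bnb[i] for i in item_ids], dtype=float)

# normalize each row once; a single matrix multiply then yields every pairwise cosine
unit = M / np.linalg.norm(M, axis=1, keepdims=True)
sims = unit @ unit.T
# sims[i, j] is the cosine similarity between item_ids[i] and item_ids[j]
```

One matrix multiply replaces millions of Python-level dot products, though it does nothing to distribute the work across machines.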
I am trying to calculate the similarity for every pair of items in my dict, and it is taking longer than I would like. I think this is a good candidate for MapReduce, but I am having trouble thinking in terms of key/value pairs.
Or, alternatively, is it a problem with my approach, and not necessarily a candidate for MapReduce?