How to calculate the similarity between lists of features?

Question

I have users and resources. Each resource is described by a set of features, and each user is associated with a different set of resources. In my particular case, the resources are web pages, and the features describe things such as the location of the visit, the time of the visit, the number of visits, etc., each of which is tied to a specific user.

I want to get a measure of similarity between my users based on these features, but I cannot find a way to combine the resource features. I have done this with text features, since documents can be concatenated before extracting features (say, TF-IDF), but I don't know how to proceed in this setting.

To be as clear as possible, here is what I have:

>>> len(user_features)
13  # the number of users
>>> user_features[0].shape
(2374, 17)  # 2374 documents for this user, and 17 features

I can get a document similarity matrix using, for example, Euclidean distances:

 >>> euclidean_distance(user_features[0], user_features[0])

But I do not know how to compare users with each other. I need to somehow combine the features to get an N_users x N_features matrix, but I don't know how to do this.

Any clues on how to proceed?

Additional information about the features that I use:

The features I have are not completely fixed. What I have so far is 13 different features, already aggregated from the “views”. I have the standard deviation, mean, etc. for each kind, in order to have something “flat” that can be compared. One of the features I have is: has the location changed since the last view? What about one hour ago? Two hours ago?

python numpy machine-learning

Alexis Métaireau Aug 12 '11 at 15:10

3 answers

You could take the mean of the features over each user's set of resources; that seems a natural way to summarize a user. numpy.mean with the appropriate axis argument gives you the mean, and you can then compute the Euclidean distance between the resulting "user vectors" (of length n_features), as you did before between document vectors.
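A minimal sketch of this averaging approach, assuming user_features is a list of (n_documents, n_features) arrays as in the question. The random data below is a hypothetical stand-in, and the pairwise distances are computed with plain NumPy broadcasting rather than any particular library helper:

```python
import numpy as np

# Hypothetical stand-in data: 3 users, each with a different number of
# documents (rows) and 17 features (columns), mirroring the question's layout.
rng = np.random.default_rng(0)
user_features = [rng.random((n_docs, 17)) for n_docs in (5, 8, 3)]

# Collapse each user's (n_docs, n_features) matrix into one mean vector.
user_vectors = np.vstack([f.mean(axis=0) for f in user_features])  # (n_users, n_features)

# Pairwise Euclidean distances between the user vectors via broadcasting.
diff = user_vectors[:, None, :] - user_vectors[None, :, :]
user_distances = np.sqrt((diff ** 2).sum(axis=-1))  # (n_users, n_users)
```

Here user_distances[i, j] is the distance between user i and user j. If the features live on very different scales, standardizing the columns before averaging is probably worth considering, since otherwise the largest-valued feature dominates the distance.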

Fred foo Aug 12 '11 at 15:29

I would look at creating several dimensions for the documents, so that documents visited at certain times of the day are split into morning and night, and then pick out users who are extreme night owls or early birds.

With any number of such dimensions, you can build a user matrix and use the distances between users.
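One way this idea might look in code, as a rough sketch only. The visit_hours data, the time_of_day_profile helper, and the bin boundaries (morning 06:00-12:00, night 22:00-06:00) are all hypothetical choices made for illustration, not something given in the question:

```python
import numpy as np

# Hypothetical data: hour of day (0-23) for each visit, one array per user.
visit_hours = [
    np.array([7, 8, 9, 23]),
    np.array([22, 23, 1, 2]),
    np.array([6, 7, 8, 10]),
]

def time_of_day_profile(hours):
    """Fraction of visits in the morning and at night (arbitrary bin edges)."""
    hours = np.asarray(hours)
    morning = ((hours >= 6) & (hours < 12)).mean()
    night = ((hours >= 22) | (hours < 6)).mean()
    return np.array([morning, night])

# One row per user, one column per time-of-day dimension.
user_matrix = np.vstack([time_of_day_profile(h) for h in visit_hours])

# Pairwise Euclidean distances between users in this small space.
diff = user_matrix[:, None, :] - user_matrix[None, :, :]
user_distances = np.sqrt((diff ** 2).sum(axis=-1))
```

With this toy data, the first and third users (mostly morning visits) end up closer to each other than either is to the second user (night visits only). More dimensions can be added as extra columns of user_matrix in the same way.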

you cad sir - take that Aug 12 '11 at 15:34

Ruggiero spearman · Accepted Answer · 2011-08-12T16:11:28+0000

If each user is represented as a set of document-interaction vectors, you can define the similarity between two users as the similarity between the two sets of document-interaction vectors that represent them.

You say you can get a document similarity matrix. Suppose user U1 visited documents D1, D2, D3 and user U2 visited documents D1, D3, D4. You then have two sets of vectors: S1 = {U1(D1), U1(D2), U1(D3)} for user 1 and S2 = {U2(D1), U2(D3), U2(D4)} for user 2. Note that each user's interaction with a document is different, so U1(D1) and U2(D1) are distinct vectors. If I understand correctly, the elements of these sets correspond to the rows of each user's matrix.

The similarity between two sets can be computed in several ways. One option is the average pairwise similarity: iterate over all pairs of elements, one from each set, compute the similarity of each pair, and average over all pairs.
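A minimal sketch of the average pairwise option, assuming Euclidean distance as the per-pair measure (so smaller values mean more similar users) and the list-of-arrays layout from the question. The helper name mean_pairwise_distance and the random data are hypothetical:

```python
import numpy as np

def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (one row of a, one row of b)."""
    diff = a[:, None, :] - b[None, :, :]  # (len(a), len(b), n_features)
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()

# Hypothetical stand-in data: 3 users with 5, 8 and 3 documents, 17 features each.
rng = np.random.default_rng(0)
user_features = [rng.random((n_docs, 17)) for n_docs in (5, 8, 3)]

# Compare every user's set of document vectors with every other user's set.
n_users = len(user_features)
user_distances = np.array([
    [mean_pairwise_distance(user_features[i], user_features[j])
     for j in range(n_users)]
    for i in range(n_users)
])
```

Unlike a distance between mean vectors, the diagonal here is generally not zero, because a user's own documents differ from one another. With thousands of documents per user, the intermediate diff array becomes large, so computing the distances in chunks may be necessary.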
