How should NaN be handled in a user-user Pearson correlation similarity matrix for a recommender system?

I am creating a user-user similarity matrix from user rating data (specifically, the MovieLens 100K dataset). Computing the correlation produces some NaN values. I tested on a smaller dataset:

User-Item Rating Matrix

        I1  I2  I3  I4
  U1     4   0   5   5
  U2     4   2   1   0
  U3     3   0   2   4
  U4     4   4   0   0

User-User Pearson Correlation matrix

        U1         U2         U3         U4         U5
  U1    1          -1         0          -nan       0.755929
  U2    -1         1          1          -nan       -0.327327
  U3    0          1          1          -nan       0.654654
  U4    -nan       -nan       -nan       -nan       -nan
  U5    0.755929   -0.327327  0.654654   -nan       1

To calculate the Pearson correlation, only the items co-rated by the two users are taken into account. (See "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions", Gediminas Adomavicius and Alexander Tuzhilin.)

How can I handle NaN values?

EDIT: Here is the code I use to compute the Pearson correlation in R. The matrix R is the user-item rating matrix. It contains ratings on a 1 to 5 scale; 0 means that there is no rating. S is the user-user correlation matrix.

  for (i in 1:nrow(R)) {
    cat("user: ", i, "\n")
    for (k in 1:nrow(R)) {
      if (i != k) {
        # Items rated by both users (0 means "not rated").
        corated_list <- which(R[i, ] != 0 & R[k, ] != 0)
        # Deviations from each user's mean over the co-rated items.
        ui <- R[i, corated_list] - mean(R[i, corated_list])
        uk <- R[k, corated_list] - mean(R[k, corated_list])
        temp <- sum(ui * uk) / sqrt(sum(ui^2) * sum(uk^2))
        # A NaN result (0/0) is replaced with 0 here.
        S[i, k] <- ifelse(is.nan(temp), 0, temp)
      } else {
        S[i, k] <- 0
      }
    }
  }

Note that in the line S[i,k] <- ifelse (is.nan (temp), 0, temp) I replace NaN with 0.

1 answer

I recently developed a recommender system in Java for user-user and user-item matrices. First, as you have probably already found, recommender systems are hard. For my implementation I used the Apache Commons Math library, which is fantastic; you are using R, which probably calculates Pearson's correlation in much the same way.

Your question was how to handle NaN values, followed by an edit saying that you set NaN = 0.

My answer is this:

You should not treat NaN values as 0, because by doing so you are saying that there is absolutely no correlation between users or users/items. That may be the case, but most likely it is not always the case. Ignoring this will skew your recommendations.

First, you should ask yourself: "Why am I getting NaN values?" The NaN Wikipedia page lists the reasons why you can get a NaN value:

There are three kinds of operations that can return NaN:

  • Operations with a NaN as at least one operand.

  • Indeterminate forms:
      • The divisions 0/0 and ±∞/±∞
      • The multiplications 0 × ±∞ and ±∞ × 0
      • The additions ∞ + (-∞), (-∞) + ∞ and equivalent subtractions
      • The standard has alternative functions for powers: the standard pow function and the integer exponent pown function define 0^0, 1^∞ and ∞^0 as 1. The powr function defines all three indeterminate forms as invalid operations and so returns NaN.

  • Real operations with complex results, for example:
      • The square root of a negative number
      • The logarithm of a negative number
      • The inverse sine or cosine of a number that is less than -1 or greater than +1

You need to debug your application and step through it to see which of the above is the culprit.
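As a quick way to check, the sketch below repeats the question's calculation for a single pair of users (U1 and U4 from the small dataset) and prints the intermediate values. The variable names follow the question's code; `numerator` and `denominator` are names introduced here for illustration only.

```r
# Rating matrix from the question; 0 means "not rated".
R <- matrix(c(4, 0, 5, 5,
              4, 2, 1, 0,
              3, 0, 2, 4,
              4, 4, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("U", 1:4), paste0("I", 1:4)))

i <- 1  # U1
k <- 4  # U4

# Items rated by both users.
corated_list <- which(R[i, ] != 0 & R[k, ] != 0)

# Deviations from each user's mean over the co-rated items.
ui <- R[i, corated_list] - mean(R[i, corated_list])
uk <- R[k, corated_list] - mean(R[k, corated_list])

numerator   <- sum(ui * uk)
denominator <- sqrt(sum(ui^2) * sum(uk^2))

print(corated_list)               # only I1 is co-rated
print(c(numerator, denominator))  # both are 0
print(numerator / denominator)    # 0 / 0, which R reports as NaN
```

With a single co-rated item, each rating equals its own mean, so every deviation is 0 and the result is the 0/0 case from the list above.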

Second, understand that Pearson's correlation can be written in several ways, and you need to think about whether you are calculating it from a sample or from a population, and then pick the matching formula. For a sample, for example:

cor (X, Y) = Σ [(xi - E (X)) (yi - E (Y))] / [(n - 1) s (X) s (Y)]

where E (X) and E (Y) are the mean values of X and Y, and s (X), s (Y) are the sample standard deviations. The standard deviation is the positive square root of the variance, and

variance = Σ (xi - mean) ^ 2 / (n - 1)

where mean is the mean value and n is the number of sample observations.
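To make the formula above concrete, here is a minimal sketch that writes it out step by step in R and compares it with the built-in `cor()`. The function name `pearson_sample` is hypothetical; the rating vectors are U1 and U3 on their three co-rated items (I1, I3, I4) from the question.

```r
# Pearson correlation written out from the (n - 1) formula above.
pearson_sample <- function(x, y) {
  n <- length(x)
  s_x <- sqrt(sum((x - mean(x))^2) / (n - 1))  # standard deviation of x
  s_y <- sqrt(sum((y - mean(y))^2) / (n - 1))  # standard deviation of y
  sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * s_x * s_y)
}

# U1 and U3 on their co-rated items I1, I3, I4.
u1 <- c(4, 5, 5)
u3 <- c(3, 2, 4)

print(pearson_sample(u1, u3))  # hand-written formula
print(cor(u1, u3))             # built-in Pearson correlation

# With a single observation, n - 1 is 0 and the result is NaN.
print(pearson_sample(4, 4))
```

The last call shows how the formula breaks down for a pair of users with only one item in common.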

This is probably where your NaNs appear, i.e. division by 0 for unrated items. If you can, do not use the value 0 to indicate "not rated"; use null instead (NA in R). I would do this for two reasons: 1. the 0 is probably what is polluting your results with NaNs, and 2. readability/understanding. Your scale is 1 - 5, so 0 should not appear at all; it throws things off. Avoid it if possible.
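In R, the natural "null" for a missing rating is `NA`. A minimal sketch, assuming the question's small matrix with every 0 replaced by `NA`: base R's `cor()` with `use = "pairwise.complete.obs"` then computes each correlation over the items both users rated. Pairs where the correlation is undefined come back as `NA` rather than a number.

```r
# Same ratings as in the question, with NA instead of 0 for "not rated".
R <- matrix(c(4, NA, 5,  5,
              4,  2, 1, NA,
              3, NA, 2,  4,
              4,  4, NA, NA),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("U", 1:4), paste0("I", 1:4)))

# cor() correlates columns, so transpose to put users in columns.
# suppressWarnings() hides the "standard deviation is zero" warnings
# that cor() raises for pairs with constant ratings.
S <- suppressWarnings(cor(t(R), use = "pairwise.complete.obs"))

print(round(S, 3))
```

Every entry involving U4 is `NA` here, which keeps "undefined" visibly distinct from a genuine correlation of 0.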

Third, think about this from the recommender's point of view. Suppose two users have only 1 rating in common, say U1 and U4 for I1 in your smaller dataset. Is that one item really enough to base recommendations on? Of course not. So I would also suggest that you set a minimum threshold on the number of ratings in common to ensure better quality recommendations. The minimum you can set for this threshold is 2, but consider setting it a little higher. If you read the MovieLens research, they set it somewhere between 5 and 10 (I don't remember exactly, off the top of my head). The higher you set it, the fewer similarity scores you will get, but the "better" the recommendations will be (lower error). You have probably read the academic literature and come across this point already, but I thought I would mention it anyway.
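One way such a threshold could look in R is sketched below. The function name `similarity_with_threshold` and the parameter `min_common` are hypothetical, and the sketch assumes missing ratings are stored as `NA`. Pairs below the threshold, or with constant ratings, are left as `NA` instead of being forced to 0.

```r
# User-user Pearson similarity, computed only for pairs with at least
# `min_common` co-rated items. All other entries stay NA.
similarity_with_threshold <- function(R, min_common = 2) {
  n_users <- nrow(R)
  S <- matrix(NA_real_, n_users, n_users,
              dimnames = list(rownames(R), rownames(R)))
  for (i in seq_len(n_users)) {
    for (k in seq_len(n_users)) {
      if (i == k) next  # the diagonal is left as NA in this sketch
      common <- which(!is.na(R[i, ]) & !is.na(R[k, ]))
      # Pearson needs at least 2 points, whatever the caller asks for.
      if (length(common) < max(min_common, 2)) next
      x <- R[i, common]
      y <- R[k, common]
      # Undefined when either user gave identical ratings to all items.
      if (sd(x) == 0 || sd(y) == 0) next
      S[i, k] <- cor(x, y)
    }
  }
  S
}

R <- matrix(c(4, NA, 5,  5,
              4,  2, 1, NA,
              3, NA, 2,  4,
              4,  4, NA, NA),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("U", 1:4), paste0("I", 1:4)))

S2 <- similarity_with_threshold(R, min_common = 2)
S3 <- similarity_with_threshold(R, min_common = 3)
print(round(S2, 3))
print(round(S3, 3))
```

On this tiny dataset, raising `min_common` to 3 leaves only the U1/U3 pair, which illustrates the trade-off between the number of scores and their reliability.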

Following on from the point above, look at U4 and compare it with the other users. Note that U4 has only one item in common with U1 and with U3, and its two items in common with U2 both carry the same rating (4), so U4's ratings have zero variance there. Now, I hope you will notice that the NaNs appear exclusively with U4. If you have followed this answer, you will hopefully see that the reason you get NaNs is that you cannot actually calculate Pearson on just 1 item :).

Finally, one thing that bothers me a bit about the example dataset above is the number of correlations that are 1 and -1. Think about what these values actually say about the users' preferences, and then check them against the actual ratings. For instance, look at the ratings of U1 and U2: for item 1 they agree completely (both rated it 4), while for item 3 they disagree strongly (U1 rated it 5, U2 rated it 1). It seems strange that the Pearson correlation between these two users is -1 (i.e. their preferences are completely opposite). This is clearly not the case; the Pearson score should really be a little above or a little below 0. This problem ties back to using 0 on the scale, and to comparing only a small number of items.

Now, there are strategies for "filling in" items that users have not rated. I am not going to go into them, you will need to read up on that, but essentially they use something like the average rating for that item or the average rating of that user. Both methods have their drawbacks, and personally I don't really like either of them. My advice is to calculate Pearson correlations between users only when they have 5 or more items in common, and to ignore items where the rating is 0 (or better, null).
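Purely to make the two fill-in ideas concrete (not as a recommendation), here is a minimal sketch in R, assuming missing ratings are stored as `NA`. The names `filled_item` and `filled_user` are illustrative.

```r
R <- matrix(c(4, NA, 5,  5,
              4,  2, 1, NA,
              3, NA, 2,  4,
              4,  4, NA, NA),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("U", 1:4), paste0("I", 1:4)))

# Row and column positions of the missing ratings.
missing <- which(is.na(R), arr.ind = TRUE)

# Strategy 1: fill each gap with the average rating of that item.
item_means  <- colMeans(R, na.rm = TRUE)
filled_item <- R
filled_item[missing] <- item_means[missing[, "col"]]

# Strategy 2: fill each gap with the average rating of that user.
user_means  <- rowMeans(R, na.rm = TRUE)
filled_user <- R
filled_user[missing] <- user_means[missing[, "row"]]

print(round(filled_item, 2))
print(round(filled_user, 2))
```

Either way, the filled-in values are estimates, not ratings the users actually gave, which is one source of the drawbacks mentioned above.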

So, to conclude.

  • NaN is not 0, so do not set it to 0.
  • 0 on your scale is better represented as null.
  • You should only calculate Pearson correlations when the number of items shared between two users is > 1, preferably at least 5 to 10.
  • Only calculate the Pearson correlation for two users over the items they have both rated; do not include items that one of the two users has not rated.

Hope this helps and good luck.


Source: https://habr.com/ru/post/920202/
