Handling incomplete data (data sparseness) in kNN

I am trying to create a simple recommendation system using kNN.

Suppose I have a table:

User | Book1 | Book2 | Book3 | Book4 | Book5 | Book6 | Book7
-----|-------|-------|-------|-------|-------|-------|------
  1  |   5   |   ?   |   3   |   ?   |   4   |   3   |   2
  2  |   3   |   4   |   ?   |   2   |   3   |   4   |   2
  3  |   4   |   2   |   1   |   ?   |   ?   |   3   |   3
  4  |   2   |   5   |   3   |   ?   |   4   |   1   |   1
  5  |   1   |   1   |   4   |   3   |   1   |   ?   |   1
  6  |   5   |   2   |   5   |   4   |   4   |   2   |   ?

So, to find the possible ratings for User 1, I thought I would just take the absolute difference between User 1's ratings and the other users' ratings for the books they have both read, and then use that difference to find out which user in this list is "closest" to User 1. But in a real situation there will be many more ?/unknown entries. So how can I handle these unknown ratings when using kNN?

I have no code, since I have not figured out how to implement this yet.

Any help is appreciated!

+6
4 answers

You do not have "unknown features"; you have incomplete data points.

This is actually a well-known problem in kNN, and there is a thoroughly validated approach for dealing with it.

Although the problem is actually an "incomplete data" problem, in the context of kNN it is often (usually?) called the sparseness problem.

In practice, the sparseness problem is the key issue in building kNN models, with the possible exception of efficient storage/retrieval of the data that makes up the model.

For example, consider Amazon.com's recommendation engine, in which product ratings are the features (the columns) and users are the rows. For this matrix to be 100% complete, every Amazon customer would have to buy and review every product Amazon sells. The actual sparseness of this matrix must be > 95%.

The most common technique (and still the state of the art as far as I know) is known as NNMA, or non-negative matrix approximation. This technique is also often referred to, incorrectly, as NNMF, in which F stands for factorization. (NNMA is based on a factorization technique, but the result is not a factorization of the original data matrix.) I mention this because the alternative term, although incorrect, is widely used, so I would include it in my search queries.

In essence, this technique can be used to remove sparseness from a matrix, or, put another way, to fill in the missing cells (i.e., the customer in row R has not rated the product in column C).

You can find a complete implementation of NNMA, including an accompanying tutorial (in Python + NumPy), on Albert Au Yeung Ching-man's blog.

In addition, there are several Python packages (available via PyPI) that contain packaged code for NNMA. I have used only one of them, PyMF, which you can find on Google Code.

So you can see how NNMA works its magic, here is my simple but complete implementation of NNMA in Python + NumPy:

    import numpy as NP

    def cf(q, v):
        """ the cost function (sum of squared differences) """
        qv = (q - v)**2
        return NP.sum(NP.sum(qv, axis=0))

    def nnma(d, max_iter=100):
        x, y = d.shape
        z = y
        w = NP.random.rand(x, y)
        h = NP.random.rand(y, z)
        for i in range(max_iter):
            wh = NP.dot(w, h)
            cost = cf(d, wh)
            if cost == 0:
                break
            # multiplicative update for h
            hn = NP.dot(w.T, d)
            hd = NP.dot(NP.dot(w.T, w), h)
            h *= hn/hd
            # multiplicative update for w
            wn = NP.dot(d, h.T)
            wd = NP.dot(NP.dot(w, h), h.T)
            w *= wn/wd
        return NP.dot(w, h)

To use this NNMA function, simply pass in a 2D array (matrix) with a "0" for every missing cell (in other words, your data matrix with "0" inserted for each missing value):

    >>> d    # the original (sparse) data matrix with missing cells denoted by "0"s
    array([[ 7.,  0.,  4.,  7.,  0.,  1.],
           [ 3.,  9.,  7.,  3.,  1.,  7.],
           [ 4.,  4.,  3.,  7.,  3.,  9.],
           [ 4.,  8.,  0.,  9.,  2.,  1.],
           [ 6.,  3.,  9.,  5.,  9.,  3.],
           [ 6.,  1.,  4.,  4.,  1.,  0.],
           [ 0.,  4.,  8.,  6.,  0.,  5.],
           [ 9.,  0.,  6.,  0.,  5.,  2.],
           [ 6.,  8.,  4.,  6.,  3.,  7.],
           [ 3.,  6.,  3.,  8.,  7.,  2.]])

    >>> d1 = nnma(d)    # call nnma, passing in the original data matrix

    >>> d1   # the approximated data matrix with all missing values populated
    array([[ 6.998,  0.29 ,  3.987,  7.008,  0.292,  0.796],
           [ 2.989,  8.92 ,  6.994,  3.02 ,  1.277,  7.053],
           [ 4.007,  4.496,  2.999,  7.01 ,  3.107,  8.695],
           [ 4.005,  8.019,  0.254,  9.002,  1.917,  0.89 ],
           [ 5.998,  3.014,  9.001,  4.991,  8.983,  3.052],
           [ 5.992,  1.077,  4.007,  3.976,  0.753,  0.464],
           [ 0.346,  3.436,  7.993,  5.988,  0.194,  5.355],
           [ 9.001,  0.124,  5.997,  0.375,  5.02 ,  1.867],
           [ 6.   ,  7.994,  3.998,  6.   ,  2.999,  7.009],
           [ 2.995,  6.022,  3.001,  7.987,  6.939,  2.185]])

So, as you can see, the results are not bad at all, especially for such a simple implementation. All the missing elements are filled in, and the remaining values are close to their corresponding values in the original data matrix; for example, column 0, row 0 is 7.0 in the original data matrix and 6.998 in the approximation.

+9

The piece you are missing is the method for measuring distances. Pearson correlation is one of the most widely used; cosine distance is another. The L1 distance (the sum of absolute differences) usually does not give good results.

If you google around, you will find that the recommended way to handle missing values depends on the similarity measure you are using. With Pearson correlation, only the books rated by both users are used to measure the correlation, so missing values are simply ignored. This makes sense: if only a small fraction of the books read by two users is shared, that most likely implies they have different tastes. With cosine distance, missing values can be taken as zero.
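
For concreteness, here is a rough sketch of both ideas (an illustration only; the two rating lists are User 1 and User 2 from the question's table, with None marking an unknown rating):

    import numpy as np

    # User 1 and User 2 from the question's table; None marks an unknown rating
    user1 = [5, None, 3, None, 4, 3, 2]
    user2 = [3, 4, None, 2, 3, 4, 2]

    def pearson_on_overlap(a, b):
        """Pearson correlation over the books rated by both users; unknowns ignored."""
        pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if len(pairs) < 2:
            return 0.0                      # not enough overlap to compare
        xs, ys = (np.array(t, dtype=float) for t in zip(*pairs))
        if xs.std() == 0 or ys.std() == 0:
            return 0.0                      # correlation undefined for constant ratings
        return float(np.corrcoef(xs, ys)[0, 1])

    def cosine_with_zeros(a, b):
        """Cosine similarity, treating unknown ratings as zero."""
        xs = np.array([0.0 if x is None else x for x in a])
        ys = np.array([0.0 if y is None else y for y in b])
        return float(xs @ ys / (np.linalg.norm(xs) * np.linalg.norm(ys)))

    print(pearson_on_overlap(user1, user2))
    print(cosine_with_zeros(user1, user2))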

Another widely used approach is to impute the missing values. You could, for example, first use Pearson to find the similarity between books, and then, for each person, predict their missing ratings.
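
A rough sketch of that prediction step (the book-to-book similarity numbers below are made up purely for illustration; in practice they would come from Pearson computed column-against-column):

    # hypothetical similarities between Book2 and the books User 1 has rated
    sim_to_book2 = {"Book1": 0.6, "Book3": 0.2, "Book5": 0.5, "Book6": 0.4, "Book7": 0.1}

    # User 1's known ratings from the question's table
    user1_ratings = {"Book1": 5, "Book3": 3, "Book5": 4, "Book6": 3, "Book7": 2}

    def predict_rating(ratings, sims):
        """Similarity-weighted average of the user's own ratings on similar books."""
        num = sum(sims[b] * r for b, r in ratings.items() if b in sims)
        den = sum(abs(sims[b]) for b in ratings if b in sims)
        return num / den if den else None

    print(predict_rating(user1_ratings, sim_to_book2))   # estimated rating for Book2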

+3

kNN is usually sensitive to the number of features. In real life I expect you will have many more books.

I would try to change the feature space: instead of having a feature per book, it might be worth exploring the use of book lists as features.

    Feature1 = { books with score 1 }
    Feature2 = { books with score 2 }
    ...

Now you can define a distance for each feature, perhaps by using recall and precision between the corresponding lists of the two users.

Another advantage of this approach is that you can easily add weights to the features; perhaps the list of books rated 5 is more informative than the list of books rated 3?

The disadvantage is clear: you will not get any credit for near-misses, e.g. if users A and B rated the same book 4 and 5 respectively. However, this can also be addressed by adding another feature that compares these lists between the two users.
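
Something along these lines, as an untested sketch (the per-score book sets come from the question's table; the F1-style overlap measure and the weights are arbitrary choices for illustration):

    # books each user rated with each score, taken from the question's table
    user1 = {5: {"Book1"}, 4: {"Book5"}, 3: {"Book3", "Book6"}, 2: {"Book7"}}
    user3 = {4: {"Book1"}, 3: {"Book6", "Book7"}, 2: {"Book2"}, 1: {"Book3"}}

    # arbitrary weights: the high-rating lists are assumed to be more informative
    weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.5, 5: 2.0}

    def overlap(a, b):
        """Precision/recall-style (F1) agreement between two book sets."""
        common = len(a & b)
        if not a or not b or common == 0:
            return 0.0
        precision, recall = common / len(a), common / len(b)
        return 2 * precision * recall / (precision + recall)

    def user_distance(u, v):
        """Weighted sum of per-score-level disagreements."""
        scores = set(u) | set(v)
        return sum(weights.get(s, 1.0) * (1 - overlap(u.get(s, set()), v.get(s, set())))
                   for s in scores)

    print(user_distance(user1, user3))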

Disclaimer: I have never tested this approach, and I have no idea how it will behave, but I think it is worth exploring. I don't think there is any good way to tell whether this suggestion will give good results other than empirical testing, which can be done using cross-validation on your training set.

+2

My suggestion is that you use singular value decomposition (SVD) to perform a rank reduction on your dataset. This fills in the missing values, effectively reducing the number of features each book contributes. It is very similar to latent semantic analysis; look it up.
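
A minimal sketch of that idea, assuming the unknown cells are first filled with each user's mean rating before the truncated SVD (the rank k = 2 is an arbitrary choice):

    import numpy as np

    # the ratings matrix from the question, with np.nan for the unknown cells
    R = np.array([[5, np.nan, 3, np.nan, 4, 3, 2],
                  [3, 4, np.nan, 2, 3, 4, 2],
                  [4, 2, 1, np.nan, np.nan, 3, 3],
                  [2, 5, 3, np.nan, 4, 1, 1],
                  [1, 1, 4, 3, 1, np.nan, 1],
                  [5, 2, 5, 4, 4, 2, np.nan]])

    # crude initialization: replace unknowns with each user's mean rating
    filled = np.where(np.isnan(R), np.nanmean(R, axis=1, keepdims=True), R)

    # rank-k reconstruction via truncated SVD
    k = 2
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    print(np.round(approx, 2))   # low-rank estimate, including the formerly unknown cells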

0

Source: https://habr.com/ru/post/915033/
