Java: grouping similar values

First of all, thanks for reading my question.

I computed TF-IDF weights and then used those values to calculate the cosine similarity between documents, to see which documents are most similar. You can see the resulting matrix below. The columns are doc1, doc2, doc3, and so on, and the rows are the same doc1, doc2, doc3, etc. From this matrix I see that doc1 and doc4 have a 72% similarity (0.722711142). That is correct: when I look at both documents, they really are similar. I have 1000 documents, and I look at every value in the matrix to see which ones are similar. I tried different clustering methods, such as k-means and AGNES (hierarchical), to group them. They produced clusters. For example, Cluster1 contains (doc4, doc5, doc3) because their values (0.722711142, 0.602301766, 0.69912109 respectively) are close to each other. But when I check manually whether these three documents are really alike, they are NOT. :( What am I doing wrong, or do I need to use something other than clustering?

    1             0.067305859  -0.027552299   0.602301766   0.722711142    
    0.067305859   1             0.048492904   0.029151952  -0.034714695 
   -0.027552299   0.748492904   1             0.610617214   0.010912109    
    0.602301766   0.029151952  -0.061617214   1             0.034410392    
    0.722711142  -0.034714695   0.69912109    0.034410392   1            
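One way to see why a cluster can look wrong on manual inspection is to check the similarity of every pair inside it, not only each member's similarity to some common document. The following is a minimal sketch, not part of the original post; the class name `ClusterCheck` and the method `minPairwiseSimilarity` are hypothetical, and the matrix is copied as printed above (the asker notes the values may be inaccurate).

```java
import java.util.List;

final class ClusterCheck {
    // Matrix copied verbatim from the question; rows and columns are doc1..doc5.
    // The asker says the values may be wrong, so treat them as illustrative only.
    static final double[][] SIM = {
        { 1.0,          0.067305859, -0.027552299, 0.602301766,  0.722711142},
        { 0.067305859,  1.0,          0.048492904, 0.029151952, -0.034714695},
        {-0.027552299,  0.748492904,  1.0,         0.610617214,  0.010912109},
        { 0.602301766,  0.029151952, -0.061617214, 1.0,          0.034410392},
        { 0.722711142, -0.034714695,  0.69912109,  0.034410392,  1.0}
    };

    /** Smallest pairwise similarity among cluster members (0-based indices). */
    static double minPairwiseSimilarity(double[][] sim, List<Integer> members) {
        double min = Double.POSITIVE_INFINITY;
        for (int a = 0; a < members.size(); a++) {
            for (int b = a + 1; b < members.size(); b++) {
                int i = members.get(a);
                int j = members.get(b);
                // Take the weaker direction in case the matrix is not symmetric.
                min = Math.min(min, Math.min(sim[i][j], sim[j][i]));
            }
        }
        return min;
    }

    public static void main(String[] args) {
        // doc3, doc4, doc5 are indices 2, 3, 4.
        double weakest = minPairwiseSimilarity(SIM, List.of(2, 3, 4));
        System.out.println("Weakest pair in (doc3, doc4, doc5): " + weakest);
    }
}
```

Taking the matrix as printed, doc4 and doc5 score only 0.034410392 against each other, so a cluster containing both can be expected to look incoherent. A true cosine similarity matrix is also symmetric, which the printed one is not, so the matrix itself may be worth rechecking.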

PS: The values may be wrong; they are only meant to give you an idea. If you have any questions, please ask. Thanks.

2 answers

I am not familiar with TF-IDF, but the process can go wrong at many stages:

1. Have you removed the stop words?

2. Have you applied stemming? The Porter stemmer, for example.

3. Do you normalize the frequencies by document length? (Maybe TF-IDF already handles this, I don't know.)

4. Clustering is an exploratory method, not the holy grail. The documents it puts in one group may be more or less related, but that depends on the data, the settings, the clustering algorithm, etc.
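The preprocessing points above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the asker's actual pipeline: the class `TfIdfSketch`, its methods, and the tiny stop-word list are all hypothetical, the IDF variant is the plain `log(N / df)`, and stemming is left out because the standard library has no stemmer (a Porter stemmer from a library of your choice would be applied inside `tokenize`).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TfIdfSketch {
    // Tiny illustrative stop-word list (assumption); a real list is much longer.
    static final Set<String> STOP_WORDS =
        Set.of("the", "a", "an", "and", "of", "to", "in", "is");

    /** Lower-cases, splits on non-alphanumerics, and drops stop words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t); // a stemmer would be applied to t here
            }
        }
        return tokens;
    }

    /** Builds one TF-IDF vector per document, scaled to unit (L2) length. */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: number of documents containing each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> vec = new HashMap<>();
            for (String term : doc) {
                vec.merge(term, 1.0, Double::sum); // raw term count
            }
            double sumSq = 0.0;
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                e.setValue(e.getValue() * idf);
                sumSq += e.getValue() * e.getValue();
            }
            // L2 normalization removes the effect of document length.
            double norm = Math.sqrt(sumSq);
            if (norm > 0) {
                for (Map.Entry<String, Double> e : vec.entrySet()) {
                    e.setValue(e.getValue() / norm);
                }
            }
            vectors.add(vec);
        }
        return vectors;
    }

    /** Cosine similarity of two unit-length vectors, i.e. their dot product. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return dot;
    }
}
```

With non-negative term counts and this IDF, every weight is non-negative, so the cosine similarity always falls between 0 and 1. Negative entries in a similarity matrix would therefore suggest that some other step (for example, centering the values) was applied along the way.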

What do you want to achieve? What is your setup? Good luck!


, , . , ( ) -. . , tf * idf, , . , .. , , , .


Source: https://habr.com/ru/post/1731891/

