First of all, thanks for reading my question.
I computed TF-IDF weights for my documents and then calculated the cosine similarity between them to see which documents are most similar to each other. The resulting matrix is shown below. The columns are doc1, doc2, doc3, etc., and the rows are the same doc1, doc2, doc3, etc. From this matrix I can see that doc1 and doc4 have a similarity of about 72% (0.722711142). This is correct: when I read both documents, they really are similar.

I have 1000 documents, and I look at every document's values in the matrix to see which ones are similar. To group them, I tried different clustering methods, such as k-means and AGNES (hierarchical). They produced clusters. For example, Cluster1 contains (doc4, doc5, doc3), because their values (0.722711142, 0.602301766, 0.69912109, respectively) are close to each other. But when I check manually whether these 3 documents are really similar, they are NOT. :(

What am I doing wrong, or do I need to use something other than clustering?
        doc1          doc2          doc3          doc4          doc5
doc1    1             0.067305859   -0.027552299  0.602301766   0.722711142
doc2    0.067305859   1             0.048492904   0.029151952   -0.034714695
doc3    -0.027552299  0.748492904   1             0.610617214   0.010912109
doc4    0.602301766   0.029151952   -0.061617214  1             0.034410392
doc5    0.722711142   -0.034714695  0.69912109    0.034410392   1
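In case it helps to make the setup concrete, here is a minimal Python sketch of the pipeline described above: TF-IDF weights, a cosine similarity matrix, and the manual check of every pair inside one cluster. It is not my actual code. The toy corpus and the function names (`tfidf_vectors`, `cosine`, `similarity_matrix`, `within_cluster_similarities`) are made up for illustration, and the `tf * log(N / df)` weighting is an assumption about one common TF-IDF variant.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


def similarity_matrix(docs):
    """Square matrix where entry [i][j] is the similarity of doc i and doc j."""
    vectors = tfidf_vectors(docs)
    return [[cosine(u, v) for v in vectors] for u in vectors]


def within_cluster_similarities(sim, members):
    """Similarity of every pair of documents (row indices) in one cluster."""
    return {(i, j): sim[i][j] for i, j in combinations(sorted(members), 2)}


# Hypothetical toy corpus, only for illustration.
docs = [
    "apple banana fruit",
    "apple banana fruit salad",
    "car engine wheel",
    "car engine repair",
]
sim = similarity_matrix(docs)

# Pretend a clustering step put documents 0, 1 and 2 into one cluster,
# then look at every pair inside it instead of a single row of the matrix.
pairs = within_cluster_similarities(sim, [0, 1, 2])
for (i, j), value in pairs.items():
    print(f"doc{i + 1} vs doc{j + 1}: {value:.3f}")
```

Listing every pair inside a cluster like this is the same check I did by hand: a cluster only looks right to me if all of its pairs have a high similarity, not just some of them.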
PS: The values in the matrix may not be exact; they are only meant to give you an idea. If you have any questions, please ask. Thanks!