How does k-means know how to cluster documents when we only feed it tf-idf vectors of individual words?

I am using the KMeans algorithm from scikit-learn to cluster comments.

 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.cluster import KMeans

 num_clusters = 2  # not defined in the original snippet; any small value works
 sentence_list = ['hello how are you', "I am doing great", "my name is abc"]
 vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
 vectorized = vectorizer.fit_transform(sentence_list)
 km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
 km.fit(vectorized)

When I print vectorized, it gives me the index of each word together with its tf-idf score.
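
Each row of that printout is a (document_index, term_index) pair followed by the tf-idf score of that term in that document. To map a term index back to the actual word, a minimal sketch using the vectorizer above:

 # vocabulary_ maps each word to its column index in the tf-idf matrix
 for word, idx in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]):
     print(idx, word)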

So I'm curious: given that we only get tf-idf scores for individual words, how does it manage to cluster documents based on those individual words rather than evaluating each document as a whole? Or perhaps it does. Can someone explain the concept behind this to me?

+5
2 answers

You should take a look at how the Kmeans algorithm works. First, the stop words never make it into vectorized, so they are completely ignored by Kmeans and have no effect on how the documents are clustered. Now suppose you have:

 sentence_list=["word1", "word2", "word2 word3"] 

Let's say you need 2 clusters. In this case, you would expect the second and third documents to end up in the same cluster, because they share a common word, "word2". Let's see how this happens.

The numerical representation of the documents in vectorized looks like this:

 word1     word2     word3
 1.000000  0.000000  0.000000  # doc 1
 0.000000  1.000000  0.000000  # doc 2
 0.000000  0.605349  0.795961  # doc 3
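
If you want to reproduce these numbers yourself, a minimal sketch using scikit-learn's defaults (the exact values come from its smoothed-idf formula and L2 normalization):

 from sklearn.feature_extraction.text import TfidfVectorizer

 sentence_list = ["word1", "word2", "word2 word3"]
 vectorizer = TfidfVectorizer()
 vectorized = vectorizer.fit_transform(sentence_list)
 # in scikit-learn versions before 1.0 this method is get_feature_names()
 print(vectorizer.get_feature_names_out())  # ['word1' 'word2' 'word3']
 print(vectorized.toarray())                # the dense matrix shown above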

In the first step of Kmeans, some centroids are chosen at random from the data; suppose, for example, that document 1 and document 3 are picked as the initial centroids:

 Centroid 1: [1, 0.000000, 0.000000]
 Centroid 2: [0, 0.605349, 0.795961]

Now, if you compute the distance from each point (document) to each of the two centroids, you will see that:

  • document 1 has a distance of 0 to centroid 1, so it belongs to centroid 1
  • document 3 has a distance of 0 to centroid 2, so it belongs to centroid 2

Finally, we calculate the distance between the remaining document 2 and each centroid to find out which one it belongs to:

 >>> from scipy.spatial.distance import euclidean
 >>> euclidean([0, 1, 0], [1, 0, 0])                # dist(doc2, centroid1)
 1.4142135623730951
 >>> euclidean([0, 1, 0], [0, 0.605349, 0.795961])  # dist(doc2, centroid2)
 0.8884272507056005

So document 2 and the second centroid are closer, which means that document 2 is assigned to the second centroid.
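
To check this end to end, a quick sketch (assuming 2 clusters; the label numbers can vary between runs, but documents 2 and 3 should land in the same cluster):

 from sklearn.cluster import KMeans
 from sklearn.feature_extraction.text import TfidfVectorizer

 sentence_list = ["word1", "word2", "word2 word3"]
 vectorized = TfidfVectorizer().fit_transform(sentence_list)
 km = KMeans(n_clusters=2, init='k-means++', n_init=10)
 print(km.fit_predict(vectorized))  # e.g. [1 0 0]: documents 2 and 3 together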

+6

TF/IDF is a measure that computes the importance of a word in a document with respect to the rest of the words in that document. It does not compute the importance of a word in isolation. (And that makes sense, right? Because importance always means being privileged over others!) Thus, the TF/IDF of each word is actually a measure of how important that word is to the document it appears in.
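
As a concrete illustration, here is a sketch of how scikit-learn's tf-idf (assuming its default smoothed idf and L2 normalization) arrives at the scores shown for "word2 word3" in the other answer:

 import math

 # default scikit-learn idf: idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1,
 # after which each document row is L2-normalized
 n_docs = 3
 idf_word2 = math.log((1 + n_docs) / (1 + 2)) + 1  # "word2" occurs in 2 documents
 idf_word3 = math.log((1 + n_docs) / (1 + 1)) + 1  # "word3" occurs in 1 document

 norm = math.hypot(idf_word2, idf_word3)           # term frequency is 1 for both
 print(idf_word2 / norm, idf_word3 / norm)         # ~0.605349 0.795961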

I do not see where TF/IDF is used in your code. However, it is possible to run the kmeans algorithm with TF/IDF scores used as the features. Also, clustering the three sample documents you mentioned is simply not possible, since no two of them share a common word!

Edit 1: First of all, if the word "cat" appears in two documents, it is possible that they will be clustered together (depending on the other words in those two documents and on the other documents). Secondly, you should learn more about k-means. You see, kmeans uses features to cluster documents together, and the tf/idf score of every word in a document is a feature used to compare that document against the others in the corpus.
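
To make that concrete, a small sketch with three made-up documents, showing that the two documents sharing "cat" are closer to each other in tf/idf feature space than either is to the third:

 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.metrics.pairwise import cosine_similarity

 docs = ["the cat sat", "a cat ran", "dogs bark loudly"]  # hypothetical documents
 X = TfidfVectorizer().fit_transform(docs)
 print(cosine_similarity(X))  # only docs 0 and 1 have a nonzero off-diagonal entry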

+1

Source: https://habr.com/ru/post/1209632/

