Let's look at how the KMeans algorithm works here. First, note that stop words never reach the vectorizer, so they are completely ignored and have no effect on how the documents are clustered. Now suppose you have:
sentence_list=["word1", "word2", "word2 word3"]
Let's say you need 2 clusters. In this case, you expect the second and third documents to be in the same cluster, because they have a common word. Let's see how this happens.
The numerical representation of vectorized documents looks like this:
   word1     word2     word3
   1.000000  0.000000  0.000000   # doc 1
   0.000000  1.000000  0.000000   # doc 2
   0.000000  0.605349  0.795961   # doc 3
In the first KMeans step, initial centroids are picked from the data (by default at random); suppose document 1 and document 3 become the initial centroids:
Centroid 1: [1.000000, 0.000000, 0.000000]
Centroid 2: [0.000000, 0.605349, 0.795961]
Now, if you compute the distance from each point (document) to each of the two centroids, you will see that:
- document 1 has a distance of 0 to centroid 1, so it is assigned to centroid 1
- document 3 has a distance of 0 to centroid 2, so it is assigned to centroid 2
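Those zero distances are easy to verify with scipy (the vectors below are copied from the matrix above; the initial centroids are the documents themselves):

```python
from scipy.spatial.distance import euclidean

doc1 = [1.0, 0.0, 0.0]
doc3 = [0.0, 0.605349, 0.795961]

# the initial centroids coincide with the documents they were drawn from
centroid1, centroid2 = doc1, doc3

print(euclidean(doc1, centroid1))  # 0.0
print(euclidean(doc3, centroid2))  # 0.0
```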
Finally, we calculate the distance between the remaining document 2 and each centroid to find out which cluster it belongs to:
>>> from scipy.spatial.distance import euclidean
>>> euclidean([0, 1, 0], [1, 0, 0])                # dist(doc2, centroid1)
1.4142135623730951
>>> euclidean([0, 1, 0], [0, 0.605349, 0.795961])  # dist(doc2, centroid2)
0.8884272507056005
So, document 2 is closer to the second centroid, which means it is assigned to the second centroid, and documents 2 and 3 end up in the same cluster, as expected.
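The whole walk-through can be checked end to end with a short sketch (assuming scikit-learn's `TfidfVectorizer` and `KMeans`; the `n_init` and `random_state` arguments are only there to make this demo deterministic, they are not part of the explanation above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentence_list = ["word1", "word2", "word2 word3"]

# TF-IDF vectorization; stop words (if any) would be dropped at this step
X = TfidfVectorizer().fit_transform(sentence_list)

# Cluster into 2 groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Documents 2 and 3 share a label; document 1 gets the other one
print(km.labels_)
```

The exact label values (0 or 1) depend on the random initialization, but the partition itself is stable: the two documents sharing word2 always land together.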