Changes in clustering results after each run in Python scikit-learn

I have a bunch of sentences and I want to group them using scikit-learn spectral clustering. I run the code and get the results without any problems. But every time I run it, I get different results. I know this is a problem with initialization, but I don't know how to fix it. This is the part of my code that works with the sentences:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import kneighbors_graph
    from sklearn.metrics.pairwise import euclidean_distances
    from sklearn import cluster

    vectorizer = TfidfVectorizer(norm='l2', sublinear_tf=True, tokenizer=tokenize,
                                 stop_words='english', charset_error="ignore",
                                 ngram_range=(1, 5), min_df=1)
    X = vectorizer.fit_transform(data)

    # connectivity matrix for structured Ward
    connectivity = kneighbors_graph(X, n_neighbors=5)
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)

    distances = euclidean_distances(X)

    spectral = cluster.SpectralClustering(n_clusters=number_of_k, eigen_solver='arpack',
                                          affinity="nearest_neighbors",
                                          assign_labels="discretize")
    spectral.fit(X)

data is a list of sentences. Every time the code runs, my clustering results are different. How can I get consistent results with spectral clustering? I also have the same problem with KMeans. This is my KMeans code:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english',
                                 charset_error="ignore")
    X_data = vectorizer.fit_transform(data)

    km = KMeans(n_clusters=number_of_k, init='k-means++', max_iter=100,
                n_init=1, verbose=0)
    km.fit(X_data)

I appreciate your help.

+5
4 answers

When using KMeans, you want to set the random_state parameter (see the documentation). Set it to either an int or a RandomState instance.

    km = KMeans(n_clusters=number_of_k, init='k-means++', max_iter=100,
                n_init=1, verbose=0, random_state=3425)
    km.fit(X_data)

This is important because k-means is not a deterministic algorithm. It usually starts with some randomized initialization procedure, and this randomness means that different runs will start from different points. Seeding the pseudo-random number generator ensures that this randomness is the same for identical seeds.

However, I am not sure about the spectral clustering example. From the documentation for the random_state parameter: "A pseudo random number generator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver == 'amg' and by the K-Means initialization." The OP's code does not obviously fall into either of these cases, although setting the parameter may be worth a shot.
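
For what it's worth, trying it would just mean passing a fixed seed to SpectralClustering as well; a minimal sketch (the seed value 42 is arbitrary, X and number_of_k are the OP's variables):

    # Same call as the OP's, plus a fixed random_state so repeated runs
    # use the same pseudo-random initialization.
    spectral = cluster.SpectralClustering(n_clusters=number_of_k,
                                          eigen_solver='arpack',
                                          affinity="nearest_neighbors",
                                          assign_labels="discretize",
                                          random_state=42)
    spectral.fit(X)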

+14

As others have already noted, k-means is usually implemented with randomized initialization. It is by design that you may get different results.

The algorithm is only a heuristic. It may converge to suboptimal results. Running it several times gives you a better chance of finding a good result.

In my opinion, when the results vary strongly from run to run, this indicates that the data simply does not cluster well with k-means at all. In that case your results are not much better than random. If the data is really suited to k-means clustering, the results will be fairly stable. If they vary, the clusters may not have similar sizes, or may not be well separated; and other algorithms may give better results.
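
To make that stability check concrete, here is a small sketch (the feature matrix X and cluster count k are placeholders, not from the question) that runs k-means with several different seeds and compares the resulting labelings with the adjusted Rand index; scores near 1.0 across runs suggest stable clusters:

    from itertools import combinations
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    # Run k-means several times with different seeds on the same data X.
    labelings = [KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
                 for seed in range(5)]

    # Compare every pair of runs; ARI close to 1.0 means the partitions agree.
    for a, b in combinations(range(len(labelings)), 2):
        print(a, b, adjusted_rand_score(labelings[a], labelings[b]))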

+1

I had a similar problem, but I wanted a dataset from another distribution to be clustered in the same way as the original dataset. For example, all color images of the original dataset were in cluster 0 and all gray images were in cluster 1. For the other dataset, I wanted the color images and gray images to end up in cluster 0 and cluster 1 as well.

Here is the code I stole from a Kaggle kernel — in addition to setting random_state to a fixed seed, you reuse the fitted KMeans model to cluster the other dataset. This works quite well. However, I could not find an official scikit-learn document that says this.

    # reference - https://www.kaggle.com/kmader/normalizing-brightfield-stained-and-fluorescence
    import numpy as np
    from sklearn.cluster import KMeans

    seed = 42

    def create_color_clusters(img_df, cluster_count=2, cluster_maker=None):
        # Fit a new model only when none is passed in; otherwise reuse it.
        if cluster_maker is None:
            cluster_maker = KMeans(cluster_count, random_state=seed)
            cluster_maker.fit(img_df[['Green', 'Red-Green', 'Red-Green-Sd']])
        # transform() gives distances to each cluster center; argmin picks the nearest.
        img_df['cluster-id'] = np.argmin(
            cluster_maker.transform(img_df[['Green', 'Red-Green', 'Red-Green-Sd']]), -1)
        return img_df, cluster_maker

    # Now k-means your images img_df into two clusters
    img_df, cluster_maker = create_color_clusters(img_df, 2)
    # Cluster another set of images using the same fitted KMeans model
    another_img_df, _ = create_color_clusters(another_img_df, 2, cluster_maker)

However, even setting random_state to an int seed cannot guarantee that the same data will always be grouped under the same labels across machines. The same data may end up labeled cluster 0 on one machine and cluster 1 on another. But at least with the same fitted KMeans model (cluster_maker in my code) we make sure that data from another distribution is clustered in the same way as the original dataset.
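
As a side note (my addition, not from the original kernel): KMeans.predict() is the documented way to assign new samples to the nearest learned center, and it gives the same result as the argmin-over-transform() trick above:

    # predict() returns the index of the closest cluster center for each sample,
    # equivalent to np.argmin over the transform() distances.
    labels = cluster_maker.predict(another_img_df[['Green', 'Red-Green', 'Red-Green-Sd']])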

+1

As a rule, when running algorithms with many local minima, it is common to take a stochastic approach and run the algorithm repeatedly from different initial states. This gives you several results, and the one with the lowest error is usually chosen as the best.

When I use K-Means, I always run it several times and use the best result.
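
In scikit-learn, KMeans can do this restarting for you via the n_init parameter: it runs k-means n_init times with different centroid seeds and keeps the run with the lowest inertia (within-cluster sum of squared distances). A minimal sketch, reusing the question's variables:

    from sklearn.cluster import KMeans

    # n_init=10 runs k-means 10 times from different initializations and
    # keeps the best result as measured by inertia.
    km = KMeans(n_clusters=number_of_k, init='k-means++', n_init=10, random_state=0)
    km.fit(X_data)
    print(km.inertia_)  # error of the best run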

0

Source: https://habr.com/ru/post/1202906/

