Confusion matrix for clustering in scikit-learn

I have a dataset with known labels. I want to try clustering and see whether I get the same clusters as the known labels. To measure accuracy, I need something like a confusion matrix.

I know that I can easily get a confusion matrix for the test set of a classification problem. I have already tried that.

However, it cannot be used for clustering, because it expects the rows and columns to share the same set of labels, which makes sense for a classification problem. For a clustering problem, what I expect is something like this:

Rows are the actual labels

Columns are the new cluster names (i.e. cluster-1, cluster-2, etc.)

Is there any way to do this?

Edit: more detailed information.

sklearn.metrics.confusion_matrix expects y_test and y_pred to contain the same set of values, and labels to be the labels for those values.

That is why it returns a matrix that has the same labels for both rows and columns.

But in my case (KMeans clustering), the true values are strings, and the predicted values are numbers (i.e. the cluster number).

Therefore, if I call confusion_matrix(y_true, y_pred), it raises the error below.

 ValueError: Mix of label input types (string and number) 

This is the real problem. For a classification problem, this restriction makes sense. But for a clustering task, it should not apply, because the true label names and the new cluster names do not have to be the same.

With this, I understand that I am trying to use a tool meant for classification tasks on a clustering problem. So my question is: is there a way to get such a matrix for clustered data?

Hope the issue has become clearer. Please let me know if this is not the case.

2 answers

I wrote the code myself.

    # Compute a confusion matrix: rows are actual labels, columns are clusters
    def confusion_matrix(act_labels, pred_labels):
        # Sort so that the row and column order is deterministic
        unique_labels = sorted(set(act_labels))
        clusters = sorted(set(pred_labels))
        cm = [[0 for _ in clusters] for _ in unique_labels]
        for act_label, pred_label in zip(act_labels, pred_labels):
            # Look up positions instead of using the cluster id as an index,
            # so cluster ids do not have to be exactly 0..k-1
            cm[unique_labels.index(act_label)][clusters.index(pred_label)] += 1
        return cm

    # Example
    labels = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
    pred = [1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 1, 2]
    cnf_matrix = confusion_matrix(labels, pred)
    print('\n'.join(''.join('{:4}'.format(item) for item in row) for row in cnf_matrix))

Edit: (Dayyyuumm) I just discovered that I could do this easily with pandas crosstab :-/

    import pandas as pd

    labels = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
    pred = [1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 1, 2]

    # Create a DataFrame with labels and clusters as columns
    df = pd.DataFrame({'Labels': labels, 'Clusters': pred})

    # Create the crosstab: rows are labels, columns are clusters
    ct = pd.crosstab(df['Labels'], df['Clusters'])

    # Display the crosstab
    print(ct)
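
Since the original goal was to measure accuracy, a table like this can be reduced to a single number. One common choice is purity: treat each cluster as if it predicted its most frequent actual label, then take the fraction of points that agree. The following is only a rough sketch in plain Python. The function name `purity` is made up for illustration and is not part of pandas or scikit-learn, and it assumes the input is not empty.

```python
from collections import Counter, defaultdict

def purity(act_labels, pred_labels):
    """Fraction of points whose own label is the majority label of their cluster."""
    # For each cluster, count how often each actual label occurs in it
    per_cluster = defaultdict(Counter)
    for act, cluster in zip(act_labels, pred_labels):
        per_cluster[cluster][act] += 1
    # Each cluster contributes the size of its largest label group
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(act_labels)

labels = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
pred = [1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 1, 2]
print(purity(labels, pred))
```

Purity does not penalize having many small clusters, so it is best read next to the full table rather than on its own.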

You can easily compute the pairwise intersection matrix.

But you may need to do this yourself, since the sklearn function is designed for the classification use case.
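
As a minimal sketch of what "do this yourself" could look like, the snippet below counts how many points fall into each (actual label, cluster) pair using only the standard library. The name `intersection_matrix` and its return layout are assumptions made for this example. scikit-learn also provides `sklearn.metrics.cluster.contingency_matrix`, which builds this kind of table, so writing it by hand is mainly useful when you want to avoid the dependency.

```python
from collections import Counter

def intersection_matrix(act_labels, pred_labels):
    """Count, for every (label, cluster) pair, how many points fall in both."""
    row_names = sorted(set(act_labels))   # actual labels become rows
    col_names = sorted(set(pred_labels))  # cluster ids become columns
    pair_counts = Counter(zip(act_labels, pred_labels))
    # Counter returns 0 for pairs that never occur together
    matrix = [[pair_counts[(r, c)] for c in col_names] for r in row_names]
    return row_names, col_names, matrix

rows, cols, matrix = intersection_matrix(
    ['a', 'b', 'c', 'a', 'b', 'c'],
    [1, 1, 2, 0, 1, 2],
)
for name, counts in zip(rows, matrix):
    print(name, counts)
```

Because rows and columns are built from separate sets, the labels can be strings while the cluster ids are integers, which is exactly the case that confusion_matrix rejects.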


Source: https://habr.com/ru/post/1273932/

