Weka always produces the same clusters for different data.

I am trying to use Weka to run K-Means clustering on a dataset and explore how different weights on the attributes affect the result.

However, when I adjust the weight of each attribute, I do not see any difference in the clustering.

    // Initialize file readers
    ...
    Instances dataSet = readDataFile(dataReader);
    double[][] modifiers = readNormalizationFile(normReader, dataSet.numAttributes());
    normalize(dataSet, modifiers);

    SimpleKMeans kMeans = new SimpleKMeans();
    kMeans.setPreserveInstancesOrder(true);
    int[] clusters = null;
    try {
        System.out.println(kMeans.getSeed());
        if (distMet != 0)
            kMeans.setDistanceFunction(new ManhattanDistance(dataSet));
        kMeans.setNumClusters(k);
        kMeans.buildClusterer(dataSet);
        clusters = kMeans.getAssignments();
    }
    // Print clusters

The first dimension of the modifiers array indexes the attributes, and each entry holds two values. The first is subtracted from the attribute value, and the result is then divided by the second.

The normalization is as follows:

    public static void normalize(Instances dataSet, double[][] modifiers) {
        for (int i = 0; i < dataSet.numInstances(); i++) {
            Instance currInst = dataSet.instance(i);
            double[] values = currInst.toDoubleArray();
            for (int j = 0; j < values.length; j++) {
                // Subtract the offset, then divide by the scale for attribute j.
                currInst.setValue(j, (values[j] - modifiers[j][0]) / modifiers[j][1]);
            }
        }
    }

My expectation is that increasing the second normalization value (the divisor) should reduce the importance of that attribute for clustering and therefore change the resulting clusters, but this is not what I am observing. My debugger shows that correctly normalized values are being passed to the clusterer, but I find it hard to believe that the mistake is Weka's rather than mine.

Am I using Weka's K-Means correctly, or am I missing something important?

1 answer

The distance measures that extend NormalizableDistance (for example, EuclideanDistance and ManhattanDistance) have an option called dontNormalize, which controls whether the attribute values are automatically normalized for you. By default, this automatic normalization is enabled (dontNormalize is false), which can undo all the work done in your normalize function.
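To see why the rescaling would be cancelled, here is a short derivation. It assumes that the automatic normalization is a per-attribute min-max rescaling to [0, 1], which is my understanding of what NormalizableDistance does, and that the divisor b is positive. The symbols a and b (standing for modifiers[j][0] and modifiers[j][1]) are introduced here for illustration only.

```latex
% Original attribute value x, rescaled in normalize() as y = (x - a) / b with b > 0.
% Because b > 0, the transform preserves order, so
%   y_min = (x_min - a) / b   and   y_max = (x_max - a) / b.
\[
\frac{y - y_{\min}}{y_{\max} - y_{\min}}
  = \frac{\dfrac{x - a}{b} - \dfrac{x_{\min} - a}{b}}
         {\dfrac{x_{\max} - a}{b} - \dfrac{x_{\min} - a}{b}}
  = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\]
```

Both a and b drop out, so under that assumption the distance function sees exactly the same values whatever modifiers are used, and the clusters cannot change.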

I ran a test on a random dataset, then modified the data of one attribute for a second test, and the two clusterings were identical. Setting dontNormalize to true (by calling setDontNormalize(true) on the distance function) led to different clusters, and therefore to a different assignment of instances to clusters.
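The effect described above can be reproduced in miniature without Weka. The sketch below is plain Java and only imitates a min-max rescaling; the class MinMaxDemo and its methods are hypothetical names made up for this illustration and are not part of Weka's API. It applies the question's (value - offset) / divisor transform to one column and shows that the min-max result is the same as for the raw column, up to floating-point rounding.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration only; it is not part of Weka.
class MinMaxDemo {

    // Apply the question's transform (value - offset) / divisor to every element.
    static double[] affine(double[] column, double offset, double divisor) {
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = (column[i] - offset) / divisor;
        }
        return out;
    }

    // Rescale a column to [0, 1] using its own minimum and maximum
    // (assumed here to mimic the automatic normalization in the distance function).
    static double[] minMax(double[] column) {
        double min = Arrays.stream(column).min().orElse(0.0);
        double max = Arrays.stream(column).max().orElse(0.0);
        double width = max - min;
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            // A constant column has zero width; map it to 0 to avoid dividing by zero.
            out[i] = width == 0.0 ? 0.0 : (column[i] - min) / width;
        }
        return out;
    }

    // Largest element-wise absolute difference between two equally long arrays.
    static double maxAbsDifference(double[] a, double[] b) {
        double worst = 0.0;
        for (int i = 0; i < a.length; i++) {
            worst = Math.max(worst, Math.abs(a[i] - b[i]));
        }
        return worst;
    }

    public static void main(String[] args) {
        double[] raw = {2.0, 5.0, 11.0, 8.0};
        double[] weighted = affine(raw, 1.0, 4.0); // "down-weight" the attribute by 4

        double[] normRaw = minMax(raw);
        double[] normWeighted = minMax(weighted);

        System.out.println(Arrays.toString(normRaw));
        System.out.println(Arrays.toString(normWeighted));
        System.out.println("identical within tolerance: "
                + (maxAbsDifference(normRaw, normWeighted) < 1e-12));
    }
}
```

With normalization switched off in the distance function, the weighted column would be used as is, which is why the clusters start to differ once dontNormalize is set to true.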

Hope this helps!


Source: https://habr.com/ru/post/1206377/

