I am trying to use Weka's K-Means (SimpleKMeans) to cluster a dataset, exploring how weighting the attributes differently affects the clustering.
However, when I adjust the weight of each attribute, I see no difference in the resulting clusters.
// Initialize file readers ...
Instances dataSet = readDataFile(dataReader);
double[][] modifiers = readNormalizationFile(normReader, dataSet.numAttributes());
normalize(dataSet, modifiers);

SimpleKMeans kMeans = new SimpleKMeans();
kMeans.setPreserveInstancesOrder(true);
int[] clusters = null;
try {
    System.out.println(kMeans.getSeed());
    if (distMet != 0)
        kMeans.setDistanceFunction(new ManhattanDistance(dataSet));
    kMeans.setNumClusters(k);
    kMeans.buildClusterer(dataSet);
    clusters = kMeans.getAssignments();
} catch (Exception e) {
    e.printStackTrace();
}
// Print clusters
The first dimension of the modifiers array corresponds to the attributes; each entry holds two values. The first is subtracted from the attribute value, and the result is then divided by the second.
Normalization is as follows:
public static void normalize(Instances dataSet, double[][] modifiers) {
    for (int i = 0; i < dataSet.numInstances(); i++) {
        Instance currInst = dataSet.instance(i);
        double[] values = currInst.toDoubleArray();
        for (int j = 0; j < values.length; j++) {
            currInst.setValue(j, (values[j] - modifiers[j][0]) / modifiers[j][1]);
        }
    }
}
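As a sanity check, the shift-and-divide rule above can be reproduced on a plain array, outside of Weka's Instance types. This is a minimal sketch with made-up modifier values, not part of my actual program:

```java
public class NormalizeDemo {
    // Same rule as normalize(): (value - modifiers[j][0]) / modifiers[j][1]
    static double[] normalizeRow(double[] values, double[][] modifiers) {
        double[] out = new double[values.length];
        for (int j = 0; j < values.length; j++) {
            out[j] = (values[j] - modifiers[j][0]) / modifiers[j][1];
        }
        return out;
    }

    public static void main(String[] args) {
        // Attribute 0: subtract 10, divide by 2. Attribute 1: subtract 0, divide by 100.
        double[][] modifiers = { {10.0, 2.0}, {0.0, 100.0} };
        double[] row = { 14.0, 50.0 };
        double[] norm = normalizeRow(row, modifiers);
        // (14 - 10) / 2 = 2.0 ; (50 - 0) / 100 = 0.5
        System.out.println(norm[0] + " " + norm[1]);
    }
}
```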
My expectation is that increasing an attribute's second (divisor) value should reduce that attribute's importance for clustering and therefore change the cluster assignments, but this is not what I observe. My debugger shows that the correctly scaled values are being passed to the clusterer, and I find it hard to believe that Weka is at fault rather than my own code.
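To make that expectation concrete, here is a minimal sketch (plain Java, no Weka, invented values) showing why dividing one attribute by a larger constant should shrink its contribution to a Euclidean distance, and hence change which points end up near each other:

```java
public class WeightDemo {
    // Plain squared Euclidean distance between two rows.
    static double sqDist(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double d = a[j] - b[j];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Attribute 1 dominates: (1-2)^2 + (100-300)^2 = 40001
        double[] x = { 1.0, 100.0 };
        double[] y = { 2.0, 300.0 };
        System.out.println(sqDist(x, y));

        // Divide attribute 1 by 100 on both rows: its influence collapses.
        // (1-2)^2 + (1-3)^2 = 5
        double[] xs = { 1.0, 1.0 };
        double[] ys = { 2.0, 3.0 };
        System.out.println(sqDist(xs, ys));
    }
}
```

So if the scaled values really reach the distance computation, the assignments should differ.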
Am I using Weka's K-Means correctly, or am I missing something important?