Clusterdump result with vector names

I created my vectors as described in this question and ran Mahout k-means on the data.

Since I use Mahout 0.7, the clusterdump command does not work as described in Mahout in Action, but I got it to work as follows:

 export HADOOP_CLASSPATH=/path/to/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar:/path/to/mahout-distribution-0.7/integration/target/mahout-integration-0.7.jar
 hadoop jar core/target/mahout-core-0.7-job.jar org.apache.mahout.utils.clustering.ClusterDumper -i /clustering/out/clusters-20-final -o textout -of TEXT

and I get lines like this:

 VL-1383471{n=192 c=[0.180, -0.087, 0.281, 0.512, 0.678, 1.833, 2.613, 0.313, 0.226, 1.023, 0.229, -0.104, -0.461, -0.553, -0.318, 0.315, 0.658, 0.245, 0.635, 0.220, 0.660, 0.193, 0.277, -0.182, 0.497, 0.346, 0.658, 0.660, 0.191, 0.660, 0.636, 0.018, 0.519, 0.335, 0.535, 0.008, -0.028, 0.461, 0.229, 0.287, 0.619, 0.509, 0.566, 0.389, -0.075, -0.180, -0.461, 0.381, -0.108, 0.126, -0.728] r=[0.983, 0.890, 0.384, 0.823, 0.702, 0.000, 0.000, 1.132, 0.605, 0.979, 0.897, 0.862, 0.438, 0.546, 0.390, 0.171, 0.257, 0.234, 0.251, 0.106, 0.257, 0.093, 0.929, 0.077, 0.204, 0.218, 0.257, 0.257, 0.258, 0.257, 0.249, 0.112, 0.217, 0.157, 0.284, 0.197, 0.228, 0.229, 0.323, 0.401, 0.248, 0.217, 0.269, 1.002, 0.819, 0.706, 0.412, 0.964, 0.787, 0.872, 0.172]} 

which is still not useful to me, since I need the names of my vectors in each cluster. I saw that a dictionary file is created for text documents. How do I create a dictionary for my data?

Also, using -of CSV gives me an empty file. Am I doing something wrong?

Another thing I tried was to read the clusters-20-final/part-m-00000 file directly, as in Listing 7.2 of Mahout in Action. It turns out that the file does not contain WeightedVectorWritable values but ClusterWritable values, from which I can get a Cluster instance, but that instance does not contain any of the relevant vectors.

2 answers

A little late, but it might someday help someone.

When you run

 KMeansDriver.run(input, clustersIn, outputPath, measure, convergenceDelta, maxIterations, true, 0.0, false); 

one of the outputs is a directory called clusteredPoints. It contains a part file with all the clustered vectors. Reading it with something like this

 IntWritable key = new IntWritable();
 WeightedVectorWritable value = new WeightedVectorWritable();
 Path clusteredPoints = new Path(output + "/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000");
 FileSystem fs = FileSystem.get(clusteredPoints.toUri(), new Configuration());
 try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, clusteredPoints, fs.getConf())) {
   while (reader.next(key, value)) {
     // Do something useful here
     ((NamedVector) value.getVector()).getName();
   }
 } catch (Throwable t) {
   throw t;
 }

seems to do the trick. Using something like this, I was able to get a good idea of what was grouped where while running my k-means tests with Mahout.
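For the "do something useful" part, one option is to collect the vector names per cluster. The following is only a minimal sketch in plain Java with no Mahout dependency: the class name `ClusterMembers` and the sample names are made up for illustration, and it assumes the IntWritable key read in the loop holds the cluster id, so that `add(key.get(), name)` would be called once per record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: collects vector names per cluster id. */
class ClusterMembers {
    // TreeMap keeps the cluster ids sorted when the result is printed.
    private final Map<Integer, List<String>> namesByCluster = new TreeMap<>();

    /** Records that the vector with the given name belongs to the given cluster. */
    void add(int clusterId, String vectorName) {
        namesByCluster.computeIfAbsent(clusterId, id -> new ArrayList<>()).add(vectorName);
    }

    Map<Integer, List<String>> asMap() {
        return namesByCluster;
    }

    public static void main(String[] args) {
        ClusterMembers members = new ClusterMembers();
        // Made-up pairs standing in for the records read from clusteredPoints.
        members.add(7, "doc-a");
        members.add(3, "doc-b");
        members.add(7, "doc-c");
        members.asMap().forEach((id, names) -> System.out.println(id + " -> " + names));
    }
}
```

Inside the reader loop this would amount to something like `members.add(key.get(), ((NamedVector) value.getVector()).getName())`, again under the assumption about the key stated above.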

I used Mahout 0.8 when I did this.


(A very late answer, but since I just spent the day figuring this out, I thought I would share it.)

What you are missing is a dictionary that maps the name of each vector dimension to its index. clusterdump uses this dictionary to give you the names of the various dimensions in the vector.

When running clusterdump, you can specify two additional flags:

  • -d: dictionary file
  • -dt: dictionary file type (text | sequencefile)

Here is an example call:

 mahout clusterdump -i clusteringExperiment/exp1/initialCentroids/clusters-0-final -d clusteringExperiment/dictionary/vectorDimensions -dt sequencefile 

and your result will look something like this:

 VL-0{n=185 c=[A:0.006, G:0.550, M:0.011, O:0.026, S:0.000, T:0.072, U:0.096, V:0.010] r=[A:0.029, G:0.176, M:0.043, O:0.054, S:0.001, T:0.098, U:0.113, V:0.035]} 
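If you want to post-process such labelled lines, a small parser is enough. This is just a sketch under my own assumptions: the class name `ClusterDumpLine` is made up, and it only handles the `c=[name:value, ...]` layout shown above (it rejects unlabelled entries rather than guessing).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the labelled centroid from one clusterdump text line. */
class ClusterDumpLine {
    // Captures everything between "c=[" and the next "]".
    private static final Pattern CENTROID = Pattern.compile("c=\\[([^\\]]*)\\]");

    /** Returns dimension name -> centroid value, in the order they appear in the line. */
    static Map<String, Double> centroid(String line) {
        Matcher m = CENTROID.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("No c=[...] block in: " + line);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (String entry : m.group(1).split(",")) {
            String[] parts = entry.trim().split(":", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Entry is not labelled: " + entry);
            }
            result.put(parts[0], Double.parseDouble(parts[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "VL-0{n=185 c=[A:0.006, G:0.550, M:0.011] r=[A:0.029, G:0.176, M:0.043]}";
        System.out.println(centroid(line));
    }
}
```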

Note that the dictionary is a simple key-value file, where the key is the category name (a string) and the value is its numerical index.
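The mapping itself is easy to build from the ordered list of dimension names. Below is a minimal sketch in plain Java: the class name `DimensionDictionary` is made up, and the assumption is that the index of a name is simply its position in your vectors. Writing the result out in the file format clusterdump expects is not shown here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: builds the name -> index mapping for the vector dimensions. */
class DimensionDictionary {
    /** Assigns each dimension name its position in the list, starting at 0. */
    static Map<String, Integer> build(List<String> dimensionNames) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (int i = 0; i < dimensionNames.size(); i++) {
            String name = dimensionNames.get(i);
            // put returns the previous index if the name was already present.
            if (dictionary.put(name, i) != null) {
                throw new IllegalArgumentException("Duplicate dimension name: " + name);
            }
        }
        return dictionary;
    }

    public static void main(String[] args) {
        // Names taken from the sample output above; their order here is an assumption.
        Map<String, Integer> dictionary = build(Arrays.asList("A", "G", "M", "O", "S", "T", "U", "V"));
        dictionary.forEach((name, index) -> System.out.println(name + "\t" + index));
    }
}
```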


Source: https://habr.com/ru/post/905495/

