How to perform k-means clustering in Mahout with vector data stored as CSV?

Question

I have a file containing data vectors, where each line is a list of comma-separated values. How can I run k-means clustering on this data using Mahout? The wiki example mentions creating SequenceFiles, but I'm not sure whether I need some kind of conversion to produce them from my CSV data.

+6

mahout k-means

Dan q Jan 9 '12 at 8:01

2 answers

Maybe you could use Elephant Bird to write the Mahout vectors:

https://github.com/kevinweil/elephant-bird#hadoop-sequencefiles-and-pig

0

Daniel Pizarro Jan 22 '13 at 17:43

Bojana popovska · Accepted Answer · 2012-01-11T12:45:01+0000

I would recommend reading the entries from the CSV file manually, creating NamedVectors from them, and then using a SequenceFile writer to write the vectors to a sequence file. After that, the KMeansDriver run method should know how to process these files.

Sequence files encode key-value pairs. Here the key is the identifier of the sample (it must be a string), and the value is a VectorWritable wrapper around the vector.

Here is a simple code example of how to do this:

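     import java.util.LinkedList;
     import java.util.List;
     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.SequenceFile;
     import org.apache.hadoop.io.Text;
     import org.apache.mahout.math.DenseVector;
     import org.apache.mahout.math.NamedVector;
     import org.apache.mahout.math.VectorWritable;

     // build the vectors (in practice, one per CSV line)
     List<NamedVector> vectors = new LinkedList<NamedVector>();
     NamedVector v1;
     v1 = new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one");
     vectors.add(v1);

     Configuration config = new Configuration();
     FileSystem fs = FileSystem.get(config);

     Path path = new Path("datasamples/data");

     // write a SequenceFile from the vectors: key = name, value = VectorWritable
     SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
     VectorWritable vec = new VectorWritable();
     for (NamedVector v : vectors) {
         vec.set(v);
         // append the VectorWritable wrapper, not the raw NamedVector
         writer.append(new Text(v.getName()), vec);
     }
     writer.close();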
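
The example above hard-codes a single vector. The CSV-reading step could look roughly like the sketch below, which uses only the Java standard library. The class name CsvVectors and its methods are hypothetical helpers, not part of Mahout, and the sketch assumes plain numeric fields with no header row and no quoted values.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Mahout class): turns CSV text into double[] rows.
final class CsvVectors {

    // Parse one comma-separated line such as "0.1,0.2,0.5" into a double array.
    static double[] parseLine(String line) {
        String[] fields = line.split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    // Read every non-blank line of the file as one vector.
    // Assumes UTF-8 text, no header row, and no quoted fields.
    static List<double[]> readAll(Path csvFile) throws IOException {
        List<double[]> rows = new ArrayList<>();
        for (String line : Files.readAllLines(csvFile, StandardCharsets.UTF_8)) {
            if (!line.trim().isEmpty()) {
                rows.add(parseLine(line));
            }
        }
        return rows;
    }
}
```

Each returned array can then be wrapped as new NamedVector(new DenseVector(row), name) and written out as in the example above; the line number is one possible choice of name. Note that java.nio.file.Path here is a different class from Hadoop's org.apache.hadoop.fs.Path, so one of the two needs to be fully qualified if both are used in the same file.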

In addition, I would recommend reading Chapter 8 of Mahout in Action. It gives more details on data representation in Mahout.
