I would recommend reading the entries from the CSV file manually, creating NamedVectors from them, and then using a SequenceFile.Writer to write the vectors to a sequence file. From there, the KMeansDriver run method should know how to process these files.
Sequence files encode key-value pairs, so the key is the ID of the sample (it must be a string), and the value is a VectorWritable wrapper around the vector.
Here is a simple code example on how to do this:
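    import java.util.LinkedList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.VectorWritable;

    // Collect the vectors to write; in practice, one NamedVector per CSV row
    List<NamedVector> vectors = new LinkedList<NamedVector>();
    NamedVector v1 = new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one");
    vectors.add(v1);

    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(config);
    Path path = new Path("datasamples/data");

    // Write a SequenceFile from the vectors: key = vector name, value = VectorWritable
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
    VectorWritable vec = new VectorWritable();
    for (NamedVector v : vectors) {
        vec.set(v);
        // Append the VectorWritable wrapper, not the raw vector
        writer.append(new Text(v.getName()), vec);
    }
    writer.close();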
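The snippet above hard-codes a single vector. For the CSV-reading step, something along these lines might work. This is only a rough sketch: it assumes each line has the form id,x1,x2,... with no header row and no quoted fields, and the names CsvRows and parse are made up for illustration. It uses only the standard library, so the lines are passed in as a list; in practice they could come from something like Files.readAllLines on your CSV file.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout or Hadoop.
final class CsvRows {

    /**
     * Parses lines of the assumed form "id,x1,x2,..." into an
     * insertion-ordered map from id to feature values.
     * Blank lines are skipped; a malformed number throws NumberFormatException.
     */
    static Map<String, double[]> parse(List<String> lines) {
        Map<String, double[]> rows = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // ignore empty lines
            }
            String[] fields = line.split(",");
            double[] values = new double[fields.length - 1];
            for (int i = 1; i < fields.length; i++) {
                values[i - 1] = Double.parseDouble(fields[i].trim());
            }
            rows.put(fields[0].trim(), values); // first field is the id
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, double[]> rows = parse(Arrays.asList("item1,0.1,0.2,0.5", "item2,1.0,2.0,3.0"));
        for (Map.Entry<String, double[]> row : rows.entrySet()) {
            System.out.println(row.getKey() + " has " + row.getValue().length + " features");
        }
    }
}
```

Each map entry would then be turned into a vector with new NamedVector(new DenseVector(values), id) and added to the list that is written to the sequence file. If your CSV has quoted fields or a header row, a proper CSV parser is probably a safer choice than split.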
In addition, I would recommend reading Chapter 8 of Mahout in Action. It goes into more detail on how data is represented in Mahout.