Identify documents from mahout clustering results

I am using Mahout to cluster text documents indexed with Solr.

I used the text field of each document to build the vectors. Then I ran the k-means driver in Mahout for clustering, and used the ClusterDumper utility to dump the results.

I am having trouble understanding the output of the dump. I can see the clusters that were formed and the term vectors in those clusters, but how do I retrieve the documents from these clusters? I want the result to be the input documents that appear in each cluster.

1 answer

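The documents are the points assigned to each cluster, and they can be identified by name when the vectors are NamedVector instances. There are two options: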

  • Modify ClusterDumper.printClusters() so that it writes the name of each point in the cluster. For example:


    String clusterInfo = String.format("Cluster %d (%d) with %d points.\n", value.getId(), clusterCount, value.getNumPoints());
    writer.write(clusterInfo);
    writer.write('\n');

    // list all top terms
    if (dictionary != null) {
        String topTerms = getTopFeatures(value.getCenter(), dictionary, numTopFeatures);
        writer.write("\tTop Terms: ");
        writer.write(topTerms);
        writer.write('\n');
    }

    // list all the points in the cluster
    List<WeightedVectorWritable> points = clusterIdToPoints.get(value.getId());
    if (points != null) {
        writer.write("\tCluster points:\n\t");
        for (Iterator<WeightedVectorWritable> iterator = points.iterator(); iterator.hasNext();) {
            WeightedVectorWritable point = iterator.next();
            writer.write(String.valueOf(point.getWeight()));
            writer.write(": ");

            // the document name is only available when the vector is a NamedVector
            if (point.getVector() instanceof NamedVector) {
                writer.write(((NamedVector) point.getVector()).getName() + " ");
            }
        }
        writer.write('\n');
    }

  • Alternatively, use grep or a similar text tool to pull the point names out of the dump output.
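
As a rough illustration of the second option, here is a minimal sketch in plain Java that reads a dump and collects the document names per cluster. It assumes the dump has exactly the layout written by the snippet above (a "Cluster <id> (<n>) with <m> points." header, then a "Cluster points:" line followed by one line of "weight: name" pairs), that every vector is a NamedVector, and that names contain no whitespace. The class and method names (ClusterDumpParser, parse) are hypothetical and not part of Mahout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout: maps cluster id -> document names.
public class ClusterDumpParser {
    // Header line as produced by "Cluster %d (%d) with %d points.\n"
    private static final Pattern HEADER =
            Pattern.compile("^Cluster (-?\\d+) \\(-?\\d+\\) with \\d+ points\\.$");
    // One "weight: name" pair; assumes names contain no whitespace
    private static final Pattern POINT = Pattern.compile("([0-9.eE+-]+): (\\S+)");

    public static Map<Integer, List<String>> parse(List<String> lines) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        Integer current = null;      // id of the cluster being read
        boolean inPoints = false;    // true on the line right after "Cluster points:"
        for (String line : lines) {
            String trimmed = line.trim();
            Matcher header = HEADER.matcher(trimmed);
            if (header.matches()) {
                current = Integer.valueOf(header.group(1));
                result.put(current, new ArrayList<>());
                inPoints = false;
            } else if (trimmed.equals("Cluster points:")) {
                inPoints = current != null;
            } else if (inPoints) {
                // all points of a cluster sit on this single line
                Matcher point = POINT.matcher(line);
                while (point.find()) {
                    result.get(current).add(point.group(2));
                }
                inPoints = false;
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the text file written by the cluster dumper
        Map<Integer, List<String>> clusters = parse(Files.readAllLines(Paths.get(args[0])));
        for (Map.Entry<Integer, List<String>> e : clusters.entrySet()) {
            System.out.println(e.getKey() + "\t" + String.join(",", e.getValue()));
        }
    }
}
```

If the dump layout differs (for example, one point per line), only the two patterns and the "Cluster points:" check should need adjusting.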

Source: https://habr.com/ru/post/1769807/

