How to parse a CSV file so that it can be classified by Mahout

I am trying to classify a CSV file using Mahout. I understand that I first need to convert the CSV data into vectors, which can then be used by one of the Mahout classification algorithms. My CSV file consists of text values and several classes.


I searched here and found some vague explanations of how to do this, but could not find any examples. Can someone provide a simple example of how to do this? Or is there a utility that does it for you?

I assumed that this would be a very common task, but I really can't find clear examples.

Any help would be greatly appreciated.

1 answer

Your data consists of text and word-like values, so you should probably take the 20 Newsgroups example as a starting point. It is a good example, and you can easily adapt its code to your CSV file.
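The main difference from the newsgroup example is that each training document comes from a CSV row rather than from a file in a per-class directory. A minimal sketch of that step, using only the Java standard library, might look like the following. The class name `CsvRow`, the helper `parseLine`, and the column layout (text in the first column, class label in the last) are assumptions made for illustration, not something taken from the question's file.

```java
import java.util.ArrayList;
import java.util.List;

class CsvRow {
    // Hypothetical helper: splits one CSV line into fields.
    // Handles double-quoted fields, commas inside quotes, and "" as an escaped quote.
    // It does not handle fields that span several lines.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote inside a quoted field
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Assumed layout: text in the first column, class label in the last one.
        List<String> row = parseLine("\"Great phone, fast delivery\",positive");
        String text = row.get(0);
        String label = row.get(row.size() - 1);
        System.out.println(label + " <- " + text);
    }
}
```

The `text` value is what you would wrap in a `StringReader` and pass to the tokenizing method, and `label` takes the place of the newsgroup directory name. For real data a dedicated CSV library is probably a safer choice than a hand-written parser.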

Here is a working link to the 20 Newsgroups example:

https://github.com/jpatanooga/MahoutExamples/blob/master/src/main/java/com/cloudera/mahout/classification/sgd/TwentyNewsgroups.java

The only adaptation needed is in the countWords method, where the handling of the TokenStream object changes. Here is the code that works with the latest version of Mahout:

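import java.io.IOException;
import java.io.Reader;
import java.util.Collection;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

private static void countWords(Analyzer analyzer, Collection<String> words, Reader in) throws IOException {

        // use the provided analyzer to tokenize the input stream;
        // try-with-resources closes the stream even if tokenizing fails
        try (TokenStream ts = analyzer.tokenStream("text", in)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset(); // must be called before the first incrementToken()

            // for each word in the stream, minus non-word stuff, add word to collection
            while (ts.incrementToken()) {
                words.add(term.toString());
            }
            ts.end();
        }

        /*overallCounts.addAll(words);*/
    }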

I hope this helps. I adapted this example to work with a CSV file, and it worked.
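Once the words of a row have been collected, they still have to be turned into a fixed-size numeric vector before a classifier can train on them. As far as I remember, the linked example does this with Mahout's own feature encoder classes; the sketch below does not use them. It is a plain-Java illustration of the same general idea (hashing each word to an index and counting), and the class name `HashedBagOfWords`, the helper `encode`, and the vector size are made up for the example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

class HashedBagOfWords {
    // Hypothetical helper: maps each word to an index in [0, numFeatures)
    // by hashing, and counts how many words land on each index.
    // Different words can collide on the same index; a larger vector
    // makes that less likely.
    static double[] encode(Collection<String> words, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String word : words) {
            // floorMod keeps the index non-negative even for negative hash codes
            int index = Math.floorMod(word.hashCode(), numFeatures);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        // Words as they might come out of the tokenizing step for one CSV row.
        List<String> words = Arrays.asList("great", "phone", "fast", "delivery", "great");
        double[] features = encode(words, 32); // 32 is an arbitrary size for the demo
        System.out.println(Arrays.toString(features));
    }
}
```

Every row ends up with a vector of the same length regardless of how many words it contains, which is what the classifier needs. In a real Mahout program you would build a Mahout vector instead of a `double[]`.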


Source: https://habr.com/ru/post/1543214/

