Java - introducing machine learning in text mining

I have several texts and would like to apply machine learning methods to them in Java using the Weka libraries. I have already written some code, but since it is too long to post in full, I will show only the key methods and ask how to train and test on my data set and how to interpret the results.

FYI, I process tweets from Twitter4J.

First, I collected tweets and saved them in a text file (in ARFF format, of course). Then I manually labeled them by sentiment (positive, neutral, negative). For the selected classifier, I created test sets from my training set using cross-validation. Finally, I classified them and printed a summary and the confusion matrix.

Here is the code for one of my classifiers, Naive Bayes:

public static void ApplyNaiveBayes(Instances data) throws Exception {

    System.out.println("Applying Naive Bayes \n");
    data.setClassIndex(data.numAttributes() - 1); 
    StringToWordVector swv = new StringToWordVector();
    swv.setInputFormat(data);
    Instances dataFiltered = Filter.useFilter(data, swv);
    //System.out.println("Filtered data " +dataFiltered.toString());

    System.out.println("\n\nFiltered data:\n\n" + dataFiltered);

    Instances[][] split = crossValidationSplit(dataFiltered, 10);
    Instances[] trainingSets = split[0];
    Instances[] testingSets = split[1];


    NaiveBayes classifier = new NaiveBayes(); 

    FastVector predictions = new FastVector();


    classifier.buildClassifier(dataFiltered);
    System.out.println("\n\nClassifier model:\n\n" + classifier);     

    // Test the model
    for (int i = 0; i < trainingSets.length; i++) {
        classifier.buildClassifier(trainingSets[i]);
        // Test the model         
        Evaluation eTest = new Evaluation(trainingSets[i]);
        eTest.evaluateModel(classifier, testingSets[i]);

        // Print the result to the Weka explorer:
        String strSummary = eTest.toSummaryString();
        System.out.println(strSummary);

        // Get the confusion matrix
        double[][] cmMatrix = eTest.confusionMatrix();
        for(int row_i=0; row_i<cmMatrix.length; row_i++){
            for(int col_i=0; col_i<cmMatrix.length; col_i++){
                System.out.print(cmMatrix[row_i][col_i]);
                System.out.print("|");
            }
            System.out.println();
        }
    }
}
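
For readers unfamiliar with the filtering step: StringToWordVector converts each tweet's text into word-based attributes that the classifier can work with. The snippet below is only a plain-Java sketch of that bag-of-words idea, not Weka's implementation; the class name, the tokenization rule, and the sample text are made up for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a plain-Java bag of words, not Weka's StringToWordVector.
class BagOfWordsSketch {

    // Lowercase the text, split on anything that is not a letter a-z,
    // and count how often each remaining word occurs.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical tweet text, made up for this example.
        System.out.println(wordCounts("Good day, good mood"));
    }
}
```

Each distinct word becomes one attribute, so a larger vocabulary means more columns in the filtered data.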

And FYI, crossValidationSplit method:

    public static Instances[][] crossValidationSplit(Instances data, int numberOfFolds) {
        Instances[][] split = new Instances[2][numberOfFolds];

        for (int i = 0; i < numberOfFolds; i++) {
            split[0][i] = data.trainCV(numberOfFolds, i);
            split[1][i] = data.testCV(numberOfFolds, i);
        }

        return split;
    }
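
To make the split concrete: with k folds, every instance lands in exactly one test fold, and the matching training fold is everything else. The following is a plain-Java sketch of the resulting fold sizes under a simple contiguous split; it is not Weka's code, and the data set size of 110 is a hypothetical value chosen so that each test fold has 11 instances.

```java
// Illustrative only: fold sizes for a contiguous k-fold split in plain Java.
class FoldSizesSketch {

    // Number of test instances in fold `fold` when n instances are cut into k folds.
    // The first (n % k) folds receive one extra instance.
    static int testSize(int n, int k, int fold) {
        return n / k + (fold < n % k ? 1 : 0);
    }

    // The training fold is every instance that is not in the test fold.
    static int trainSize(int n, int k, int fold) {
        return n - testSize(n, k, fold);
    }

    public static void main(String[] args) {
        int n = 110; // hypothetical data set size, not taken from the question
        int k = 10;
        for (int fold = 0; fold < k; fold++) {
            System.out.printf("fold %d: train=%d test=%d%n",
                    fold, trainSize(n, k, fold), testSize(n, k, fold));
        }
    }
}
```

Under these assumptions each model is trained on 99 instances and tested on 11, which is why a single fold's percentages move a lot when one prediction changes.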

In the end, I have 10 different results (because k = 10). One of them:

  Correctly Classified Instances           4               36.3636 %
  Incorrectly Classified Instances         7               63.6364 %
  Kappa statistic                          0.0723
  Mean absolute error                      0.427 
  Root mean squared error                  0.5922
  Relative absolute error                 93.4946 %
  Root relative squared error            116.5458 %
  Total Number of Instances               11     

  2.0|0.0|1.0|
  1.0|1.0|2.0|
  3.0|0.0|1.0|

So how should I interpret these results? Do you think I am handling the training and test sets correctly? I want to get the percentages of positive, neutral, and negative tweets in the given text files. How can I derive that from these results? Thanks for reading...
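
As a starting point for reading the fold output above, here is a plain-Java sketch that recomputes the accuracy and the Kappa statistic from the printed confusion matrix, plus the share of instances predicted for each class. It assumes that rows are actual classes and columns are predicted classes; which column corresponds to positive, neutral, or negative depends on the order the class values are declared in the ARFF file, which the post does not show. The class and method names are made up for illustration.

```java
import java.util.Locale;

// Plain-Java sketch: derive summary figures from one confusion matrix.
// Assumption: rows are actual classes, columns are predicted classes.
class ConfusionMatrixStats {

    static double total(double[][] m) {
        double sum = 0;
        for (double[] row : m) {
            for (double v : row) sum += v;
        }
        return sum;
    }

    // Fraction of instances on the diagonal (correctly classified).
    static double accuracy(double[][] m) {
        double diag = 0;
        for (int i = 0; i < m.length; i++) diag += m[i][i];
        return diag / total(m);
    }

    // Cohen's kappa: agreement beyond what the row/column totals give by chance.
    static double kappa(double[][] m) {
        double n = total(m);
        double expected = 0;
        for (int i = 0; i < m.length; i++) {
            double rowSum = 0, colSum = 0;
            for (int j = 0; j < m.length; j++) {
                rowSum += m[i][j];
                colSum += m[j][i];
            }
            expected += (rowSum / n) * (colSum / n);
        }
        return (accuracy(m) - expected) / (1 - expected);
    }

    // Share of all instances that were predicted as each class (column sums / total).
    static double[] predictedShare(double[][] m) {
        double n = total(m);
        double[] share = new double[m.length];
        for (double[] row : m) {
            for (int j = 0; j < row.length; j++) share[j] += row[j] / n;
        }
        return share;
    }

    public static void main(String[] args) {
        // The matrix printed for the fold shown in the question.
        double[][] m = {{2, 0, 1}, {1, 1, 2}, {3, 0, 1}};
        System.out.printf(Locale.ROOT, "accuracy = %.4f %%%n", 100 * accuracy(m));
        System.out.printf(Locale.ROOT, "kappa    = %.4f%n", kappa(m));
        for (double s : predictedShare(m)) {
            System.out.printf(Locale.ROOT, "predicted share = %.1f %%%n", 100 * s);
        }
    }
}
```

The first two figures reproduce the 36.3636 % and 0.0723 from the summary, so the summary and the matrix describe the same 11 test instances. A kappa this close to zero means the classifier does barely better than chance on this fold.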


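Your code builds the classifier twice. Once on the full filtered data set:

    classifier.buildClassifier(dataFiltered);

and then again on each training fold inside the for loop:

    for (int i = 0; i < trainingSets.length; i++) {
        classifier.buildClassifier(trainingSets[i]);
        ...
    }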

Instead, Evaluation.crossValidateModel() can run the cross-validation for you:

    // set the class index
    dataFiltered.setClassIndex(dataFiltered.numAttributes() - 1);
    // build a model -- choose whichever classifier you want
    classifier.buildClassifier(dataFiltered);
    Evaluation eval = new Evaluation(dataFiltered);
    eval.crossValidateModel(classifier, dataFiltered, 10, new Random(1));
    // print stats -- no need to calculate the confusion matrix yourself, Weka does it
    System.out.println(classifier);
    System.out.println(eval.toSummaryString());
    System.out.println(eval.toMatrixString());
    System.out.println(eval.toClassDetailsString());
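
With crossValidateModel() you get one set of statistics for the whole run rather than ten separate ones. Roughly, the per-fold results are pooled, which the plain-Java sketch below imitates by summing confusion matrices. This is an illustration of the idea, not Weka's code: the first matrix is the fold from the question, the second is a hypothetical fold made up for the example, and the class and method names are invented.

```java
// Illustrative only: pool per-fold confusion matrices into one overall result.
class PooledFoldsSketch {

    // Element-wise sum of square confusion matrices of equal size.
    static double[][] sum(double[][]... folds) {
        int n = folds[0].length;
        double[][] pooled = new double[n][n];
        for (double[][] fold : folds) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) pooled[i][j] += fold[i][j];
            }
        }
        return pooled;
    }

    // Correctly classified instances (diagonal) divided by all instances.
    static double accuracy(double[][] m) {
        double diag = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m.length; j++) {
                total += m[i][j];
                if (i == j) diag += m[i][j];
            }
        }
        return diag / total;
    }

    public static void main(String[] args) {
        double[][] foldFromQuestion = {{2, 0, 1}, {1, 1, 2}, {3, 0, 1}};
        double[][] hypotheticalFold = {{3, 0, 0}, {1, 2, 1}, {1, 0, 3}}; // made up
        double[][] pooled = sum(foldFromQuestion, hypotheticalFold);
        System.out.println("pooled accuracy = " + accuracy(pooled));
    }
}
```

Pooling over all folds gives a far more stable estimate than any single 11-instance fold.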




Source: https://habr.com/ru/post/1622596/

