I have a collection of texts that I would like to classify using machine learning methods in Java with the Weka libraries. I have already written some code, but since the whole program is too long to post, I will show only the key methods. I would like to understand how to train and test on my data set and how to interpret the results.
FYI, I am processing tweets retrieved with Twitter4J.
First, I collected tweets and saved them to a text file in ARFF format. Then I manually labeled each one by sentiment (positive, neutral, negative). For the selected classifier, I derived test sets from my training data using cross-validation. Finally, I classified the test instances and printed a summary and the confusion matrix for each fold.
Here is the code for one of my classifiers, Naive Bayes:
    public static void ApplyNaiveBayes(Instances data) throws Exception {
        System.out.println("Applying Naive Bayes \n");
        data.setClassIndex(data.numAttributes() - 1);
        StringToWordVector swv = new StringToWordVector();
        swv.setInputFormat(data);
        Instances dataFiltered = Filter.useFilter(data, swv);
        System.out.println("\n\nFiltered data:\n\n" + dataFiltered);
        Instances[][] split = crossValidationSplit(dataFiltered, 10);
        Instances[] trainingSets = split[0];
        Instances[] testingSets = split[1];
        NaiveBayes classifier = new NaiveBayes();
        FastVector predictions = new FastVector();
        classifier.buildClassifier(dataFiltered);
        System.out.println("\n\nClassifier model:\n\n" + classifier);
        for (int i = 0; i < trainingSets.length; i++) {
            classifier.buildClassifier(trainingSets[i]);
            Evaluation eTest = new Evaluation(trainingSets[i]);
            eTest.evaluateModel(classifier, testingSets[i]);
            String strSummary = eTest.toSummaryString();
            System.out.println(strSummary);
            double[][] cmMatrix = eTest.confusionMatrix();
            for (int row_i = 0; row_i < cmMatrix.length; row_i++) {
                for (int col_i = 0; col_i < cmMatrix.length; col_i++) {
                    System.out.print(cmMatrix[row_i][col_i]);
                    System.out.print("|");
                }
                System.out.println();
            }
        }
    }
And, for reference, the crossValidationSplit method:
    public static Instances[][] crossValidationSplit(Instances data, int numberOfFolds) {
        Instances[][] split = new Instances[2][numberOfFolds];
        for (int i = 0; i < numberOfFolds; i++) {
            split[0][i] = data.trainCV(numberOfFolds, i);
            split[1][i] = data.testCV(numberOfFolds, i);
        }
        return split;
    }
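To make the splitting step concrete, here is a minimal sketch in plain Java (no Weka) of how k-fold cross-validation partitions a data set by index. It is only an illustration of the idea behind trainCV/testCV, not Weka's implementation; the class name FoldSketch, its methods, and the sizes used in main are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: illustrates k-fold partitioning by index only.
class FoldSketch {

    // Indices of the test part for fold `fold` when n items are cut into
    // k contiguous folds; the first (n % k) folds get one extra item.
    static List<Integer> testIndices(int n, int k, int fold) {
        int base = n / k;
        int extra = n % k;
        int size = base + (fold < extra ? 1 : 0);
        int start = fold * base + Math.min(fold, extra);
        List<Integer> out = new ArrayList<>();
        for (int i = start; i < start + size; i++) {
            out.add(i);
        }
        return out;
    }

    // Everything that is not in the test part of this fold is training data.
    static List<Integer> trainIndices(int n, int k, int fold) {
        List<Integer> test = testIndices(n, k, fold);
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (!test.contains(i)) {
                out.add(i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int n = 23; // assumed number of instances, for illustration only
        int k = 10; // number of folds, as in the question
        for (int fold = 0; fold < k; fold++) {
            System.out.println("fold " + fold
                    + " test=" + testIndices(n, k, fold)
                    + " trainSize=" + trainIndices(n, k, fold).size());
        }
    }
}
```

Each instance ends up in exactly one test fold and in the training part of the other k - 1 folds, which is why the loop above produces k separate evaluations.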
In the end I get 10 different results, one per fold (k = 10). Here is one of them:
    Correctly Classified Instances           4               36.3636 %
    Incorrectly Classified Instances         7               63.6364 %
    Kappa statistic                          0.0723
    Mean absolute error                      0.427
    Root mean squared error                  0.5922
    Relative absolute error                 93.4946 %
    Root relative squared error            116.5458 %
    Total Number of Instances               11

    2.0|0.0|1.0|
    1.0|1.0|2.0|
    3.0|0.0|1.0|
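As a reading aid for the fold output above, here is a small plain-Java sketch that recomputes accuracy, the kappa statistic, and per-class recall and precision from the printed confusion matrix. It assumes the usual Weka layout in which rows are actual classes and columns are predicted classes; the class name ConfusionMetrics and its methods are hypothetical, and classes are referred to by index because the class order depends on the ARFF header, which is not shown.

```java
import java.util.Locale;

// Hypothetical helper: derives summary numbers from a square confusion matrix
// (rows = actual class, columns = predicted class).
class ConfusionMetrics {

    // Sum of all cells, i.e. the number of evaluated instances.
    static double total(double[][] m) {
        double sum = 0;
        for (double[] row : m) {
            for (double cell : row) {
                sum += cell;
            }
        }
        return sum;
    }

    // Fraction of instances on the diagonal (correctly classified).
    static double accuracy(double[][] m) {
        double diag = 0;
        for (int i = 0; i < m.length; i++) {
            diag += m[i][i];
        }
        return diag / total(m);
    }

    // Cohen's kappa: agreement beyond what the row/column totals give by chance.
    static double kappa(double[][] m) {
        double n = total(m);
        double expected = 0;
        for (int c = 0; c < m.length; c++) {
            double rowSum = 0;
            double colSum = 0;
            for (int j = 0; j < m.length; j++) {
                rowSum += m[c][j];
                colSum += m[j][c];
            }
            expected += (rowSum / n) * (colSum / n);
        }
        return (accuracy(m) - expected) / (1 - expected);
    }

    // Of the instances that really belong to class c, how many were found.
    static double recall(double[][] m, int c) {
        double rowSum = 0;
        for (int j = 0; j < m.length; j++) {
            rowSum += m[c][j];
        }
        return rowSum == 0 ? 0 : m[c][c] / rowSum;
    }

    // Of the instances predicted as class c, how many really belong to it.
    static double precision(double[][] m, int c) {
        double colSum = 0;
        for (int j = 0; j < m.length; j++) {
            colSum += m[j][c];
        }
        return colSum == 0 ? 0 : m[c][c] / colSum;
    }

    public static void main(String[] args) {
        // Confusion matrix of the fold shown above.
        double[][] m = { {2, 0, 1}, {1, 1, 2}, {3, 0, 1} };
        System.out.println(String.format(Locale.ROOT, "accuracy = %.4f %%", 100 * accuracy(m)));
        System.out.println(String.format(Locale.ROOT, "kappa    = %.4f", kappa(m)));
        for (int c = 0; c < m.length; c++) {
            System.out.println(String.format(Locale.ROOT,
                    "class %d: recall = %.2f, precision = %.2f", c, recall(m, c), precision(m, c)));
        }
    }
}
```

For this matrix the sketch gives an accuracy of 36.3636 % and a kappa of 0.0723, the same values as in the summary. A kappa this close to 0 means the fold's predictions are barely better than chance agreement.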
So how should I interpret these results? Do you think I am handling the training and test sets correctly? What I ultimately want is the percentage of positive, neutral and negative tweets in a given text file. How can I derive that from these results? Thanks for reading.
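To illustrate the percentages asked about here, the following plain-Java sketch counts predicted class indices and turns them into a share per label. It assumes the predictions are already available as integer class indices (in Weka, classifyInstance returns the class index as a double, which could be cast); the class name SentimentShare, the label order, and the sample predictions in main are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: share of each sentiment label among predicted classes.
class SentimentShare {

    // labels[i] is the name of class index i; predicted holds one class index per tweet.
    static Map<String, Double> percentages(String[] labels, int[] predicted) {
        int[] counts = new int[labels.length];
        for (int p : predicted) {
            counts[p]++;
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++) {
            double share = predicted.length == 0 ? 0.0 : 100.0 * counts[i] / predicted.length;
            out.put(labels[i], share);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed label order; the real order is the one declared in the ARFF header.
        String[] labels = {"positive", "neutral", "negative"};
        // Made-up predictions for eight tweets, for illustration only.
        int[] predicted = {0, 0, 2, 1, 0, 2, 2, 0};
        for (Map.Entry<String, Double> e : percentages(labels, predicted).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue() + " %");
        }
    }
}
```

Percentages obtained this way are only as reliable as the classifier behind them, so with an accuracy near 36 % on three classes, as in the fold shown, they should be read with caution.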