Naive Bayes text classification algorithm

Hey there! I need help implementing a naive Bayes text classification algorithm in Java, just to check my dataset for research purposes. It is mandatory to implement the algorithm in Java, rather than just using Weka or RapidMiner tools to get the results!


My dataset has the following layout:

  Doc   Words   Category

meaning that for each training document I have its words (a String) and its category, which is known in advance. Part of the dataset is listed below:

  Training
  1  Integration Communities Process Oriented Structures...(more string)   A
  2  Integration Communities Process Oriented Structures...(more string)   A
  3  Theory Upper Bound Routing Estimate global routing...(more string)    B
  4  Hardware Design Functional Programming Perfect Match...(more string)  C
  ...

  Test
  5  Methodology Toolkit Integrate Technological Organisational
  6  This test contain string naive bayes test text text test

The dataset comes from a MySQL database and may contain many training rows and test rows. The point is that I need to implement the naive Bayes text classification algorithm in Java.

The algorithm should follow the worked example in Table 13.1 of the source I am reading.
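
For concreteness, the core computation I mean is roughly the following. This is only my own rough sketch of the multinomial model with add-one (Laplace) smoothing; the class and method names are placeholders, not taken from any library:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class NaiveBayesSketch {
        private final Map<String, Integer> docsPerCategory = new HashMap<>();
        private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
        private final Map<String, Integer> wordsPerCategory = new HashMap<>();
        private final Set<String> vocabulary = new HashSet<>();
        private int totalDocs = 0;

        // Count words per category from one labelled training document.
        public void train(String category, String text) {
            totalDocs++;
            docsPerCategory.merge(category, 1, Integer::sum);
            Map<String, Integer> counts =
                    wordCounts.computeIfAbsent(category, c -> new HashMap<>());
            for (String w : text.toLowerCase().split("\\s+")) {
                counts.merge(w, 1, Integer::sum);
                wordsPerCategory.merge(category, 1, Integer::sum);
                vocabulary.add(w);
            }
        }

        // Return the category with the highest log posterior for a test document.
        public String classify(String text) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String c : docsPerCategory.keySet()) {
                double score = Math.log(docsPerCategory.get(c) / (double) totalDocs); // log prior
                for (String w : text.toLowerCase().split("\\s+")) {
                    int count = wordCounts.get(c).getOrDefault(w, 0);
                    // conditional probability with add-one smoothing
                    score += Math.log((count + 1.0) / (wordsPerCategory.get(c) + vocabulary.size()));
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = c;
                }
            }
            return best;
        }
    }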


The fact is, I can implement the algorithm in Java myself, but I want to know whether there is a Java library, with source code and documentation, that would let me simply verify my results.

I only need the results in one go, purely to check them.

So, to get to the point: can anyone recommend a good Java library that implements this algorithm and can process my dataset, or suggest a straightforward way to do this?

I will be grateful for your help. Thanks in advance.

+5
8 answers

For your requirement, you can use Apache Spark's MLlib. MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities. The documentation also includes a Java code template for the algorithm. So, for starters, you can:

Implement the Java skeleton for naive Bayes presented on that site, shown below.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.mllib.classification.NaiveBayes;
    import org.apache.spark.mllib.classification.NaiveBayesModel;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import scala.Tuple2;

    JavaRDD<LabeledPoint> training = ... // training set
    JavaRDD<LabeledPoint> test = ...     // test set

    // Train a naive Bayes model with Laplace smoothing (lambda = 1.0).
    final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

    // Pair each test point's prediction with its true label.
    JavaPairRDD<Double, Double> predictionAndLabel =
            test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
                @Override
                public Tuple2<Double, Double> call(LabeledPoint p) {
                    return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
                }
            });

    // Accuracy = fraction of test points whose prediction matches the true label.
    double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
        @Override
        public Boolean call(Tuple2<Double, Double> pl) {
            return pl._1().equals(pl._2());
        }
    }).count() / (double) test.count();

For loading and testing your datasets, there is no better fit here than Spark SQL, and MLlib integrates cleanly with the Spark API. To start, I would recommend first going through the MLlib API and adapting the algorithm to your needs; it is quite simple with the library. For the next step, processing your datasets, just use Spark SQL. I recommend you stick with this: I hunted through a lot of options before settling on this convenient library, and it interoperates smoothly with other technologies. I would post the full code here to match your question exactly, but I think you are good to go.
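
To get from raw document strings to the LabeledPoint RDDs the snippet above expects, one option is MLlib's HashingTF. Here is a rough sketch, assuming the training rows were exported (say, from MySQL) as tab-separated "label<TAB>text" lines in training.tsv, with categories already encoded as 0.0, 1.0, ...; the file name and label encoding are my assumptions:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.mllib.feature.HashingTF;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class FeatureSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("naive-bayes").setMaster("local[*]"));

            // Hypothetical input: one document per line, "label<TAB>document text".
            JavaRDD<String> lines = sc.textFile("training.tsv");

            // Hash each document's tokens into a term-frequency vector.
            final HashingTF tf = new HashingTF();
            JavaRDD<LabeledPoint> training = lines.map(new Function<String, LabeledPoint>() {
                @Override
                public LabeledPoint call(String line) {
                    String[] parts = line.split("\t", 2);
                    double label = Double.parseDouble(parts[0]); // category as 0.0, 1.0, ...
                    Vector features = tf.transform(
                            Arrays.asList(parts[1].toLowerCase().split("\\s+")));
                    return new LabeledPoint(label, features);
                }
            });

            // 'training' (and a 'test' RDD built the same way) can now be passed to
            // NaiveBayes.train(training.rdd(), 1.0) as in the snippet above.
            sc.stop();
        }
    }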

+2

You can use the Weka Java API and include it in your project if you do not want to use the graphical interface.

Here is a link to the documentation for including the classifier in your code: https://weka.wikispaces.com/Use+WEKA+in+your+Java+code
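
If you go that route, here is a minimal sketch, assuming two ARFF files (train.arff, test.arff) that each contain a string attribute with the document text and a nominal class attribute as the last attribute; the file names and layout are my assumptions, not something prescribed by Weka:

    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class WekaNaiveBayesSketch {
        public static void main(String[] args) throws Exception {
            // Load training and test data; the class attribute is assumed to be the last one.
            Instances train = DataSource.read("train.arff");
            train.setClassIndex(train.numAttributes() - 1);
            Instances test = DataSource.read("test.arff");
            test.setClassIndex(test.numAttributes() - 1);

            // Convert the raw document strings into word-count features,
            // then train a multinomial naive Bayes classifier on them.
            StringToWordVector bagOfWords = new StringToWordVector();
            bagOfWords.setOutputWordCounts(true);
            FilteredClassifier classifier = new FilteredClassifier();
            classifier.setFilter(bagOfWords);
            classifier.setClassifier(new NaiveBayesMultinomial());
            classifier.buildClassifier(train);

            // Print the predicted category for each test document.
            for (int i = 0; i < test.numInstances(); i++) {
                double predicted = classifier.classifyInstance(test.instance(i));
                System.out.println("Doc " + (i + 1) + " -> "
                        + test.classAttribute().value((int) predicted));
            }
        }
    }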

+1

Please check out the Bow toolkit.

It is GNU-licensed and the source code is available. Among other things, it can:

Set vector word weights according to naive Bayes, TFIDF, and several other methods.

Perform test/train splits and automated classification tests.

It is not a Java library, but you can compile the C code and check that your Java implementation produces results similar to this package.

I also noticed a decent Dr. Dobb's article with an implementation in Perl. Again, not the Java you want, but it will give you the one-off results you are asking for.

0

Hi, I think Spark will help you a lot: http://spark.apache.org/docs/1.2.0/mllib-naive-bayes.html. You can even choose the language that suits your needs best: Java, Python, or Scala!

0

Use scikit-learn from Python. It already has an implementation of what you need:

    class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

scikit-learn

0

You can use an analytics platform such as KNIME; it has many classification algorithms, including naive Bayes. You can run it from the GUI or through its Java API.

0

If you want to implement a naive Bayes text classification algorithm in Java, the Weka Java API is a good solution. The dataset must be in .arff format, and creating an .arff file from MySQL is straightforward (see the sketch below). Here is the Java code for the classifier, along with a link to a sample .arff file.

Create a new text document, open it in Notepad, copy and paste all the text from the link below, and save it as DataSet.arff: http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.arff
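
For the MySQL-to-ARFF step, here is a minimal sketch using plain JDBC. The connection URL, credentials, table name (documents), column names (doc_words, category), and category labels are all assumptions to adapt to your schema; note also that an ARFF with a raw string attribute still needs a StringToWordVector filter before naive Bayes can use it, as in one of the Weka answers above.

    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ArffExport {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection settings and table layout: documents(doc_words, category).
            try (Connection con = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/research", "user", "password");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT doc_words, category FROM documents");
                 PrintWriter out = new PrintWriter("DataSet.arff")) {

                // ARFF header: one string attribute for the text, one nominal class attribute.
                out.println("@relation documents");
                out.println("@attribute text string");
                out.println("@attribute category {A,B,C}"); // adjust to your real category labels
                out.println("@data");

                // One ARFF row per database row; single quotes inside the text are escaped.
                while (rs.next()) {
                    String text = rs.getString("doc_words").replace("'", "\\'");
                    out.println("'" + text + "'," + rs.getString("category"));
                }
            }
        }
    }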

Download Weka Java API: http://www.java2s.com/Code/Jar/w/weka.htm

Code for the classifier:

    import java.io.BufferedReader;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;

    public class NaiveBayesClassifierDemo { // wrapper class added so the snippet compiles; the name is arbitrary

        public static void main(String[] args) {
            try {
                StringBuilder txtAreaShow = new StringBuilder();

                // Read the ARFF file.
                BufferedReader breader = new BufferedReader(new FileReader("DataSet.arff"));
                Instances train = new Instances(breader);
                // The last attribute is the class attribute (e.g. with 40 attributes, index 39 is the class: yes/no).
                train.setClassIndex(train.numAttributes() - 1);
                breader.close();

                // Build the naive Bayes classifier and evaluate it with 10-fold cross-validation.
                NaiveBayes nB = new NaiveBayes();
                nB.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.crossValidateModel(nB, train, 10, new Random(1));

                System.out.println("Run Information\n=====================");
                System.out.println("Scheme: " + train.getClass().getName());
                System.out.println("Relation: ");
                System.out.println("\nClassifier Model(full training set)\n===============================");
                System.out.println(nB);
                System.out.println(eval.toSummaryString("\nSummary Results\n==================", true));
                System.out.println(eval.toClassDetailsString());
                System.out.println(eval.toMatrixString());

                // Collect the same output again, e.g. for a text area in a GUI.
                txtAreaShow.append("\n\n\n");
                txtAreaShow.append("Run Information\n===================\n");
                txtAreaShow.append("Scheme: " + train.getClass().getName());
                txtAreaShow.append("\n\nClassifier Model(full training set)"
                        + "\n======================================\n");
                txtAreaShow.append("" + nB);
                txtAreaShow.append(eval.toSummaryString("\n\nSummary Results\n==================\n", true));
                txtAreaShow.append(eval.toClassDetailsString());
                txtAreaShow.append(eval.toMatrixString());
                txtAreaShow.append("\n\n\n");
                System.out.println(txtAreaShow.toString());
            } catch (FileNotFoundException ex) {
                System.err.println("File not found");
                System.exit(1);
            } catch (IOException ex) {
                System.err.println("Invalid input or output.");
                System.exit(1);
            } catch (Exception ex) {
                System.err.println("Exception occurred!");
                System.exit(1);
            }
        }
    }
0
