WEKA: StringToWordVector results differ between GUI and my Java code

I use these options in both the GUI and my Java code:

-R last -W 1000 -prune-rate -1.0 -C -I -N 0 -S -stemmer weka.core.stemmers.NullStemmer -M 20 -tokenizer weka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!" 

JAVA:

 @relation 'testing2-weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000-prune-rate-1.0-CI-N0-S-stemmerweka.core.stemmers.NullStemmer-M20-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"' 

GUI:

 weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000-prune-rate-1.0-CI-N0-S-stemmerweka.core.stemmers.NullStemmer-M20-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!" 

Unfortunately, I get different results for the same data set: the GUI produces 86 attributes, while the Java code produces about 300.

This is the code I used:

    BufferedReader reader = new BufferedReader(new FileReader("newTemp.arff"));
    Instances dataRaw = new Instances(reader);
    reader.close();

    StringToWordVector filter = new StringToWordVector();
    filter.setInputFormat(dataRaw);
    filter.setAttributeIndices("last");
    filter.setDoNotOperateOnPerClassBasis(false);
    filter.setOutputWordCounts(true);
    filter.setWordsToKeep(1000);
    filter.setUseStoplist(true);
    filter.setIDFTransform(true);
    filter.setMinTermFreq(20);
    filter.setDoNotOperateOnPerClassBasis(false);
    filter.setPeriodicPruning(-1);

    String[] options = filter.getOptions();
    for (int i = 0; i < options.length; i++) {
        if (options[i].length() > 0)
            System.out.println(options[i]);
    }

    Instances dataFiltered = Filter.useFilter(dataRaw, filter);
    System.out.println("\n\n=====> Filtered data:<===\n\n" + dataFiltered.toString());

I do not know what is wrong. Please help me out.
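
One thing worth ruling out is call order. Weka's guidance for using filters from code is that setInputFormat(Instances) should be called after all options are set, whereas the code above calls it first. The following is a minimal sketch of that reordering, not a confirmed fix. The class name OrderedStringToWordVector and the method vectorize are hypothetical names used only for illustration, and the stoplist setter from the code above is left as a comment because its availability depends on the Weka version.

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class OrderedStringToWordVector {

    // Hypothetical helper: configures the filter first, then sets the input format.
    public static Instances vectorize(Instances dataRaw) throws Exception {
        StringToWordVector filter = new StringToWordVector();

        // 1. Set every option before the filter sees the data format.
        filter.setAttributeIndices("last");            // -R last
        filter.setOutputWordCounts(true);              // -C
        filter.setWordsToKeep(1000);                   // -W 1000
        filter.setIDFTransform(true);                  // -I
        filter.setMinTermFreq(20);                     // -M 20
        filter.setDoNotOperateOnPerClassBasis(false);  // no -O flag
        filter.setPeriodicPruning(-1.0);               // -prune-rate -1.0
        // filter.setUseStoplist(true);                // -S, as in the code above (version dependent)

        // 2. Only now pass the input format, as the last call before filtering.
        filter.setInputFormat(dataRaw);

        // 3. Apply the configured filter to the whole data set.
        return Filter.useFilter(dataRaw, filter);
    }
}
```

Calling OrderedStringToWordVector.vectorize(dataRaw) and comparing dataFiltered.numAttributes() with the GUI's count would show whether the ordering accounts for the difference.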


Source: https://habr.com/ru/post/1343421/

