I use the same options in both the GUI and Java:
-R last -W 1000 -prune-rate -1.0 -C -I -N 0 -S -stemmer weka.core.stemmers.NullStemmer -M 20 -tokenizer weka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
Relation name produced by the Java code:
@relation 'testing2-weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000-prune-rate-1.0-CI-N0-S-stemmerweka.core.stemmers.NullStemmer-M20-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'
Relation name produced by the GUI:
weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000-prune-rate-1.0-CI-N0-S-stemmerweka.core.stemmers.NullStemmer-M20-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
Unfortunately, I get different results for the same data set: the GUI produces 86 attributes, while the Java code produces about 300.
This is the code I used:
BufferedReader reader = new BufferedReader(new FileReader("newTemp.arff"));
Instances dataRaw = new Instances(reader);
reader.close();

StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(dataRaw);
filter.setAttributeIndices("last");
filter.setDoNotOperateOnPerClassBasis(false);
filter.setOutputWordCounts(true);
filter.setWordsToKeep(1000);
filter.setUseStoplist(true);
filter.setIDFTransform(true);
filter.setMinTermFreq(20);
filter.setDoNotOperateOnPerClassBasis(false);
filter.setPeriodicPruning(-1);

String[] options = filter.getOptions();
for (int i = 0; i < options.length; i++) {
    if (options[i].length() > 0)
        System.out.println(options[i]);
}

Instances dataFiltered = Filter.useFilter(dataRaw, filter);
System.out.println("\n\n=====> Filtered data:<===\n\n" + dataFiltered.toString());
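To rule out a simple mismatch in the options themselves, the option list printed by the code can be compared token by token with the option string used in the GUI. The following is only a minimal sketch in plain Java with no Weka dependency. `OptionDiff`, `tokenize`, and `onlyIn` are hypothetical helper names, and the quote handling is a simplification of my own, not Weka's option parser. It compares tokens as a set, so it will not notice a value attached to the wrong flag.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Weka): crude diff of two option strings.
final class OptionDiff {

    // Splits on whitespace, keeping a double-quoted segment as one token.
    // Inside quotes, a backslash and the character after it are copied as-is.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        boolean hasToken = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inQuotes && c == '\\' && i + 1 < s.length()) {
                cur.append(c).append(s.charAt(++i));
            } else if (c == '"') {
                inQuotes = !inQuotes;
                hasToken = true;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                if (hasToken) {
                    tokens.add(cur.toString());
                    cur.setLength(0);
                    hasToken = false;
                }
            } else {
                cur.append(c);
                hasToken = true;
            }
        }
        if (hasToken) {
            tokens.add(cur.toString());
        }
        return tokens;
    }

    // Returns the tokens that occur in a but not in b, in their original order.
    static Set<String> onlyIn(List<String> a, List<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Example inputs: the first string is abbreviated from the options above;
        // the second is a made-up variant that drops -C to show a mismatch.
        List<String> fromGui = tokenize("-R last -W 1000 -C -I -M 20");
        List<String> fromCode = tokenize("-R last -W 1000 -I -M 20");
        System.out.println("Only in GUI options:  " + onlyIn(fromGui, fromCode));
        System.out.println("Only in Java options: " + onlyIn(fromCode, fromGui));
    }
}
```

If both sets come back empty for the real option strings, the options are the same and the difference has to come from somewhere else, for example the order in which the filter is configured and applied.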
I do not know what is wrong. Please help.