I have 200,000 files to process, and I have to retrieve the tokens from each file. All the files together are about 1.5 GB in size. The code I wrote to extract tokens works well; the entire run takes about 10 minutes.
After that I tried to remove stop words, and performance dropped a lot: it now takes 25 to 30 minutes.
I use a stop word list taken from a website; there are about 571 stop words. The general procedure is to read each stop word from a text file and compare it with every token in the file.
This is the stub code:
StringBuilder sb = new StringBuilder();
boolean flag = false;
for (String s : tokens) {
    // stopwords.txt is re-opened and re-scanned for every single token
    Scanner sc = new Scanner(new File("stopwords.txt"));
    while (sc.hasNext()) {
        if (sc.next().equals(s)) {
            flag = true; // s is a stop word
            break;
        }
    }
    sc.close();
    if (!flag) {
        sb.append(s).append("\n"); // keep only the tokens that are not stop words
    }
    flag = false;
}
String str = sb.toString();
(Ignore minor errors; this is just a sketch of the approach.)
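The dominant cost here is that stopwords.txt is re-opened and re-scanned from disk for every single token. A standard way to avoid that is to load the stop words into a HashSet once, up front, and then test each token with an O(1) lookup. A minimal sketch, assuming tokens is an Iterable<String> (variable names are illustrative):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

// Load the ~571 stop words once, before processing any files.
// (new Scanner(File) throws FileNotFoundException; handle or declare it.)
Set<String> stopWords = new HashSet<>();
try (Scanner sc = new Scanner(new File("stopwords.txt"))) {
    while (sc.hasNext()) {
        stopWords.add(sc.next());
    }
}

// Per file: keep only the tokens that are not stop words.
StringBuilder sb = new StringBuilder();
for (String s : tokens) {
    if (!stopWords.contains(s)) { // O(1) hash lookup instead of a file scan
        sb.append(s).append('\n');
    }
}
String str = sb.toString();

With 571 stop words, this turns up to 571 string comparisons plus file I/O per token into a single hash lookup per token.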
The original Scanner-based code performs at least ten times worse than the replaceAll version below: it takes 50 to 60 minutes.
StringBuilder sb = new StringBuilder();
String s = tokens.toString();
String str = s.replaceAll("StopWord1|Stopword2|Stopword3|........|LastStopWord", " ");
Performance here is much better: it takes 20 to 25 minutes.
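One caveat about this version: String.replaceAll recompiles its regex on every call, so across 200,000 files the same 571-alternative pattern gets compiled 200,000 times. Compiling it once with java.util.regex.Pattern and reusing it should shave that overhead off. A minimal sketch (the dots stand for the elided stop words, as above; the \b word boundaries are my assumption, since without them stop words would also be deleted inside longer words):

import java.util.regex.Pattern;

// Compile the alternation of all stop words once, up front.
static final Pattern STOP_WORDS =
        Pattern.compile("\\b(?:StopWord1|Stopword2|Stopword3|........|LastStopWord)\\b");

// Per file: a single pass over the token string.
String str = STOP_WORDS.matcher(tokens.toString()).replaceAll(" ");

If the stop words are already loaded into a collection, the pattern can also be built from it, e.g. Pattern.compile("\\b(?:" + String.join("|", stopWords) + ")\\b").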
Is there a better procedure?