How can I remove stop words from a large collection of files more efficiently?

I have 200,000 files to process, and I have to extract the tokens from each one. All the files together are about 1.5 GB in size. The code I wrote to extract tokens from each file works well: the entire run takes about 10 minutes.

After that I tried to remove stop words, and performance dropped a lot: it now takes 25 to 30 minutes.

I use the stop words from the website linked here; there are about 571 of them. The general procedure is to read the stop words from a text file and compare each of them against every token in the file.

Here is a stub of the code:

    StringBuilder sb = new StringBuilder();
    boolean flag = false;
    for (String s : tokens) {
        Scanner sc = new Scanner(new File("stopwords.txt"));
        while (sc.hasNext()) {
            if (sc.next().equals(s)) {
                flag = true;
                break;
            }
        }
        sc.close();
        if (!flag)                 // keep the token only if it is NOT a stop word
            sb.append(s + "\n");
        flag = false;
    }
    String str = sb.toString();

(Please ignore minor errors; this is just a sketch.)

The code above performs at least 10 times worse than the code below; it takes 50 to 60 minutes.

    StringBuilder sb = new StringBuilder();
    String s = tokens.toString();
    String str = s.replaceAll("StopWord1|Stopword2|Stopword3|........|LastStopWord", " ");

Its performance is much better: it takes 20 to 25 minutes.

Is there a better procedure?

2 answers

Of course that is slow. You are doing O(n^2) comparisons: every token is checked against every stop word, and the stop-word file is re-read for each token. You need to rethink the algorithm.

Read all the stop words into a Set once, then just check set.contains(word) for each token. This will greatly improve your performance.
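A minimal sketch of that approach, assuming the stop words live in stopwords.txt and the tokens are already in a list (class and variable names here are just illustrative):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.StringJoiner;

    public class StopWordFilter {
        public static void main(String[] args) throws IOException {
            // Load the ~571 stop words once into a HashSet: O(1) lookup per token.
            Set<String> stopWords = new HashSet<>(Files.readAllLines(Paths.get("stopwords.txt")));

            // In practice these tokens come from your own tokenizer.
            List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");

            StringJoiner kept = new StringJoiner("\n");
            for (String token : tokens) {
                if (!stopWords.contains(token)) {   // keep only non-stop words
                    kept.add(token);
                }
            }
            String result = kept.toString();        // "quick\nbrown\nfox"
            System.out.println(result);
        }
    }

The stop-word file is read only once, and each contains check is constant time, so the total cost is roughly proportional to the number of tokens rather than tokens times stop words.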


You should consider using the Apache Lucene API.

It provides functionality for indexing files, removing stop words, stemming tokens, searching, and computing document similarity based on LSA.
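As a rough illustration of how stop-word removal looks with Lucene's analysis API (only a sketch: constructor signatures vary a little between Lucene versions, and the stop-word list below is a placeholder for your own 571 words):

    import java.io.StringReader;
    import java.util.Arrays;

    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class LuceneStopWordDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder stop-word list; load your own words from stopwords.txt instead.
            CharArraySet stopWords = new CharArraySet(Arrays.asList("the", "is", "a"), true);

            // StandardAnalyzer tokenizes, lowercases, and drops the supplied stop words.
            try (StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);
                 TokenStream stream = analyzer.tokenStream("content",
                         new StringReader("The quick brown fox is a mammal"))) {
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    System.out.println(term.toString());   // quick, brown, fox, mammal
                }
                stream.end();
            }
        }
    }

The same analyzer can then feed an IndexWriter if you also want the files indexed, so tokenization and stop-word removal happen in a single pass.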


Source: https://habr.com/ru/post/980244/

