Can I get better performance for this cycle?

I read a book and delete a few words from it. My problem is that the process takes a lot of time and I want to improve its performance (less time), for example:

Vector<String> pages = new Vector<String>(); // Contains about 1500 page, each page has about 1000 words. Vector<String> wordsToDelete = new Vector<String>(); // Contains about 50000 words. for( String page: pages ) { String pageInLowCase = page.toLowerCase(); for( String wordToDelete: wordsToDelete ) { if( pageInLowCase.contains( wordToDelete ) ) page = page.replaceAll( "(?i)\\b" + wordToDelete + "\\b" , "" ); } // Do some staff with the final page that does not take much time. } 

This code takes about 3 minutes. If I skipped the replaceAll (...) loop, I can save more than 2 minutes. So, is there a way to do the same cycle with higher performance?

+4
source share
5 answers

Yes, you can handle the page differently. The main idea is as follows

 for (String word : page) { if (!forbiddenWords.contains(word)) { pageResult.append(word); } } 

Here forbiddenWords are many. In addition, for (String word : page) is a shorthand for parsing a page into a list of words. Remember to add extra spaces (I skipped this for clarity).

The complexity of processing one page in the original version was ~ 50,000 * 1000, and now it is only ~ 1000. (checking if the word is in the HashSet takes a constant time)

change
Since I wanted to distract myself from work for ten minutes, here is the code :)

  String text = "This is a bad word, and this is very bad, terrible word."; Set<String> forbiddenWords = new HashSet<String>(Arrays.asList("bad", "terrible")); text += "|"; // mark end of text boolean readingWord = false; StringBuilder currentWord = new StringBuilder(); StringBuilder result = new StringBuilder(); for (int pos = 0; pos < text.length(); ++pos) { char c = text.charAt(pos); if (readingWord) { if (Character.isLetter(c)) { currentWord.append(c); } else { // finished reading a word readingWord = false; if (!forbiddenWords.contains(currentWord.toString().toLowerCase())) { result.append(currentWord); } result.append(c); } } else { if (Character.isLetter(c)) { // start reading a new word readingWord = true; currentWord.setLength(0); currentWord.append(c); } else { // append punctuation marks and spaces to result immediately result.append(c); } } } result.setLength(result.length() - 1); // remove end of text mark System.out.println(result); 
+12
source

To get started, you can get rid of the contains(..) check. It adds unnecessary overhead. And it will return the truth sometimes when it is not. For example, it will return true for "not", even if the page has only a "node".

Another thing is to replace Vector with ArrayList .

And as Conrad pointed out in his comment, you are not changing vectors. String is immutable, so you are not changing objects. You will need to use set(..) (and maintain the iteration index).

+5
source

The problem is that you have a double loop. This tends to be poor performance and equates to x * y performance. In addition, since the rows cannot be changed every time you call toLowerCase and then replaceAll, you create a new row. Thus, you create x * y the number of lines containing the whole page for each word in your list. This can be avoided by using the MULTI_LINE and CASE_INSENSITIVE options in the regular expression.

You can reduce it to one cycle and use the regular expression to replace all words at once.

  StringBuffer buffer = new StringBuffer(); for (String word : wordsToDelete) { if (buffer.length() != 0) { buffer.append("|"); } buffer.append("(\\b"); buffer.append(word); buffer.append("\\b)"); } Pattern pattern = Pattern.compile(buffer.toString(), Pattern.CASE_INSENSITIVE | Pattern.MULTILINE); List<String> newPageList = new ArrayList<String>(); for (String page : pages) { Matcher matcher = pattern.matcher(page); String newPage = matcher.replaceAll(""); newPageList.add(newPage); } 
+1
source

Use java.lang.StringBuilder - it was created specifically for modified text.

 StringBuilder builder = new StringBuilder(page); for (String word: wordsToDelete) { int position = 0; int newpos = 0; while ((newpos = builder.indexOf(word, position))>=0) { builder.delete(position, position+word.length()); position = newpos; } } 

It’s just an idea - it doesn’t check word boundaries.

0
source

Assuming the pages are independent, and if you have many cores and you have many pages to process, this loop can be parallelized, as well as:

 final ArrayList<String> pages = ...; final Set<String> wordsToDelete = ...; final ExecutorService pageFrobber = Executors.newFixedThreadPool(8); //pick suitable size final List<Callable<String>> toFrobPages = new ArrayList<Callable<String>>(pages.size()); for( final String page: pages ) { toFrobPages.add(new Callable<String>() { String call() { return page.toLowerCase().replaceAll( "(?i)\\b" + wordToDelete + "\\b" , "" ); } }); } final List<Future<String>> frobbedPages = pageFrobber.executeAll(toFrobPages); // the above will block until all pages are processed // frobbedPages will contain a set of Future<String> which can be converted to strings // by calling get() 
0
source

Source: https://habr.com/ru/post/1333415/


All Articles