Yes, you can process the page differently. The main idea is as follows:
for (String word : page) {
    if (!forbiddenWords.contains(word)) {
        pageResult.append(word);
    }
}
Here forbiddenWords is a set (a HashSet). The loop for (String word : page) is shorthand for splitting the page into a list of words. Remember to put the spaces back between the words (I skipped this for clarity).
The complexity of processing one page in the original version was about 50,000 * 1000 operations (every forbidden word checked against every word on the page); now it is only about 1000, one lookup per word, because checking whether a word is in a HashSet takes constant time on average.
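To make the idea above concrete, here is a minimal runnable sketch. It assumes that words are separated by whitespace and that the words are rejoined with single spaces. The class name PageFilter and the method filterPage are hypothetical names chosen for illustration, not part of the original question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class PageFilter {
    // Hypothetical helper: keeps every word that is not in forbiddenWords.
    // Assumes words are separated by whitespace; punctuation is not handled here.
    static String filterPage(String page, Set<String> forbiddenWords) {
        StringJoiner pageResult = new StringJoiner(" ");
        for (String word : page.trim().split("\\s+")) {
            // One average constant-time HashSet lookup per word
            if (!word.isEmpty() && !forbiddenWords.contains(word)) {
                pageResult.add(word);
            }
        }
        return pageResult.toString();
    }

    public static void main(String[] args) {
        Set<String> forbiddenWords = new HashSet<>(Arrays.asList("bad", "terrible"));
        // prints "this is a and idea"
        System.out.println(filterPage("this is a bad and terrible idea", forbiddenWords));
    }
}
```

Splitting on whitespace leaves punctuation attached to words ("bad," would not match "bad"), so this sketch only suits text without punctuation.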
Edit:
Since I wanted to distract myself from work for ten minutes, here is the code :)
// Required imports: java.util.Arrays, java.util.HashSet, java.util.Set

String text = "This is a bad word, and this is very bad, terrible word.";
Set<String> forbiddenWords = new HashSet<String>(Arrays.asList("bad", "terrible"));

text += "|"; // mark end of text, so the last word is always followed by a non-letter
boolean readingWord = false;
StringBuilder currentWord = new StringBuilder();
StringBuilder result = new StringBuilder();

for (int pos = 0; pos < text.length(); ++pos) {
    char c = text.charAt(pos);
    if (readingWord) {
        if (Character.isLetter(c)) {
            currentWord.append(c);
        } else {
            // finished reading a word
            readingWord = false;
            // the lookup is case-insensitive; the set holds lowercase words
            if (!forbiddenWords.contains(currentWord.toString().toLowerCase())) {
                result.append(currentWord);
            }
            result.append(c);
        }
    } else {
        if (Character.isLetter(c)) {
            // start reading a new word
            readingWord = true;
            currentWord.setLength(0);
            currentWord.append(c);
        } else {
            // append punctuation marks and spaces to result immediately
            result.append(c);
        }
    }
}

result.setLength(result.length() - 1); // remove end of text mark
System.out.println(result);