Find the most common words on a web page (using Jsoup)?

Question

For my project, I have to find the most common words in a Wikipedia article. I found Jsoup for parsing the HTML, but that still leaves the problem of word frequency. Is there a function in Jsoup that counts word frequencies, or some other way to find which words occur most often on a web page using Jsoup?

Thanks.

java html jsoup webpage word-frequency

Adem Apr 4 '15 at 14:27

1 answer

Jonascz · Accepted Answer · 2015-04-04T14:53:22+0000

Yes, you can use Jsoup to get text from a web page, for example:

    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    String text = doc.body().text();

Then you need to count the words and find out which ones are the most frequent. An existing word-counting snippet looks promising; it only needs to be changed to use the String output from Jsoup, something like this:

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.*;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupWordCount {

        public static void main(String[] args) throws IOException {
            long time = System.currentTimeMillis();

            Map<String, Word> countMap = new HashMap<String, Word>();

            // Connect to Wikipedia and get the HTML
            System.out.println("Downloading page...");
            Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

            // Get the actual text from the page, excluding the HTML
            String text = doc.body().text();

            System.out.println("Analyzing text...");

            // Create a BufferedReader so the words can be counted.
            // Decode with the same charset that was used to encode the bytes (UTF-8).
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                    StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                // Split on anything that is not an ASCII letter or one of Å Ä Ö å ä ö
                String[] words = line.split("[^A-ZÅÄÖa-zåäö]+");
                for (String word : words) {
                    if ("".equals(word)) {
                        continue;
                    }

                    Word wordObj = countMap.get(word);
                    if (wordObj == null) {
                        wordObj = new Word();
                        wordObj.word = word;
                        wordObj.count = 0;
                        countMap.put(word, wordObj);
                    }

                    wordObj.count++;
                }
            }

            reader.close();

            // Sorted by descending count (see Word.compareTo)
            SortedSet<Word> sortedWords = new TreeSet<Word>(countMap.values());
            int i = 0;
            int maxWordsToDisplay = 10; // number of words to show the frequency for

            String[] wordsToIgnore = {"the", "and", "a"};

            for (Word word : sortedWords) {
                if (i >= maxWordsToDisplay) {
                    break;
                }

                if (Arrays.asList(wordsToIgnore).contains(word.word)) {
                    // Skip the ignored word and extend the limit by one to compensate
                    i++;
                    maxWordsToDisplay++;
                } else {
                    System.out.println(word.count + "\t" + word.word);
                    i++;
                }
            }

            time = System.currentTimeMillis() - time;

            System.out.println("Finished in " + time + " ms");
        }

        public static class Word implements Comparable<Word> {
            String word;
            int count;

            @Override
            public int hashCode() {
                return word.hashCode();
            }

            @Override
            public boolean equals(Object obj) {
                return obj instanceof Word && word.equals(((Word) obj).word);
            }

            @Override
            public int compareTo(Word b) {
                // Higher counts first. Break ties by the word itself; otherwise the
                // TreeSet treats words with equal counts as duplicates and drops them.
                int byCount = Integer.compare(b.count, count);
                return byCount != 0 ? byCount : word.compareTo(b.word);
            }
        }
    }

Output:

    Downloading page...
    Analyzing text...
    42	of
    24	in
    20	Wikipedia
    19	to
    16	is
    11	that
    10	The
    9	was
    8	articles
    7	featured
    Finished in 3300 ms

Some notes:

This code can skip some words, such as "the", "and", and "a"; adjust the wordsToIgnore array to suit your needs.
There sometimes seem to be problems with Unicode characters. I have not run into this myself, but someone in the comments did.
This could be done better and with less code.
Not verified.
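
As a rough illustration of the "less code" note, here is a minimal sketch that counts words with Java streams instead of a BufferedReader and a TreeSet. It is not the code from the answer above: the class name WordFrequency and the method topWords are made up for this example, it assumes Java 9 or later (for Set.of), and it assumes the input string is whatever doc.body().text() returned. It also splits on \p{L} (any Unicode letter) rather than a fixed character list, and lower-cases words so that "The" and "the" are counted together, which may or may not be what you want.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordFrequency {

    // Returns up to `limit` (word, count) pairs, most frequent first.
    // Ties are broken alphabetically so the order is deterministic.
    static List<Map.Entry<String, Long>> topWords(String text, int limit, Set<String> ignore) {
        Map<String, Long> counts = Arrays.stream(text.split("[^\\p{L}]+")) // split on non-letters
                .filter(w -> !w.isEmpty())
                .map(w -> w.toLowerCase(Locale.ROOT))
                .filter(w -> !ignore.contains(w))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder text; in the real program this would be doc.body().text().
        String text = "The cat and the hat. The cat sat.";
        for (Map.Entry<String, Long> e : topWords(text, 3, Set.of("and"))) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

To use it with Jsoup, pass the page text in place of the placeholder string, for example topWords(doc.body().text(), 10, Set.of("the", "and", "a")).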

Enjoy it!
