Yes, you can use Jsoup to get text from a web page, for example:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
String text = doc.body().text();
Then you need to count the words and find the most frequent ones. This code looks promising; it just needs to be adapted to take the String output from Jsoup, something like this:
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupWordCount {

    public static void main(String[] args) throws IOException {
        long time = System.currentTimeMillis();
        Map<String, Word> countMap = new HashMap<String, Word>();
Output:
Downloading page...
Analyzing text...
42 of
24 in
20 Wikipedia
19 to
16 is
11 that
10 The
9 was
8 articles
7 featured
Finished in 3300 ms
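The counting step on its own can be sketched with nothing but the standard library. This is not the code from the listing above, just a minimal illustration: the class and method names (`WordCountSketch`, `countWords`, `topWords`) are made up here, and the hard-coded `text` stands in for the String you would get from `doc.body().text()`. It splits on whitespace only, so punctuation stays attached to words and case is preserved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class WordCountSketch {

    // Count how often each whitespace-separated token occurs in the text.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Return the n most frequent entries, highest count first.
    // Ties are broken alphabetically so the order is deterministic.
    static List<Map.Entry<String, Integer>> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by doc.body().text()
        String text = "the cat and the dog and the bird";
        for (Map.Entry<String, Integer> e : topWords(countWords(text), 3)) {
            System.out.println(e.getValue() + " " + e.getKey());
        }
        // prints:
        // 3 the
        // 2 and
        // 1 bird
    }
}
```

`Map.merge` keeps the loop short: it inserts 1 for a new word and adds 1 to the existing count otherwise.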
Some notes:
This code may ignore some words, such as "the", "and", "a", etc.; you will have to configure which ones.
There seem to be occasional problems with Unicode characters. I have not run into this myself, but someone in the comments did.
This could be done better and with less code.
Not verified.
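To make the notes above concrete, here is one possible shorter variant using streams. It is a sketch under a few assumptions, not a drop-in replacement: `StopWordCountSketch`, `STOP_WORDS`, and `topWords` are names invented for this example, and the stop-word list is only a sample. It splits on runs of non-letters with `\P{L}`, which matches by Unicode category rather than ASCII range, so accented words stay in one piece as long as the text uses precomposed characters. Digits and apostrophes are treated as separators, so "don't" becomes "don" and "t".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class StopWordCountSketch {

    // Sample stop-word list (an assumption); extend it to suit your pages.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "and", "a", "of", "in", "to", "is"));

    // Split on runs of non-letters, lower-case each token, drop stop words,
    // and return the n most frequent words formatted as "count word".
    static List<String> topWords(String text, int n) {
        Map<String, Long> counts = Arrays.stream(text.split("\\P{L}+"))
            .filter(token -> !token.isEmpty())
            .map(token -> token.toLowerCase(Locale.ROOT))
            .filter(token -> !STOP_WORDS.contains(token))
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return counts.entrySet().stream()
            // Highest count first; ties are broken alphabetically.
            .sorted((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            })
            .limit(n)
            .map(e -> e.getValue() + " " + e.getKey())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by doc.body().text().
        // \u00e9 and \u00fc are Unicode escapes for accented letters.
        String text = "The caf\u00e9 is near the Caf\u00e9, and the museum is in Z\u00fcrich.";
        topWords(text, 3).forEach(System.out::println);
    }
}
```

Lower-casing before counting merges "The" and "the", which the output shown earlier counts separately. Whether that is what you want depends on the page.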
Enjoy it!