It is very simple, and the process is very similar for each semi-structured page from which you extract information.
First, you need to uniquely identify the DOM element where the required information is located. The easiest way to do this is to use a web development tool like Firebug in Firefox, or the ones that come bundled with IE (> 6, I think) and Chrome.
Using the sample Potato article, you will find that the <p> paragraph you are interested in is in the following block:
<div class="mw-content-ltr" lang="en" dir="ltr">
  <div class="metadata topicon" id="protected-icon" style="display: none; right: 55px;">[...]</div>
  <div class="dablink">[...]</div>
  <div class="dablink">[...]</div>
  <div>[...]</div>
  <p>The potato [...]</p>
  <p>[...]</p>
  <p>[...]</p>
In other words, you want to find the first <p> element that is inside the div with a class called mw-content-ltr.
Then you just need to select this element with jsoup using its selector syntax (which is very similar to jQuery), for example:
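import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WikipediaParser {

    private final String baseUrl;

    public WikipediaParser(String lang) {
        // Build the base URL for the chosen language edition, e.g. "en".
        this.baseUrl = String.format("http://%s.wikipedia.org/wiki/", lang);
    }

    public String fetchFirstParagraph(String article) throws IOException {
        String url = baseUrl + article;
        // Fetch the page and parse it into a DOM.
        Document doc = Jsoup.connect(url).get();
        // All <p> elements inside the element with class "mw-content-ltr".
        Elements paragraphs = doc.select(".mw-content-ltr p");
        // first() returns null when nothing matched the selector.
        Element firstParagraph = paragraphs.first();
        return firstParagraph.text();
    }

    public static void main(String[] args) throws IOException {
        WikipediaParser parser = new WikipediaParser("en");
        String firstParagraph = parser.fetchFirstParagraph("Potato");
        System.out.println(firstParagraph);
    }
}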
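If you want to try the selector without hitting the network, here is a minimal sketch that runs it against an inline HTML string. The class name SelectorSketch, the helper firstParagraph, and the sample markup are all made up for illustration; only Jsoup.parse, select, first, and text are actual jsoup calls. It also uses the child combinator ("> p") as one possible way to skip a <p> nested inside one of the leading div blocks, and returns null when nothing matches, since first() returns null on an empty result.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSketch {

    // Hypothetical helper: returns the text of the first <p> that is a
    // direct child of the element with class "mw-content-ltr", or null
    // if nothing matches.
    static String firstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        // "> p" limits the match to direct children, so a <p> nested in
        // one of the leading <div> blocks is skipped.
        Element first = doc.select(".mw-content-ltr > p").first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        // Assumed, simplified markup modelled on the Wikipedia block.
        String html =
            "<div class=\"mw-content-ltr\" lang=\"en\" dir=\"ltr\">"
          + "<div class=\"dablink\"><p>See also</p></div>"
          + "<p>The potato is a tuber.</p>"
          + "<p>Second paragraph.</p>"
          + "</div>";
        System.out.println(firstParagraph(html));
        System.out.println(firstParagraph("<div>no match</div>"));
    }
}
```

With jsoup on the classpath, this should print "The potato is a tuber." followed by "null". Whether you want the plain descendant selector or the stricter child combinator depends on the page layout, so check it in your browser's developer tools first.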