The first paragraph of Wikipedia

I am writing Java code for NLP tasks that use text from Wikipedia. How can I use jsoup to extract the first paragraph of a Wikipedia article?

Thank you very much.

3 answers

It is very simple, and the process is much the same for any semi-structured page you want to extract information from.

First, you need to uniquely identify the DOM element where the required information is located. The easiest way to do this is to use a web development tool like Firebug in Firefox, or the ones that come bundled with IE (> 6, I think) and Chrome.

Using the Potato article as an example, you will find that the <p> (paragraph) element you are interested in sits in the following block:

    <div class="mw-content-ltr" lang="en" dir="ltr">
      <div class="metadata topicon" id="protected-icon" style="display: none; right: 55px;">[...]</div>
      <div class="dablink">[...]</div>
      <div class="dablink">[...]</div>
      <div>[...]</div>
      <p>The potato [...]</p>
      <p>[...]</p>
      <p>[...]</p>

In other words, you want to find the first <p> element inside the div whose class is mw-content-ltr.

Then you just need to select this element with jsoup using its selector syntax, which is very similar to jQuery's. For example:

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class WikipediaParser {

        private final String baseUrl;

        public WikipediaParser(String lang) {
            // Build the base URL for the given language edition, e.g. "en".
            this.baseUrl = String.format("http://%s.wikipedia.org/wiki/", lang);
        }

        public String fetchFirstParagraph(String article) throws IOException {
            String url = baseUrl + article;
            // Download and parse the article page.
            Document doc = Jsoup.connect(url).get();
            // All <p> elements inside the element with class mw-content-ltr.
            Elements paragraphs = doc.select(".mw-content-ltr p");
            Element firstParagraph = paragraphs.first();
            return firstParagraph.text();
        }

        public static void main(String[] args) throws IOException {
            WikipediaParser parser = new WikipediaParser("en");
            String firstParagraph = parser.fetchFirstParagraph("Potato");
            System.out.println(firstParagraph); // prints "The potato is a starchy [...]."
        }
    }

It seems that the first paragraph is also the first <p> block in the document. So this might work:

    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/B-tree").get();
    Elements paragraphs = doc.select("p");
    Element firstParagraph = paragraphs.first();

Now you can read the contents of this element, for example with firstParagraph.text().
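As a rough sketch of what reading the contents can look like, the example below parses a small hand-written HTML string offline instead of fetching a real page. The class name ParagraphContents, its helper methods and the sample markup are made up for illustration; only Jsoup.parse, select, first, text and html are jsoup calls, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParagraphContents {

    // Hypothetical stand-in for a downloaded article; not real Wikipedia markup.
    static final String SAMPLE_HTML =
        "<html><body><p>A <b>B-tree</b> is a tree.</p><p>Second.</p></body></html>";

    // Returns the first <p> element in the document, or null if there is none.
    static Element firstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").first();
    }

    // Plain text of the first paragraph, with all tags stripped.
    static String firstParagraphText(String html) {
        Element p = firstParagraph(html);
        return p == null ? "" : p.text();
    }

    // Inner HTML of the first paragraph, with inline tags kept.
    static String firstParagraphHtml(String html) {
        Element p = firstParagraph(html);
        return p == null ? "" : p.html();
    }

    public static void main(String[] args) {
        System.out.println(firstParagraphText(SAMPLE_HTML));
        System.out.println(firstParagraphHtml(SAMPLE_HTML));
    }
}
```

For NLP input, text() is probably the more useful of the two, since it drops the markup; html() may help if links or emphasis need to be preserved.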


The solution proposed by Silva works in most cases, but not for articles such as JavaScript and United States. For those, the elements should be selected with doc.select(".mw-body-content p");

Check out this GitHub code for more details. You can also remove some metadata from HTML to increase accuracy.
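One possible way to combine these suggestions is sketched below; it is not the code from the GitHub link. It tries the two selectors mentioned in the answers in order, removes some elements first, and skips paragraphs with no text. Treating tables and sup.reference footnote markers as the "metadata" to remove is an assumption, as is the guess that empty leading paragraphs are what breaks the simpler approach. The class name FirstParagraphExtractor and the sample HTML are hypothetical, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstParagraphExtractor {

    // Selectors tried in order; both class names come from the answers above.
    private static final String[] SELECTORS = {".mw-content-ltr p", ".mw-body-content p"};

    // Hand-written sample markup for offline testing; not real Wikipedia HTML.
    static final String SAMPLE_HTML =
        "<div class=\"mw-body-content\">"
        + "<table><tr><td><p>Infobox text</p></td></tr></table>"
        + "<p></p>"
        + "<p>The potato is a starchy tuber.<sup class=\"reference\">[1]</sup></p>"
        + "</div>";

    static String firstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        // Assumption: tables and footnote markers count as "metadata" here.
        doc.select("table, sup.reference").remove();
        for (String selector : SELECTORS) {
            for (Element p : doc.select(selector)) {
                String text = p.text().trim();
                if (!text.isEmpty()) {
                    return text; // first paragraph that actually has text
                }
            }
        }
        return ""; // nothing matched any selector
    }

    public static void main(String[] args) {
        System.out.println(firstParagraph(SAMPLE_HTML));
    }
}
```

To run this against a live article, the Document returned by Jsoup.connect(url).get() could be used in place of Jsoup.parse(html); the selection logic stays the same.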


Source: https://habr.com/ru/post/1383309/

