Get text content from mediawiki page via API

I am new to MediaWiki and I have run into a problem. I have a wiki page title, and I want to get only the text of that page using api.php, but all I have found in the API is a way to get the page content with wiki markup. I used this HTTP request:

/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=xml&titles=test 

But I only need text content without Wiki markup. Is this possible using the MediaWiki API?

+47
wikipedia-api mediawiki-api mediawiki
Oct 26 '09 at 14:32
10 answers

I do not think you can get just the plain text through the API.

What worked for me was to request the HTML page (using the normal URL that would be used in a browser) and strip out the HTML tags below the content div.

EDIT:

I had good results using HTML Parser for Java. It comes with examples of how to strip the HTML tags below a given div.

+4
Oct 26 '09 at 14:51

Use action=parse to get HTML:

/api.php?action=parse&page=test

One way to get the text out of the HTML is to load it in a browser and walk the nodes with JavaScript, collecting only the text nodes.
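A rough Python sketch of the same idea done outside the browser, assuming the requests and beautifulsoup4 packages and an example api.php endpoint (adjust it for your own wiki):

    import requests
    from bs4 import BeautifulSoup

    # Example endpoint; point this at your own wiki's api.php.
    API_URL = "https://en.wikipedia.org/w/api.php"

    params = {
        "action": "parse",
        "page": "test",      # page title from the example above
        "format": "json",
    }
    resp = requests.get(API_URL, params=params)
    resp.raise_for_status()

    # action=parse returns the rendered HTML under parse.text["*"].
    html = resp.json()["parse"]["text"]["*"]

    # Keep only the text nodes, discarding the tags.
    print(BeautifulSoup(html, "html.parser").get_text(separator="\n"))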

+60
May 27 '11 at 16:50

The TextExtracts API extension does what you are asking for. Use prop=extracts to get a cleaned-up response. For example, this link will give you cleaned-up text for the Stack Overflow article. What is also nice is that it still contains section tags, so you can identify individual sections of the article.

To have a visible link in my answer, the link above is:

 /api.php?format=xml&action=query&prop=extracts&titles=Stack%20Overflow&redirects=true 

Edit: As Amr mentioned, TextExtracts is an extension for MediaWiki, so it will not necessarily be available for every MediaWiki site.
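As a rough sketch, the same query can be issued from Python (assuming the requests package and a wiki that has TextExtracts installed; format=json is used here instead of xml because it is easier to parse):

    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"  # assumes TextExtracts is installed

    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "titles": "Stack Overflow",
        "redirects": "",      # an empty value is enough to enable the flag
    }
    data = requests.get(API_URL, params=params).json()

    # The extract for each page is keyed by page id under query.pages.
    for page in data["query"]["pages"].values():
        print(page.get("extract", ""))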

+33
Feb 18 '14 at 4:05

Appending ?action=raw to the end of a MediaWiki page URL returns the latest content as raw wikitext. For example: https://en.wikipedia.org/wiki/Main_Page?action=raw
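A minimal Python sketch of that request, assuming the requests package (keep in mind the result is raw wikitext, markup included):

    import requests

    # action=raw returns the latest revision of the page as wikitext.
    wikitext = requests.get("https://en.wikipedia.org/wiki/Main_Page",
                            params={"action": "raw"}).text
    print(wikitext)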

+23
Mar 06

You can get wiki data in plain-text format from the API using the explaintext parameter. If you need data for several titles, you can also fetch them all in a single call by separating the titles with the pipe character |. For example, this API call returns data for both the Yahoo and Google pages (a Python sketch of the same call follows the parameter list below):

 http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects= 

Options:

  • explaintext : Return extracts as plain text instead of limited HTML.
  • exlimit=max : Return more than one result. The maximum is currently 20.
  • exintro : Return only the content before the first section. Remove this parameter if you want the full text.
  • redirects= : Resolve redirects.
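A rough Python sketch of the call above, assuming the requests package:

    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"

    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exlimit": "max",
        "explaintext": "",   # empty value just switches the flag on
        "exintro": "",       # remove this entry to get the full page text
        "titles": "Yahoo|Google",
        "redirects": "",
    }
    data = requests.get(API_URL, params=params).json()

    # One entry per page, keyed by page id.
    for page in data["query"]["pages"].values():
        print(page["title"])
        print(page.get("extract", ""))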
+20
Jun 10 '15 at 18:31

Wiki pages without any formatting characters will not make much sense in many cases.

You can strip the formatting yourself if you want, but you will break some things in the process.

(Unless you are building something like a search engine; in that case you only need the text parts and can ignore the formatting entirely.)

0
Oct 26 '09 at 14:49

Python users coming to this question may be interested in the wikipedia module ( docs ):

    import wikipedia

    wikipedia.set_lang('de')
    page = wikipedia.page('Wikipedia')
    print(page.content)

All formatting, with the exception of section headings ( == ), is stripped.

0
Aug 03 '17 at 6:52

One thing you can do: once the content has been placed on your page, use the PHP strip_tags() function to remove the HTML tags.

-4
Jun 23 '17 at 14:50


