Get text content from mediawiki page via API

I am new to MediaWiki and I have run into a problem. I have a wiki page title, and I want to get only the text of that page using api.php, but all I have found in the API is a way to get the page content with wiki markup. I used this HTTP request:

/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=xml&titles=test 

But I only need text content without Wiki markup. Is this possible using the MediaWiki API?

+47
wikipedia-api mediawiki-api mediawiki
Oct 26 '09 at 14:32
10 answers

I do not think you can get just the plain text through the API.

What worked for me was to request the HTML page (using the normal URL that would be used in a browser) and strip out the HTML tags below the content div.

EDIT:

I had good results using HTML Parser for Java. It comes with examples of how to strip the HTML tags below a given div.

+4
Oct 26 '09 at 14:51

Use action=parse to get HTML:

/api.php?action=parse&page=test

One way to get the text out of the HTML is to load it in a browser and walk the nodes with JavaScript, collecting only the text nodes.
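A rough Python sketch of the same idea done outside the browser, assuming the requests and beautifulsoup4 packages and an example api.php endpoint (adjust it for your own wiki):

    import requests
    from bs4 import BeautifulSoup

    # Example endpoint; point this at your own wiki's api.php.
    API_URL = "https://en.wikipedia.org/w/api.php"

    params = {
        "action": "parse",
        "page": "test",      # page title from the example above
        "format": "json",
    }
    resp = requests.get(API_URL, params=params)
    resp.raise_for_status()

    # action=parse returns the rendered HTML under parse.text["*"].
    html = resp.json()["parse"]["text"]["*"]

    # Keep only the text nodes, discarding the tags.
    print(BeautifulSoup(html, "html.parser").get_text(separator="\n"))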

+60
May 27 '11 at 16:50

The TextExtracts API extension does what you are asking for. Use prop=extracts to get a cleaned-up response. For example, this link will give you cleaned-up text for the Stack Overflow article. What is also nice is that it still contains section tags, so you can identify individual sections of the article.

To have a visible link in my answer, the link above is:

 /api.php?format=xml&action=query&prop=extracts&titles=Stack%20Overflow&redirects=true 

Edit: As Amr mentioned, TextExtracts is an extension for MediaWiki, so it will not necessarily be available for every MediaWiki site.
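As a rough sketch, the same query can be issued from Python (assuming the requests package and a wiki that has TextExtracts installed; format=json is used here instead of xml because it is easier to parse):

    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"  # assumes TextExtracts is installed

    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "titles": "Stack Overflow",
        "redirects": "",      # an empty value is enough to enable the flag
    }
    data = requests.get(API_URL, params=params).json()

    # The extract for each page is keyed by page id under query.pages.
    for page in data["query"]["pages"].values():
        print(page.get("extract", ""))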

+33
Feb 18 '14 at 4:05

Appending ?action=raw to the end of a MediaWiki page URL returns the latest content as raw wikitext. For example: https://en.wikipedia.org/wiki/Main_Page?action=raw
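A minimal Python sketch of that request, assuming the requests package (keep in mind the result is raw wikitext, markup included):

    import requests

    # action=raw returns the latest revision of the page as wikitext.
    wikitext = requests.get("https://en.wikipedia.org/wiki/Main_Page",
                            params={"action": "raw"}).text
    print(wikitext)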

+23
Mar 06

You can get wiki data in plain-text format from the API using the explaintext parameter. If you need data for several titles, you can also fetch them all in a single call by separating the titles with the pipe character |. For example, this API call returns data for both the Yahoo and Google pages (a Python sketch of the same call follows the parameter list below):

 http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects= 

Options:

  • explaintext : Return extracts as plain text instead of limited HTML.
  • exlimit=max : Return more than one result. The maximum is currently 20.
  • exintro : Return only the content before the first section. Remove this parameter if you want the full text.
  • redirects= : Resolve redirects.
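A rough Python sketch of the call above, assuming the requests package:

    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"

    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exlimit": "max",
        "explaintext": "",   # empty value just switches the flag on
        "exintro": "",       # remove this entry to get the full page text
        "titles": "Yahoo|Google",
        "redirects": "",
    }
    data = requests.get(API_URL, params=params).json()

    # One entry per page, keyed by page id.
    for page in data["query"]["pages"].values():
        print(page["title"])
        print(page.get("extract", ""))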
+20
Jun 10 '15 at 18:31

Wiki pages without any formatting characters will not make much sense in many cases.

You can strip the formatting yourself if you want, but you will break some things in the process.

(Unless you are building something like a search engine; in that case you only need the text parts and can ignore the formatting entirely.)

0
Oct 26 '09 at 14:49

Python users coming to this question may be interested in the wikipedia module ( docs ):

    import wikipedia

    wikipedia.set_lang('de')
    page = wikipedia.page('Wikipedia')
    print(page.content)

All formatting, with the exception of section headings ( == ), is stripped.

0
Aug 03 '17 at 6:52

One thing you can do: once the content has been placed on your page, use the PHP strip_tags() function to remove the HTML tags.

-4
Jun 23 '17 at 14:50


