Processing the Wikipedia dump file

I want to process the Wikipedia dump file. More specifically, I want to extract the title, categories, and text of each article. Is there a Java API or tool that can help me with this? Thanks in advance.

3 answers

The Wikipedia dump file is in XML format, so you can use any available XML tools to process it.

Note that because of the size of the dump file, a SAX parser will usually be far more efficient than a DOM parser, since a DOM parser tries to load the entire document into memory at once.
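As a minimal sketch of the SAX approach, the handler below streams a MediaWiki export and collects each page's `<title>` and `<text>` elements (the element names follow the MediaWiki XML export schema; the sample input in `main` is a toy document for illustration). Categories are not separate XML elements in the dump — they appear inside the wikitext as `[[Category:...]]` links, so you would extract them from the text afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DumpParser {

    // Stream the dump and return (title, wikitext) pairs without
    // ever holding the whole document in memory.
    public static List<String[]> parse(InputStream in) throws Exception {
        List<String[]> pages = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            StringBuilder buf = new StringBuilder();
            String title, text;
            boolean capture = false;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // Only buffer character data inside <title> and <text>.
                capture = qName.equals("title") || qName.equals("text");
                buf.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                if (capture) buf.append(ch, start, len);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("title")) title = buf.toString();
                else if (qName.equals("text")) text = buf.toString();
                else if (qName.equals("page")) pages.add(new String[] { title, text });
                capture = false;
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // Toy stand-in for a real dump file; a real run would open a
        // FileInputStream on the multi-gigabyte XML dump instead.
        String sample =
              "<mediawiki>"
            + "<page><title>Java</title><revision>"
            + "<text>[[Category:Programming languages]] Java is ...</text>"
            + "</revision></page>"
            + "</mediawiki>";
        List<String[]> pages =
            parse(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));
        for (String[] p : pages) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

Because the handler only ever keeps the current page's buffers, memory use stays flat regardless of dump size, which is exactly why SAX beats DOM here.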


Take a look at http://code.google.com/p/jwpl/. It is a Java API that gives you structured access to Wikipedia dumps. You need a database (MySQL or similar), and recent Wikipedia dumps require a lot of RAM to process, at least 4 GB.

But it is nice to use: you can get an iterator over all pages (or page titles), which makes it much easier to work with.


Are you looking for something like this?

http://code.google.com/p/gwtwiki/wiki/MediaWikiDumpSupport

The page has examples of how to work with the API.


Source: https://habr.com/ru/post/1396712/
