Processing the Wikipedia dump file

I want to process the Wikipedia dump file. More specifically, I want to extract the title, categories, and text of each article. Is there a Java API or tool that can help me with this? Thanks in advance.

3 answers

The Wikipedia dump file is in XML format, so you can use any available XML tools to process it.

Note that because of the size of the dump file, a SAX parser will usually be far more efficient than a DOM parser, since a DOM parser tries to load the entire document into memory at once.
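As a minimal sketch of the SAX approach, the handler below streams a MediaWiki export and collects each page's `<title>` and `<text>` elements (the element names follow the MediaWiki XML export schema; the sample input in `main` is a toy document for illustration). Categories are not separate XML elements in the dump — they appear inside the wikitext as `[[Category:...]]` links, so you would extract them from the text afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DumpParser {

    // Stream the dump and return (title, wikitext) pairs without
    // ever holding the whole document in memory.
    public static List<String[]> parse(InputStream in) throws Exception {
        List<String[]> pages = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            StringBuilder buf = new StringBuilder();
            String title, text;
            boolean capture = false;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // Only buffer character data inside <title> and <text>.
                capture = qName.equals("title") || qName.equals("text");
                buf.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                if (capture) buf.append(ch, start, len);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("title")) title = buf.toString();
                else if (qName.equals("text")) text = buf.toString();
                else if (qName.equals("page")) pages.add(new String[] { title, text });
                capture = false;
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // Toy stand-in for a real dump file; a real run would open a
        // FileInputStream on the multi-gigabyte XML dump instead.
        String sample =
              "<mediawiki>"
            + "<page><title>Java</title><revision>"
            + "<text>[[Category:Programming languages]] Java is ...</text>"
            + "</revision></page>"
            + "</mediawiki>";
        List<String[]> pages =
            parse(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));
        for (String[] p : pages) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

Because the handler only ever keeps the current page's buffers, memory use stays flat regardless of dump size, which is exactly why SAX beats DOM here.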


Take a look at http://code.google.com/p/jwpl/. It is a Java API that gives you structured access to Wikipedia dumps. You need a database (MySQL or similar), and recent Wikipedia dumps require a lot of RAM to process, at least 4 GB.

But it is nice to use: you can get an iterator over all pages (or page titles), which makes it much easier to work with.


Are you looking for something like this?

http://code.google.com/p/gwtwiki/wiki/MediaWikiDumpSupport

The page has examples of how to work with the API.


Source: https://habr.com/ru/post/1396712/
