Is there a solution for parsing a Wikipedia XML dump file in Java?

I'm trying to parse a huge (25 GB+) Wikipedia XML dump file. Any solution that helps would be appreciated, preferably in Java.

+3
8 answers

Java API for Wikipedia XML dumps: WikiXMLJ (last updated in November 2010).
There is also a live mirror, which is Maven-compatible and includes some bug fixes.

+7

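Yes, a huge XML file can be parsed in Java, as long as you use the right kind of XML parser: a streaming one such as SAX, not DOM, which tries to load the whole document into memory.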

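As a rough illustration of the SAX approach, here is a minimal sketch that streams a dump and hands each page title to a callback. The class name `SaxTitleReader` and the method `forEachTitle` are made up for this example, and it assumes the dump has already been decompressed and stores titles in `<title>` elements.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of any library mentioned here.
class SaxTitleReader {

    /** Streams the XML and passes the text of every <title> element to the callback. */
    static void forEachTitle(InputStream in, Consumer<String> onTitle) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inTitle;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("title".equals(qName)) {
                    inTitle = true;
                    text.setLength(0); // start a fresh buffer for this title
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several chunks, so append.
                if (inTitle) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    inTitle = false;
                    onTitle.accept(text.toString());
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to an uncompressed dump file (assumption for this sketch)
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            forEachTitle(in, System.out::println);
        }
    }
}
```

Only the text of the current title is held in memory, so the file size should not matter much. On very large inputs some JDK versions may hit JAXP processing limits; if that happens, system properties such as `jdk.xml.totalEntitySizeLimit` are worth checking.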

+4

Here is a Java project that parses Wikipedia XML dumps:
http://code.google.com/p/gwtwiki/. It can also be used from a Java program to convert Wikipedia XML content to HTML, PDF, text, ...: http://code.google.com/p/gwtwiki/wiki/MediaWikiDumpSupport

+3

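Right, do not use DOM. If you only need a small part of the data and want to store it in your own POJOs, an XSLT transformation is also an option.

You can transform the data into a simpler XML format and then convert it to POJOs with Castor/JAXB (XML-to-object libraries).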


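--- EDIT ---

The links below compare the different parsers. StAX looks like the better option: it is a pull parser, so your code controls the parser and pulls data from it only when needed.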

http://java.sun.com/webservices/docs/1.6/tutorial/doc/SJSXP2.html

http://tutorials.jenkov.com/java-xml/sax-vs-stax.html
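To make the "pull" idea concrete, here is a minimal StAX sketch that reads page titles and stops as soon as it has enough of them, something a SAX handler cannot do as directly. The class name `StaxTitleReader` and the method `firstTitles` are invented for this example, and it assumes an uncompressed dump with `<title>` elements.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of any library mentioned here.
class StaxTitleReader {

    /** Pulls events from the parser until `limit` titles are collected or the input ends. */
    static List<String> firstTitles(InputStream in, int limit) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> titles = new ArrayList<>();
        try {
            // The application drives the loop: each next() call pulls one event.
            while (titles.size() < limit && reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // Reads the element's text and leaves the cursor on its end tag.
                    titles.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to an uncompressed dump file (assumption for this sketch)
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            firstTitles(in, 10).forEach(System.out::println);
        }
    }
}
```

Because the loop belongs to the caller, it can skip, stop, or hand the reader to another method at any point, which is the control the comparison links describe.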

+2

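If you don't need to modify the XML, consider using SAX. It keeps only one node in memory at a time (unlike DOM, which tries to build the whole tree in memory).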

+1

I would go with StAX, since it is more flexible than SAX (which is also a good option).

+1

I had the same problem and found that Wikipedia XML dumps can be parsed with Wiki Parser.

, , Java, , , XML .

, WikiParser 2-3 , .

0

For parsing wiki dumps there is also https://github.com/Stratio/wikipedia-parser, which works directly on the XML dump.

An example in Scala:

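import java.io.{BufferedInputStream, FileInputStream}
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
// XMLDumpParser, RevisionCallback and Revision come from wikipedia-parser.

// Decompress the .bz2 dump on the fly and stream it into the parser.
val parser = new XMLDumpParser(new BZip2CompressorInputStream(new BufferedInputStream(new FileInputStream(pathToWikipediaDump)), true))

// Called for each revision found in the dump.
parser.getContentHandler.setRevisionCallback(new RevisionCallback {
  override def callback(revision: Revision): Unit = {
    val page = revision.getPage
    val title = page.getTitle
    val articleText = revision.getText()
    println(articleText)
  }
})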


--- EDIT ---

I am currently working on https://github.com/idio/wiki2vec, which I think covers part of the pipeline you might need. Feel free to take a look at the code.

0

Source: https://habr.com/ru/post/1746346/

