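To parse a Wikipedia dump, you can use https://github.com/Stratio/wikipedia-parser.
It reads the XML dump and invokes a callback for each revision it encounters.
For example, in Scala: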
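import java.io.{BufferedInputStream, FileInputStream}
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream

// XMLDumpParser, RevisionCallback and Revision come from the wikipedia-parser library.
// Passing `true` lets the bzip2 stream decompress concatenated (multi-stream) archives.
val parser = new XMLDumpParser(
  new BZip2CompressorInputStream(
    new BufferedInputStream(new FileInputStream(pathToWikipediaDump)), true))

parser.getContentHandler.setRevisionCallback(new RevisionCallback {
  override def callback(revision: Revision): Unit = {
    val page = revision.getPage
    val title = page.getTitle
    val articleText = revision.getText()
    println(articleText)
  }
})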
--- ---
I am currently working on https://github.com/idio/wiki2vec, which I think covers part of the pipeline you might need. Feel free to take a look at the code.