DBpedia extraction basics - how to strip MediaWiki formatting markup

I'm playing with the DBpedia extraction framework. It seems very nice, and I'm happily building ASTs from Wikipedia pages and extracting links (using WikiParser). However, although parsing gives me a nicely structured tree, I notice that the text nodes still contain a lot of formatting markup (for example, the apostrophes used for italics, bold, etc.). This is not useful for my purposes: I just want plain text.

I could spend some time writing my own code to strip this out, but I assume something like that would be useful for DBpedia itself, and so already exists somewhere in the library. Am I right? And if so, where is the functionality for stripping text nodes down to bare text?

Otherwise, does anyone know of any other (preferably Scala) packages for stripping MediaWiki markup?

Edit

In response to a request for more detail, take the following markup:

''An italicised '''bit''' of text'', <b>Some markup</b>

It is included by DBpedia as TextNode content, but left intact. I would like to be able to get this instead:

 An italicised bit of text, Some markup

Or, alternatively, a more structured AST with additional nodes representing each section of the raw text, perhaps with each node annotated with the type of formatting used (for example, italic, bold, etc.).



Regarding SimpleWikiParser on SourceForge, as of 1/29/2011:

  • .

, wiki TextNode. wiki-, , , .

Alternate Parsers.

, node.text.


gwtwiki (bliki) can convert MediaWiki markup to PDF, HTML, etc.


You can start by using WikiUtil.removeWikiEmphasis and then adding a few extra rules of your own.
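As a rough illustration of what "a few extra rules" might look like, here is a minimal, self-contained sketch. It does not call the DBpedia library: the object name StripMarkup and the two regular expressions are my own assumptions, standing in for whatever WikiUtil.removeWikiEmphasis does internally plus one extra rule for HTML-style tags.

```scala
object StripMarkup {
  // Assumed rule 1: runs of two or more apostrophes mark emphasis,
  // i.e. '' (italic), ''' (bold), ''''' (bold italic).
  private val Emphasis = "'{2,}".r

  // Assumed rule 2: opening or closing HTML-style tags such as <b>, </b>, <br/>.
  private val HtmlTag = "</?[A-Za-z][^>]*>".r

  // Drops the markers themselves and keeps the text they wrap.
  def toPlainText(wikiText: String): String = {
    val withoutEmphasis = Emphasis.replaceAllIn(wikiText, "")
    HtmlTag.replaceAllIn(withoutEmphasis, "")
  }
}
```

Applied to the sample from the question, StripMarkup.toPlainText yields "An italicised bit of text, Some markup". Regexes like these are deliberately naive: they do not handle templates, tables, or nowiki sections, and they would misread a possessive apostrophe directly followed by emphasis markers.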

In my case, I map text nodes to their toWikiText output and link nodes to their destination name.

// Cases of a pattern match over the parsed nodes
case text: TextNode => text.toWikiText
case link: LinkNode => {
  // Replace each kind of link with the name of its destination
  link match {
    case external: ExternalLinkNode => external.destination.toString
    case internal: InternalLinkNode => internal.destination.decodedWithNamespace
    case inter: InterWikiLinkNode   => inter.destination.decodedWithNamespace
  }
}
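To show the same idea end to end without the framework on the classpath, here is a small sketch using hypothetical stand-in classes. The TextNode, LinkNode, ExternalLinkNode and InternalLinkNode definitions below are simplified assumptions (the real DBpedia node classes carry more fields and typed destinations), and Flatten is a made-up helper name.

```scala
// Hypothetical, simplified stand-ins for the framework's node classes.
sealed trait Node
final case class TextNode(text: String) extends Node
sealed trait LinkNode extends Node
final case class ExternalLinkNode(destination: String) extends LinkNode
final case class InternalLinkNode(destination: String) extends LinkNode

object Flatten {
  // Text nodes contribute their text; link nodes contribute their destination name.
  def render(node: Node): String = node match {
    case TextNode(text) => text
    case link: LinkNode =>
      link match {
        case ExternalLinkNode(destination) => destination
        case InternalLinkNode(destination) => destination
      }
  }

  // Concatenates the rendered children in document order.
  def flatten(nodes: Seq[Node]): String = nodes.map(render).mkString
}
```

With these stand-ins, Flatten.flatten(Seq(TextNode("See "), InternalLinkNode("Scala"))) gives "See Scala". Any emphasis stripping would still have to be applied to the text nodes separately.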

Source: https://habr.com/ru/post/1796327/

