I have been playing with the DBpedia extraction framework. It is enjoyable to work with, and I am happily building ASTs from Wikipedia pages and extracting links (using WikiParser). However, although parsing gives me a nicely structured tree, I notice that the text nodes still contain a lot of formatting markup (for example, the apostrophes used for italics, bold, etc.). This is not useful for my purposes: I just want plain text.
I could spend some time writing my own code to strip this out, but I suppose something like that would be useful to DBpedia itself, and so already exists somewhere in the library. Am I right? And if so, where is the functionality that strips the content down to bare text?
Otherwise, does anyone know of any other (preferably Scala) packages for stripping out MediaWiki markup?
Edit
In response to a request for more detail, take the following markup:
''An italicised '''bit''' of text'', <b>Some markup</b>
This ends up in DBpedia as the content of a TextNode, but with the markup left intact. I would like to get this instead:
An italicised bit of text, Some markup
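In case it helps to make the goal concrete, here is a minimal sketch of the hand-rolled fallback I have in mind, using only the Scala standard library. The object name MarkupStripper is hypothetical and is not part of the DBpedia framework; it only handles apostrophe runs and simple HTML-style tags, not links, templates or other MediaWiki constructs.

```scala
object MarkupStripper {
  // Runs of two or more apostrophes: '' (italic), ''' (bold), ''''' (bold italic).
  // A single apostrophe (as in "don't") is left alone.
  private val QuoteRun = "'{2,}".r

  // Opening or closing HTML-style tags such as <b>, </b> or <br/>.
  private val HtmlTag = "</?[A-Za-z][^>]*>".r

  // Hypothetical helper: removes the tags first, then the apostrophe runs.
  def strip(text: String): String = {
    val withoutTags = HtmlTag.replaceAllIn(text, "")
    QuoteRun.replaceAllIn(withoutTags, "")
  }
}
```

Applied to the example above, MarkupStripper.strip returns "An italicised bit of text, Some markup". I would still much rather use something that already exists and understands the rest of the markup.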
Or, alternatively, a more structured AST with additional nodes representing each run of the raw text, each node perhaps annotated with the type of formatting applied (for example, italic, bold, etc.).