I am trying to parse specific content on Wikipedia in a structured way. Here is an example page:
http://en.wikipedia.org/wiki/Polar_bear
I have had some success. I can detect that this page is a "species" page, and I can also parse the Taxobox information (on the right) into a structure. So far so good.
However, I am also trying to parse the paragraphs of text. The API can return them in either wikitext or HTML format; I am currently working with the wikitext format.
I can read these paragraphs, but I would like to "clean" them up, because in the end I have to display them in my application, and raw Wiki markup makes no sense there. For example, I would like to remove all images. That is fairly easy by filtering out the [[Image:]] blocks. However, there are also blocks that I simply cannot remove, for example:
{{convert|350|-|680|kg|abbr=on}}
Removing the whole block would break the sentence, and there are dozens of these template types that matter. I would like to avoid writing a hundred regular expressions to handle them all, and instead find a smarter way to parse this.
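To illustrate, here is a stripped-down sketch of the kind of filtering I am doing now (the sample sentence, function name and regexes are just for illustration, not my actual code). It shows how naively dropping a template leaves a hole in the sentence:

```python
import re

def strip_simple_markup(wikitext):
    # Drop [[Image:...]] / [[File:...]] blocks that contain no nested [[...]]
    # links; captions with nested links would need bracket counting instead.
    text = re.sub(r"\[\[(?:Image|File):[^\[\]]*\]\]", "", wikitext)
    # Naively dropping templates such as {{convert|350|-|680|kg|abbr=on}}
    # removes the numbers and units, which is exactly the problem.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Turn [[target|label]] and [[target]] links into plain text.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)
    return text

sample = ("Adult males weigh {{convert|350|-|680|kg|abbr=on}} and are the "
          "largest [[Carnivora|land carnivores]].")
print(strip_simple_markup(sample))
# -> "Adult males weigh  and are the largest land carnivores."
#    (the weight is gone and the sentence is broken)
```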
My dilemma is this:
- I could continue on my current path of semi-structured parsing, which means a lot of work removing unwanted elements as well as mimicking the templates that do need to be rendered.
- Or, I could start from the HTML output and parse that instead, but I worry it is just as fragile and complex to parse in a structured way (a rough sketch of that route is below).
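Just to make the second option concrete, this is roughly what the HTML route might look like (using the requests and BeautifulSoup libraries here purely as an assumption, not something I have implemented):

```python
import requests
from bs4 import BeautifulSoup

# Ask the MediaWiki API for the rendered HTML of the article.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "parse",
        "page": "Polar_bear",
        "prop": "text",
        "format": "json",
    },
)
html = resp.json()["parse"]["text"]["*"]  # rendered article body as HTML

soup = BeautifulSoup(html, "html.parser")
# Keep the paragraph elements; templates like {{convert}} are already
# expanded here, but navboxes, reference markers, etc. still remain.
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
print(paragraphs[0])
```

The appeal is that templates such as {{convert}} are already expanded in this output, but the HTML also carries navigation boxes, reference superscripts and similar clutter that would need its own filtering, which is why I suspect it is just as fragile.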
Ideally, a library would already solve this problem, but I have not found one yet. I also looked at structured Wikipedia data sets such as DBpedia, but they only contain the same structured facts I already have; they do not add any structure to the Wiki text itself.
Ferdy