What is the best way to parse Wikipedia markup in PHP?

I am trying to parse specific content on Wikipedia in a structured way. Here is an example page:

http://en.wikipedia.org/wiki/Polar_bear

I am having some success. I can detect that this page is a "species" page, and I can also parse the Taxobox information (on the right) into a structure. So far so good.

However, I am also trying to parse the text paragraphs. These are returned by the API in either Wiki or HTML format; I am currently working with the Wiki format.

I can read these paragraphs, but I would like to "clean" them up in a certain way, because ultimately I will have to display them in my application, and raw Wiki markup makes no sense there. For example, I would like to remove all images. That is fairly easy by filtering out the [[Image:]] blocks. However, there are also blocks that I simply cannot remove, for example:

{{convert|350|-|680|kg|abbr=on}}

Removing this entire block would break the sentence, and there are dozens of notations like this that carry special meaning. I would like to avoid writing a hundred regular expressions to handle all of this, and instead find a smarter way to parse it.
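To illustrate what that semi-structured path looks like in practice, here is a minimal PHP sketch (not a full solution): it strips [[Image:]]/[[File:]] links and expands the {{convert}} form from the example above in place, so the sentence stays readable. The helper name cleanWikitext is made up for this example, and only the two simplest {{convert}} forms are handled; real articles use many more variants.

    <?php
    // Minimal sketch: strip image links, expand {{convert}} in place.

    function cleanWikitext(string $text): string
    {
        // Remove [[Image:]] / [[File:]] links, allowing one level of
        // nested [[...]] links inside the image caption.
        $text = preg_replace(
            '/\[\[(?:Image|File):(?:[^\[\]]|\[\[[^\[\]]*\]\])*\]\]/',
            '',
            $text
        );

        // Expand {{convert|350|-|680|kg|abbr=on}} into "350-680 kg".
        // Only the simple value and range forms are handled here.
        return preg_replace_callback(
            '/\{\{convert\s*\|([^{}]*)\}\}/i',
            function (array $m): string {
                $parts = array_map('trim', explode('|', $m[1]));
                // Drop named parameters such as abbr=on.
                $p = array_values(array_filter($parts, function ($x) {
                    return strpos($x, '=') === false;
                }));
                if (count($p) >= 4 && $p[1] === '-') {
                    return "{$p[0]}-{$p[2]} {$p[3]}"; // range: 350-680 kg
                }
                if (count($p) >= 2) {
                    return "{$p[0]} {$p[1]}";         // simple: 500 kg
                }
                return $m[0]; // leave unexpected forms untouched
            },
            $text
        );
    }

    echo cleanWikitext('Males weigh {{convert|350|-|680|kg|abbr=on}}.');
    // Males weigh 350-680 kg.

Each template would need its own expansion rule like this, which is exactly the scaling problem described above.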

My dilemma is this:

  • I could continue down my current path of semi-structured parsing, which would mean a lot of work removing unwanted elements as well as mimicking the templates that do need to be rendered.
  • Or, I could start from the HTML output and parse that, but I worry it would be just as fragile and complex to parse in a structured way.

Ideally, a library already exists that solves this problem, but I have not found one yet. I also looked at structured Wikipedia databases such as DBpedia, but they only contain the same structures I already have; they do not add any structure to the Wiki text itself.

1 answer

There are far too many templates in use to handle all of them manually, and they change all the time. So you will need an actual wiki syntax parser that can deal with all the templates.

And the wiki syntax is quite complex, has a lot of quirks and no formal specification. This means that creating your own parser would be far too much work; you should use the one in MediaWiki.

Because of this, I think your best bet is to get the parsed HTML through the MediaWiki API.
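For example, the standard action=parse endpoint returns the HTML that MediaWiki's own parser produces, which you can then walk with PHP's DOM extension. A rough sketch (the User-Agent string is a placeholder, and error handling is omitted):

    <?php
    // Fetch the rendered HTML of a page via the MediaWiki API
    // (action=parse) and print the text paragraphs.

    $url = 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action' => 'parse',
        'page'   => 'Polar bear',
        'prop'   => 'text',
        'format' => 'json',
    ]);

    // Wikipedia asks API clients to send a descriptive User-Agent.
    $context = stream_context_create([
        'http' => ['header' => "User-Agent: MyWikiParser/0.1\r\n"],
    ]);
    $data = json_decode(file_get_contents($url, false, $context), true);

    // The rendered article body is in parse.text["*"].
    $html = $data['parse']['text']['*'];

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings about HTML5 tags

    foreach ($doc->getElementsByTagName('p') as $p) {
        echo trim($p->textContent), "\n\n";
    }

This way all templates, including {{convert}}, arrive already expanded, and you only have to clean up HTML elements you do not want.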

One thing that is probably easier to parse from the wiki markup is the infobox, so maybe that should be handled as a special case.
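A hedged sketch of that special case: since templates nest, a plain regex cannot reliably match the whole Taxobox, but a small brace-counting scan over the raw wikitext can extract the balanced {{Taxobox ...}} block. extractTemplate is a hypothetical helper, not an existing library function:

    <?php
    // Extract a top-level {{Name ...}} template from raw wikitext by
    // counting {{ and }} pairs, so nested templates inside it are kept.

    function extractTemplate(string $wikitext, string $name): ?string
    {
        $start = stripos($wikitext, '{{' . $name);
        if ($start === false) {
            return null;
        }
        $depth = 0;
        $len = strlen($wikitext);
        for ($i = $start; $i < $len - 1; $i++) {
            $pair = substr($wikitext, $i, 2);
            if ($pair === '{{') {
                $depth++;
                $i++;
            } elseif ($pair === '}}') {
                $depth--;
                $i++;
                if ($depth === 0) {
                    return substr($wikitext, $start, $i + 1 - $start);
                }
            }
        }
        return null; // unbalanced braces
    }

    // $box = extractTemplate($wikitext, 'Taxobox');

The extracted block can then be split on top-level pipes into key=value pairs, which is far more tractable than parsing the article prose.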

