Python regular expression to search for content of MediaWiki markup links

If I have XML containing MediaWiki markup like the following:

"... collected in the XII century, of which [[Alexander the Great]] was a hero in whom he was represented as British [[King Arthur | Arthur]]"

what would be suitable arguments for something like:

re.findall([[__?__]], article_entry)

I am stumbling a bit on escaping the double square brackets and on getting the link target rather than the displayed text, for example: [[Alexander of Paris|poet named Alexander]]

4 answers

Here is an example:

import re

pattern = re.compile(r"\[\[([\w \|]+)\]\]")
text = "blah blah [[Alexander of Paris|poet named Alexander]] bldfkas"
results = pattern.findall(text)

output = []
for link in results:
    output.append(link.split("|")[0])

# outputs ['Alexander of Paris']

Version 2 captures the link target and the optional title in separate groups, so findall returns tuples instead of strings:

import re

pattern = re.compile(r"\[\[([\w ]+)(\|[\w ]+)?\]\]")
text = "[[a|b]] fdkjf [[c|d]] fjdsj [[efg]]"
results = pattern.findall(text)

# outputs [('a', '|b'), ('c', '|d'), ('efg', '')]

print([link[0] for link in results])

# outputs ['a', 'c', 'efg']

Version 3, if you only want the link target without the title, makes the title group non-capturing:

pattern = re.compile(r"\[\[([\w ]+)(?:\|[\w ]+)?\]\]")
text = "[[a|b]] fdkjf [[c|d]] fjdsj [[efg]]"
results = pattern.findall(text)

# outputs ['a', 'c', 'efg']
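
The three versions above use [\w ]+, so they miss targets that contain punctuation, such as "Stack Overflow (website)", as well as the "[[King Arthur | Arthur]]" form from the question, which has spaces around the pipe. A minimal sketch of a more permissive variant follows, assuming links are never nested; the helper name extract_links and the sample text are made up for illustration.

```python
import re

# Assumption: links are not nested, so "anything except brackets and pipes"
# is a safe definition of the target.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def extract_links(text):
    """Return (target, label) pairs; label falls back to target if absent."""
    links = []
    for match in LINK_RE.finditer(text):
        target = match.group(1).strip()
        label = match.group(2)
        # group(2) is None when the link has no "|label" part
        label = label.strip() if label is not None else target
        links.append((target, label))
    return links

sample = ("of which [[Alexander the Great]] was a hero, represented as "
          "[[King Arthur | Arthur]]; see [[Stack Overflow (website)|the site]]")
print(extract_links(sample))
```

Stripping whitespace after matching keeps the pattern simple while still normalising "King Arthur " to "King Arthur".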

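RegExp: \w+( \w+)+(?=]])

This matches a run of two or more space-separated words directly before the closing ]], which is the displayed text: "poet named Alexander" for [[Alexander of Paris|poet named Alexander]], and the whole title for [[Alexander the Great]].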
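
To see what the lookahead pattern \w+( \w+)+(?=]]) picks up in Python, here is a small sketch. The function name visible_text is hypothetical, and the closing brackets are escaped in the lookahead for readability, which should not change what the pattern matches.

```python
import re

# Lookahead pattern from the answer above, with the closing brackets escaped.
label_re = re.compile(r"\w+( \w+)+(?=\]\])")

def visible_text(markup):
    # Use group(0) for the whole match; findall() would return only the
    # contents of the capturing group, i.e. the last " word" repetition.
    return [m.group(0) for m in label_re.finditer(markup)]

print(visible_text("[[Alexander of Paris|poet named Alexander]]"))
print(visible_text("[[Alexander the Great]]"))
print(visible_text("[[King Arthur|Arthur]]"))
```

The last call returns an empty list: the pattern needs at least two words before ]], so a one-word label such as "Arthur" is skipped.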

import re
pattern = re.compile(r"\[\[([\w ]+)(?:\||\]\])")
text = "of which [[Alexander the Great]] was somewhat like [[King Arthur|Arthur]]"
results = pattern.findall(text)
print(results)

Output:

['Alexander the Great', 'King Arthur']

If you are trying to get all the links from a page, it is much easier to use the MediaWiki API if at all possible, for example: http://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Stack_Overflow_(website)

Note that both of these methods skip links embedded in templates.
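
The API query above can be assembled in Python rather than typed by hand. The sketch below only builds the URL and sends no request; the function name build_links_query is hypothetical, and the https scheme and the format=json parameter are assumptions added here, not part of the URL in the answer.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia endpoint over https.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_links_query(title):
    """Build a prop=links query URL for one page title."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "format": "json",  # assumption: JSON output is wanted
    }
    # urlencode percent-encodes the title, so spaces and parentheses are safe
    return API_ENDPOINT + "?" + urlencode(params)

print(build_links_query("Stack Overflow (website)"))
```

The resulting URL can then be fetched with any HTTP client; long link lists may come back in several batches, so a real client would need to handle continuation.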


Source: https://habr.com/ru/post/1707454/

