Python regex with wiki text

I am trying to change wikitext into plain text using Python regex substitution. There are two rules for formatting a wiki link.

  • [[Page Name]]
  • [[Page title | Text to display]]

    (http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet)

Here is some text that gives me a headache.

The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally.

The text above should be changed to:

The CD is composed almost entirely of cover versions of The Beatles songs which George Martin produced originally.

The conflict between the [[ ]] and [[ | ]] syntax is my main problem. I don't need a single complicated regular expression; applying several (possibly two) regular expression substitutions in sequence is fine.

Please enlighten me on this issue.

+3
4 answers
import re
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')
def strip_wikilinks(the_string):  # function name is a placeholder
    return wikilink_rx.sub(r'\1', the_string)

Example: http://ideone.com/7oxuz
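
For readers who cannot open the link, here is a minimal sketch of the same pattern applied to a sample sentence containing both link forms. The helper name strip_wikilinks and the variable sample are illustrative and not part of the original answer.

```python
import re

# Same pattern as in the answer: an optional "Page title|" prefix that is
# matched but not captured, followed by the captured display text.
WIKILINK_RX = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')


def strip_wikilinks(text):
    """Replace each wiki link with its display text (hypothetical helper)."""
    return WIKILINK_RX.sub(r'\1', text)


sample = (
    "The CD is composed almost entirely of [[cover version]]s of "
    "[[The Beatles]] songs which George Martin "
    "[[record producer|produced]] originally."
)
print(strip_wikilinks(sample))
```

One thing to watch: with spaces around the pipe, as in [[Page title | Text to display]], the captured text keeps its leading space, so you may want to strip or normalize whitespace afterwards.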

Note: there is also a list of MediaWiki parsers at http://www.mediawiki.org/wiki/Alternative_parsers.

+7

For parsing wiki markup in Python, there is mwlib:

http://code.pediapress.com/wiki/wiki/mwlib

+1

A regular expression along these lines:

r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"

(Ick!)

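Group 1 is the wiki page name. Group 4 is the display text, if there is one.

Explanation:

  • (([^\]|]|\](?=[^\]]))*) matches the page name, up to a "|" or "]]". Each character it accepts is either anything other than "|" and "]", or a "]" followed by a character other than "]".
  • (\|(([^\]]|\](?=[^\]]))*))? optionally matches a "|" followed by the display text, which is built the same way except that "|" is allowed inside it.
  • All of this is wrapped in \[\[ ... \]\].
  • The (?=...) notation is a lookahead: it matches the regular expression but does not consume its characters, so they can be matched later. I use it so as not to consume a "|" character that may appear immediately after a "]".

Edit: I fixed the regex to allow "]" just before "|", as in [[abcd]|efgh]].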
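
Because the display text lands in group 4 only when a "|" is present, a fixed replacement string cannot pick the right group on its own. One option is to pass a function to re.sub that falls back to group 1. This is a minimal sketch; the names pick_display_text and strip_wikilinks are hypothetical.

```python
import re

# The pattern from this answer, unchanged.
WIKILINK_RX = re.compile(
    r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"
)


def pick_display_text(match):
    """Return group 4 (display text) if the link had a pipe, else group 1."""
    display = match.group(4)
    return display if display is not None else match.group(1)


def strip_wikilinks(text):
    """Replace every wiki link in text with the text a reader would see."""
    return WIKILINK_RX.sub(pick_display_text, text)


print(strip_wikilinks("See [[Page Name]] and [[abcd]|efgh]]."))
```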

0

This should work:

import re
text = "The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally."
newText = re.sub(r'\[\[([^\|\]]+\|)?([^\]]+)\]\]',r'\2',text)
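
The question also allows for two substitutions in sequence. A hedged sketch of that variant uses one pass for piped links and one for plain links. The pattern and function names here are illustrative only.

```python
import re

# Pass 1: [[Page title|Text to display]] -> Text to display
PIPED_LINK_RX = re.compile(r'\[\[[^|\]]+\|([^\]]+)\]\]')
# Pass 2: [[Page Name]] -> Page Name
PLAIN_LINK_RX = re.compile(r'\[\[([^|\]]+)\]\]')


def strip_wikilinks_two_pass(text):
    """Apply the piped-link substitution, then the plain-link substitution."""
    text = PIPED_LINK_RX.sub(r'\1', text)
    return PLAIN_LINK_RX.sub(r'\1', text)


print(strip_wikilinks_two_pass("[[cover version]]s by [[record producer|producers]]"))
```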
0

Source: https://habr.com/ru/post/1790606/

