How to find Markdown links using regular expressions?

There are two ways to put a link in Markdown: type the raw link, for example http://example.com , or use the ()[] syntax: (Stack Overflow)[ http://example.com ] .

I am trying to write a regular expression that matches both forms and, for the second form, also captures the display text.

So far I have this:

 (?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\]) 

But this does not seem to match either of my two test cases in Debuggex:

 http://example.com
 (Example)[http://example.com]

Actually, I'm not sure why even the first one doesn't match. Is it related to my use of a named group? If possible, I would like to keep using one, because this is a simplified version of the expression that matches a link, and the real one is too long for me to be comfortable duplicating it in two places in the same pattern.

What am I doing wrong? Or is this not feasible at all?

EDIT: I'm doing this in Python, so I will be using its regex engine.

1 answer

The reason your pattern doesn't work is this part: (?<=\((.*)\)\[) , because Python's re module doesn't allow variable-length lookbehinds.

You can get what you want more conveniently with the newer regex module for Python (the re module has few features by comparison).
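The lookbehind restriction can be checked directly with the standard library. The following is a minimal sketch; the helper name `compiles` is hypothetical and exists only for this illustration.

```python
import re


def compiles(pattern):
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The lookbehind from the question: (.*) can match any number of
# characters, so the lookbehind has no fixed width and re rejects it.
variable_width = r"(?<=\((.*)\)\[)\S+"

# A lookbehind over a single literal character has a fixed width.
fixed_width = r"(?<=\[)\S+"

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True
```

Note also that in re, (?P=href) is a backreference to the text the group actually matched, not a reuse of its sub-pattern, so it cannot succeed in an alternative where the href group has not taken part in the match.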

Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])

In more detail:

 (?|                                   # open a branch reset group
     # first case: there is only the url
     (?<txt>                           # in this case, the text and the url
         (?<url>                       # are the same
             (?:ht|f)tps?://\S+(?<=\P{P})
         )
     )
   |                                   # OR
     # the (text)[url] format
     \( ([^)]+) \)                     # this group will be named "txt" too
     \[ (\g<url>) \]                   # this one "url"
 )

This pattern uses the branch reset feature (?|...|...|...) , which lets capture groups keep the same names (or numbers) across the alternatives. In this pattern, since the group ?<txt> is opened first in the first alternative, the first group in the second alternative automatically gets the same name. The same goes for the group ?<url> .

\g<url> is a reference to the named subpattern ?<url> (like an alias, so you don't have to rewrite it in the second alternative).
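If you have to stay with the standard re module, a different technique gives the same "write the URL sub-pattern once" effect: keep the fragment in a Python string and interpolate it into both alternatives. This is a minimal sketch, not the pattern above. The `URL` fragment is a simplified assumption, and the names `LINK`, `bare_url` and `find_links` are hypothetical.

```python
import re

# Hypothetical, simplified URL fragment (an assumption, not the asker's
# real expression). It stops at whitespace and at a closing square bracket.
URL = r"(?:ht|f)tps?://[^\s\]]+"

# The fragment is interpolated twice, so it is written only once in the
# source. re does not allow two groups with the same name, hence the
# separate names url and bare_url.
LINK = re.compile(
    rf"\((?P<txt>[^)]+)\)\[\s*(?P<url>{URL})\s*\]"  # (text)[url] form
    rf"|(?P<bare_url>{URL})"                        # bare URL form
)


def find_links(text):
    """Return (display_text, url) pairs; a bare URL is its own display text."""
    links = []
    for m in LINK.finditer(text):
        url = m.group("url") or m.group("bare_url")
        txt = m.group("txt") or url
        links.append((txt, url))
    return links


print(find_links("http://example.com (Example)[http://example.com]"))
```

For the two test cases from the question this yields one pair per link, with the bare URL used as its own display text. Unlike the branch reset version, the caller has to merge the two URL groups by hand, and a bare URL followed directly by punctuation such as a full stop will include that character.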

(?<=\P{P}) checks that the last character of the URL is not a punctuation character (useful, for example, to exclude the closing square bracket). (I'm not sure about the syntax; it could be \P{Punct} .)
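The standard re module does not support property classes such as \P{P}, but the effect of that lookbehind on a greedy \S+ match can be approximated after the fact by trimming trailing punctuation. This is a rough sketch using unicodedata; the function name `strip_trailing_punctuation` is hypothetical.

```python
import unicodedata


def strip_trailing_punctuation(url):
    """Drop trailing characters whose Unicode category is punctuation (P*)."""
    end = len(url)
    while end > 0 and unicodedata.category(url[end - 1]).startswith("P"):
        end -= 1
    return url[:end]


print(strip_trailing_punctuation("http://example.com]"))  # http://example.com
print(strip_trailing_punctuation("http://example.com."))  # http://example.com
```

Be aware that a trailing slash also counts as punctuation, so a URL such as http://example.com/ loses its final character with this approach.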


Source: https://habr.com/ru/post/973294/