Regular expression to extract text between two strings (which are variables)

I want to use regex to extract text that occurs between two strings. I know how to do this when the delimiters are the same every time (there are countless questions about that, such as "Matching regular expressions between two lines?"), but I want to do it with variables that change and that may themselves contain regex special characters. (I want any special character, such as *, to be treated as literal text.)

For example, if I had:

 text = "<b*>Test</b>"
 left_identifier = "<b*>"
 right_identifier = "</b>"

I need to generate a regular expression so that the code ends up running the following call:

 re.findall('<b\*>(.*)<\/b>',text) 

This is the part <b\*>(.*)<\/b> that I don’t know how to dynamically create.

+6
4 answers

You need to apply re.escape to the identifiers:

 >>> import re
 >>> regex = re.compile('{}(.*){}'.format(re.escape('<b*>'), re.escape('</b>')))
 >>> regex.findall('<b*>Text</b>')
 ['Text']
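The same idea can be wrapped in a small reusable function. This is only a sketch: the name find_between is hypothetical, and it mirrors the greedy (.*) pattern used above.

```python
import re

def find_between(text, left, right):
    """Return the substrings of text found between left and right (hypothetical helper)."""
    # re.escape makes characters such as * match literally
    # instead of acting as regex operators.
    pattern = re.compile(re.escape(left) + "(.*)" + re.escape(right))
    return pattern.findall(text)

print(find_between("<b*>Test</b>", "<b*>", "</b>"))  # ['Test']
```

Because the delimiters are passed in as plain strings, the caller never has to think about which characters are special in a regex.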
+4

You can do something like this:

 import re
 pattern_string = re.escape(left_identifier) + "(.*?)" + re.escape(right_identifier)
 pattern = re.compile(pattern_string)

The escape function automatically escapes special characters. For instance (on Python 3.7+; older versions also escape < and >, printing \<b\*\>):

 >>> import re
 >>> print(re.escape("<b*>"))
 <b\*>
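The (.*?) in the pattern above is non-greedy, which matters as soon as the delimiters occur more than once. A minimal sketch of the difference follows; the sample text with two tagged words is an assumption made up for illustration.

```python
import re

left, right = "<b*>", "</b>"
text = "<b*>one</b> and <b*>two</b>"  # made-up sample with two occurrences

greedy = re.escape(left) + "(.*)" + re.escape(right)
lazy = re.escape(left) + "(.*?)" + re.escape(right)

# Greedy: .* runs to the LAST right delimiter, swallowing everything in between.
print(re.findall(greedy, text))  # ['one</b> and <b*>two']
# Non-greedy: .*? stops at the FIRST right delimiter after each left one.
print(re.findall(lazy, text))    # ['one', 'two']
```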
+5

A regular expression starts its life as a string, so you can concatenate left_identifier, the pattern for the text in between, and right_identifier, then pass the result to re.compile.

Or:

 re.findall('{}(.*){}'.format(left_identifier, right_identifier), text) 

works too.

You need to escape the strings in the variables with re.escape if they contain regex metacharacters, unless you want the metacharacters to be interpreted as such:

 >>> import re
 >>> text = "<b*>Test</b>"
 >>> left_identifier = "<b*>"
 >>> right_identifier = "</b>"
 >>> s = '{}(.*?){}'.format(*map(re.escape, (left_identifier, right_identifier)))
 >>> s  # Python 3.7+; older versions also escape <, > and /
 '<b\\*>(.*?)</b>'
 >>> re.findall(s, text)
 ['Test']
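To see why the escaping matters for these particular identifiers, here is a small side-by-side comparison. It is only a sketch; the variable names raw and escaped are introduced here for illustration.

```python
import re

text = "<b*>Test</b>"
left_identifier = "<b*>"
right_identifier = "</b>"

# Unescaped: b* means "zero or more b", so the literal * in the text never matches.
raw = "{}(.*?){}".format(left_identifier, right_identifier)
# Escaped: * is matched as an ordinary character.
escaped = "{}(.*?){}".format(re.escape(left_identifier), re.escape(right_identifier))

print(re.findall(raw, text))      # []
print(re.findall(escaped, text))  # ['Test']
```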

str.partition(var), on the other hand, is an alternative way to do this without regex:

 >>> text.partition(left_identifier)[2].partition(right_identifier)[0]
 'Test'
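One caveat with the chained one-liner is that it gives no sign when a delimiter is missing: it returns '' if the left one is absent and the rest of the string if the right one is absent. A hedged sketch of a stricter variant follows; the function name between and the choice to return None are assumptions, not part of the answer.

```python
def between(text, left, right):
    """Return the text between the first left and the next right, or None (hypothetical helper)."""
    # str.partition returns (before, separator, after); the separator slot
    # is an empty string when the separator was not found.
    _, found_left, rest = text.partition(left)
    if not found_left:
        return None  # left delimiter is absent
    middle, found_right, _ = rest.partition(right)
    if not found_right:
        return None  # right delimiter is absent
    return middle

print(between("<b*>Test</b>", "<b*>", "</b>"))     # Test
print(between("no markers here", "<b*>", "</b>"))  # None
```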
+4

I know you asked for a regular expression solution, but I do wonder whether regex is the right tool here, considering that we all took the oath not to parse HTML with regex. When parsing HTML strings, I always recommend falling back on BeautifulSoup:

 >>> import bs4
 >>> bs4.BeautifulSoup('<b*>Text</b>').text
 u'Text'
0

Source: https://habr.com/ru/post/985201/

