Python re.findall() return value: the whole string

I am writing a scanner to extract certain parts of an HTML file, but I can't figure out how to use re.findall().

Here is an example: when I want to find the whole <div> ... </div> part in a file, I write something like this:

re.findall("<div>.*\</div>", result_page)

If result_page is the string "<div> </div> <div> </div>", the result is

['<div> </div> <div> </div>']

That is the whole string as a single match. This is not what I want; I expect the two divs to be returned separately. What should I do?

+4
2 answers

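Quoting the documentation:

The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding '?' after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

Just add the question mark: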

In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
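
To make the difference concrete, here is a minimal side-by-side sketch using the string from the question. The variable names greedy and lazy are only illustrative, and raw strings are used so that no stray backslash escapes are needed.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* runs on to the last </div> in the string, so there is one match.
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? stops at the first </div> it can reach,
# so each div is matched separately.
lazy = re.findall(r"<div>.*?</div>", result_page)

print(greedy)  # ['<div> </div> <div> </div>']
print(lazy)    # ['<div> </div>', '<div> </div>']
```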

Also, you shouldn't use RegEx to parse HTML, since there are HTML parsers made for exactly that. Example using BeautifulSoup 4:

In [7]: import bs4

In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, 'html.parser')('div')]
Out[8]: ['<div> </div>', '<div> </div>']
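
One case where the regex approach breaks down is nested divs. The sketch below is not BeautifulSoup; it uses the standard library's html.parser instead so that it runs without extra installs. The class name DivTextCollector and the sample string nested are made up for illustration. It collects the text of each top-level div and compares that with what the non-greedy pattern returns.

```python
import re
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each top-level <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <div> elements currently open
        self.current = []   # text pieces of the top-level div being read
        self.divs = []      # finished text of every top-level div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost div just closed: store its text.
                self.divs.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


nested = "<div>a<div>b</div>c</div><div>d</div>"

# The non-greedy regex stops at the first </div>, cutting the outer div short.
regex_result = re.findall(r"<div>.*?</div>", nested)
print(regex_result)  # ['<div>a<div>b</div>', '<div>d</div>']

collector = DivTextCollector()
collector.feed(nested)
collector.close()
print(collector.divs)  # ['abc', 'd']
```

The parser tracks nesting depth, which a single regular expression of this kind cannot do.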
+6

* is greedy, while *? is non-greedy.

re.findall("<div>.*?</div>", result_page)
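
Since the input is an HTML file, divs will often span several lines, and by default '.' does not match a newline. A small sketch of the effect, assuming a made-up multi-line string page:

```python
import re

# Hypothetical multi-line input: the first div spans line breaks.
page = "<div>\nfirst\n</div> <div>second</div>"

# Without DOTALL, '.' cannot cross '\n', so the first div is missed.
single_line = re.findall(r"<div>.*?</div>", page)

# With re.DOTALL, '.' also matches newlines and both divs are found.
multi_line = re.findall(r"<div>.*?</div>", page, flags=re.DOTALL)

print(single_line)  # ['<div>second</div>']
print(multi_line)   # ['<div>\nfirst\n</div>', '<div>second</div>']
```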

Alternatively, use BeautifulSoup:

from bs4 import BeautifulSoup
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(result_page, 'html.parser')
soup.find_all('div')
+4

Source: https://habr.com/ru/post/1584698/

