Python re.findall() return value - whole string

Question

I am writing a scanner to extract certain parts of an HTML file, but I can't figure out how to use re.findall().

Here is an example: when I want to find the whole <div>...</div> part in a file, I can write something like this:

re.findall("<div>.*</div>", result_page)

If result_page is the string "<div> </div> <div> </div>", the result will be

['<div> </div> <div> </div>']

That is, the whole line as a single match. This is not what I want; I expect the two divs to be returned separately. What should I do?

alvinzoo Apr 26 '15 at 4:29

2 answers

vaultah · Answer 1 · 2015-04-26T04:31:55+0000

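The '*', '+' and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in a non-greedy or minimal fashion; as few characters as possible will be matched.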

Example:

In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
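
To make the difference concrete, here is a minimal sketch that runs both patterns on the string from the question. The multi-line input (multiline_page) is a hypothetical addition, not part of the question; it illustrates that '.' does not match a newline unless re.DOTALL is passed, which may matter for a real HTML file.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* runs to the last </div> in the string, so there is one match.
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? stops at the first </div> it can reach,
# so each div is returned as its own match.
lazy = re.findall(r"<div>.*?</div>", result_page)

# Hypothetical multi-line input: '.' skips newlines unless re.DOTALL is set.
multiline_page = "<div>\nfirst\n</div>\n<div>\nsecond\n</div>"
without_dotall = re.findall(r"<div>.*?</div>", multiline_page)
with_dotall = re.findall(r"<div>.*?</div>", multiline_page, flags=re.DOTALL)

print(greedy)
print(lazy)
print(without_dotall)
print(with_dotall)
```

If the divs in the file span several lines, the pattern without re.DOTALL finds nothing, so the flag is probably needed in practice.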

However, you should not use regex to parse HTML; use an HTML parser instead. For example, with BeautifulSoup 4:

In [7]: import bs4

In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, 'html.parser')('div')]
Out[8]: ['<div> </div>', '<div> </div>']
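
As a rough illustration of why a parser is the safer choice, the sketch below runs the non-greedy pattern on two hypothetical inputs that are not in the question: a div nested inside another div, and a div whose opening tag carries an attribute. Both are common in real pages, and the pattern handles neither.

```python
import re

pattern = r"<div>.*?</div>"

# Hypothetical input with one div nested inside another.
nested_page = "<div><div>inner</div></div>"
# The match ends at the first </div>, so the outer div is cut short.
nested_matches = re.findall(pattern, nested_page)

# Hypothetical input where the opening tag carries an attribute.
attr_page = '<div class="note">text</div>'
# The literal "<div>" never appears, so nothing matches.
attr_matches = re.findall(pattern, attr_page)

print(nested_matches)
print(attr_matches)
```

The pattern could be patched for each case, but an HTML parser tracks tag nesting and attributes for you.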

hwnd · Answer 2 · 2015-04-26T04:32:04+0000

* is greedy; use *? for a non-greedy match.

re.findall("<div>.*?</div>", result_page)

Or better, use a parser such as BeautifulSoup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
soup.find_all('div')