BeautifulSoup - How to get all the text between two different tags?
I would like to get all the text between two tags:
    <div class="lead">I DONT WANT this</div>
    #many different tags - p, table, h2 including text that I want
    <div class="image">...</div>

I started this way:
    import urllib.request
    from bs4 import BeautifulSoup

    url = "http://......."
    req = urllib.request.Request(url)
    source = urllib.request.urlopen(req)
    soup = BeautifulSoup(source, 'lxml')
    start = soup.find('div', {'class': 'lead'})
    end = soup.find('div', {'class': 'image'})

And I have no idea what to do next.
Try using the following code:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("""
    <html><div class="lead">lead</div>data<div class="end"></div></html>
    """, "lxml")

    node = soup.find('div', {'class': 'lead'})
    s = []
    while True:
        if node is None:
            break
        node = node.next_sibling
        # Stop at the closing marker; .get() avoids a KeyError on tags without a class.
        if hasattr(node, "attrs") and "end" in node.attrs.get('class', []):
            break
        if node is not None:
            s.append(node)
    print(s)

This uses next_sibling to step through the sibling nodes one at a time.
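Since the question asks for the text rather than the nodes, here is a minimal sketch of the same sibling walk that returns only text. The sample markup, the function name text_between, and the choice of the built-in "html.parser" are assumptions made for illustration, and it only works when the two div elements share the same parent.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample page standing in for the real URL in the question.
SAMPLE_HTML = """
<html><body>
<div class="lead">I DONT WANT this</div>
<p>First paragraph</p>
<h2>A heading</h2>
<table><tr><td>cell</td></tr></table>
<div class="image">...</div>
<p>after</p>
</body></html>
"""


def text_between(start, end):
    """Return the text of each sibling found after `start` and before `end`."""
    parts = []
    for node in start.next_siblings:
        # Compare by identity: Tag equality compares content, not position.
        if node is end:
            break
        if isinstance(node, Tag):
            text = node.get_text(" ", strip=True)
        else:
            # Bare text node sitting directly between the tags.
            text = str(node).strip()
        if text:
            parts.append(text)
    return parts


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
start = soup.find("div", {"class": "lead"})
end = soup.find("div", {"class": "image"})
result = text_between(start, end)
print(result)
```

Whitespace-only text nodes between the tags are dropped, so the list contains one entry per sibling that actually carries text.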
Try this code. It starts at the div with class lead, walks through its following siblings, and stops when it reaches the div with class image, printing the markup of every node in between. It can be adapted to print something else:

    html = ""
    for tag in soup.find("div", {"class": "lead"}).next_siblings:
        # Stop at the <div class="image"> marker; text nodes have no tag name.
        if getattr(tag, "name", None) == "div" and "image" in tag.get("class", []):
            break
        html += str(tag)
    print(html)
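Both answers assume the two div elements are siblings. If they are nested at different depths on the real page, a possible variant is to walk next_elements instead and keep only the text nodes. This is a sketch under that assumption: the sample markup and the name strings_between are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup where the end marker is nested inside another tag.
SAMPLE_HTML = """
<html><body>
<div class="lead">I DONT WANT this</div>
<section><p>Nested paragraph</p></section>
<h2>A heading</h2>
<footer><div class="image">...</div></footer>
</body></html>
"""


def strings_between(start, end):
    """Collect text nodes in document order after `start`, stopping at `end`."""
    parts = []
    for node in start.next_elements:
        if node is end:
            break
        # Keep only real text nodes; tags are visited through their children.
        if not isinstance(node, NavigableString) or isinstance(node, Comment):
            continue
        # next_elements begins inside `start`, so skip its own text.
        if any(parent is start for parent in node.parents):
            continue
        text = node.strip()
        if text:
            parts.append(text)
    return parts


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
start = soup.find("div", {"class": "lead"})
end = soup.find("div", {"class": "image"})
found = strings_between(start, end)
print(found)
```

Because each text node is visited exactly once, nested tags do not produce duplicated text the way calling get_text() on every element would.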