How to get the value between two different tags using Beautiful Soup?

Question

I need to extract the data present between the </b> end tag and the <br> tag
in the code snippet below:

<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>

What I need: W, 65, 3

But the problem is that these values can also be empty, for example:

 <td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>

I want to get these values even when they are empty strings.

I tried using nextSibling and find_next('br'), but it returned

  <br><b>Second Type :</b><br><b>Third Type :</b></br></br>

and

 <br><b>Third Type :</b></br>

when the values (W, 65, 3) are missing between the </b> and <br> tags.

All I need is for it to return an empty string if there is nothing between these tags.

+6

python html-parsing beautifulsoup

utkarsh awasthi Mar 2 '17 at 11:30


4 answers

I would look for the td elements and then use a regex pattern to filter out the data you need, instead of using re.compile in the find_all method.

Like this:

    import re
    from bs4 import BeautifulSoup

    example = """<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
    <td><b>First Type :</b><br><b>Second Type :</b>69<br><b>Third Type :</b>6</td>"""

    soup = BeautifulSoup(example, "html.parser")

    for o in soup.find_all('td'):
        # Capture the text after each </b>, up to the next <br>, </br> or </td>
        match = re.findall(r'</b>\s*(.*?)\s*(<br|</br|</td)', str(o))
        print("%s,%s,%s" % (match[0][0], match[1][0], match[2][0]))

This pattern finds all the text between a </b> tag and the following <br> or </br> tag (or the closing </td>, for the last value). Depending on the parser, </br> tags may be added when the soup object is converted to a string.

This example displays:

    W,65,3
    ,69,6

This is just an example; you can change it to return an empty string when one of the regular expression matches is empty.
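As a minimal sketch of the same regex idea, the helper below collects the values of every td into a list, keeping empty strings for missing values. The name extract_values and the simplified pattern (capture everything up to the next tag) are assumptions made for illustration, not part of the answer above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pattern: capture the text that follows </b>, up to the next tag.
VALUE_AFTER_B = re.compile(r"</b>([^<]*)")


def extract_values(html):
    """Return one list of values per <td>; missing values become ''."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for td in soup.find_all("td"):
        # str(td) serializes the tag back to markup so the regex can scan it.
        values = [value.strip() for value in VALUE_AFTER_B.findall(str(td))]
        rows.append(values)
    return rows


example = """<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
<td><b>First Type :</b><br><b>Second Type :</b>69<br><b>Third Type :</b>6</td>"""

rows = extract_values(example)
```

Because the pattern stops at the next "<" of any kind, it does not depend on whether the serialized markup contains <br/>, </br> or </td> after a value.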

+1

Zroq Mar 2 '17 at 12:14


    In [5]: [child for child in soup.td.children if isinstance(child, str)]
    Out[5]: ['W', '65', '3']

These strings and tags are children of td; you can access them using contents (a list) or children (a generator):

    In [4]: soup.td.contents
    Out[4]: [<b>First Type :</b>, 'W', <br/>, <b>Second Type :</b>, '65', <br/>, <b>Third Type :</b>, '3']

Then you can get the text by checking whether each child is an instance of str.
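Filtering the children by str alone drops the slots where a value is missing. A possible variation, sketched below, walks the children and starts a new (initially empty) value at every b tag. The function name values_per_label is hypothetical, and the sketch assumes a parser that treats <br> as a void element, as in the contents output shown above.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def values_per_label(td):
    """Return one value per <b> label inside td, using '' when none follows."""
    values = []
    for child in td.children:
        if isinstance(child, Tag) and child.name == "b":
            # A new label starts: reserve an empty slot for its value.
            values.append("")
        elif isinstance(child, NavigableString) and values:
            # Text after a label fills the most recent slot.
            values[-1] += child.strip()
    return values


full = BeautifulSoup(
    "<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>",
    "html.parser",
)
empty = BeautifulSoup(
    "<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>",
    "html.parser",
)

full_values = values_per_label(full.td)
empty_values = values_per_label(empty.td)
```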

+1

宏杰李 Mar 03 '17 at 1:32


I think this works:

    from bs4 import BeautifulSoup

    html = '''<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>'''
    soup = BeautifulSoup(html, 'lxml')
    td = soup.find('td')

    string = str(td)
    list_tags = string.split('</b>')
    list_needed = []
    for i in range(1, len(list_tags)):
        if list_tags[i][0] == '<':
            # A tag follows </b> immediately, so the value is empty
            list_needed.append('')
        else:
            # Keep everything up to the next tag, not just the first character
            list_needed.append(list_tags[i].split('<')[0])

    print(list_needed)
    # ['W', '65', '3']

Since the values you want always come right after the closing </b> tags, it is easy to catch them this way, with no need for the re module.

0

leite0407 Mar 2 '17 at 12:18


DMPierre · Accepted Answer · 2017-03-02T12:59:30+0000

I would use the <b> tags and check what kind of information their next_sibling contains.

I would just check whether their next_sibling.string is not None, and append to a list accordingly :)

    >>> from bs4 import BeautifulSoup
    >>> html = """<td><b>First Type :</b><br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"""
    >>> soup = BeautifulSoup(html, "html.parser")
    >>> b = soup.find_all("b")
    >>> data = []
    >>> for tag in b:
    ...     sibling = tag.next_sibling
    ...     # No sibling at all, or a tag such as <br> without text: value is missing
    ...     if sibling is None or sibling.string is None:
    ...         data.append(" ")
    ...     else:
    ...         data.append(sibling.string)
    ...
    >>> data
    [' ', u'65', u'3']  # the first value was removed from the HTML

Hope this helps!
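For reference, here is a minimal sketch of the next_sibling approach that returns an empty string, as the question asks, and pairs each value with its label. The function name labelled_values and the dictionary layout are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def labelled_values(html):
    """Map each <b> label to the text that follows it, or '' if none does."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for b in soup.find_all("b"):
        # "First Type :" -> "First Type"
        label = b.get_text().rstrip(" :")
        sibling = b.next_sibling
        # Only a text node counts as a value; a tag or None means it is missing.
        if isinstance(sibling, NavigableString):
            result[label] = str(sibling).strip()
        else:
            result[label] = ""
    return result


partial = labelled_values(
    "<td><b>First Type :</b><br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"
)
blank = labelled_values(
    "<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>"
)
```

Checking the sibling's type rather than its .string attribute also covers the last b tag, whose next_sibling is None when nothing follows it.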
