How to get the value between two different tags using Beautiful Soup?

I need to extract the data between the </b> end tag and the <br> tag
in the code snippet below:

<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td> 

What I need: W, 65, 3

But the problem is that these values can also be empty, for example:

 <td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td> 

I want to get these values even when they are empty strings.

I tried using nextSibling and find_next('br'), but it returned

  <br><b>Second Type :</b><br><b>Third Type :</b></br></br> 

and

 <br><b>Third Type :</b></br> 

when the values (W, 65, 3) are missing between the tags

 </b> and <br> 

All I need is that it should return an empty string if there is nothing between these tags.

4 answers

I would go through the <b> tags and look at what their next_sibling contains.

I would just check whether next_sibling.string is not None, and append to a list accordingly :)

    >>> from bs4 import BeautifulSoup
    >>> html = """<td><b>First Type :</b><br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"""
    >>> soup = BeautifulSoup(html, "html.parser")
    >>> b = soup.find_all("b")
    >>> data = []
    >>> for tag in b:
    ...     # A <br> sibling has no text, so its .string is None
    ...     if tag.next_sibling.string is None:
    ...         data.append(" ")
    ...     else:
    ...         data.append(tag.next_sibling.string)
    ...
    >>> data
    [' ', u'65', u'3']  # the first value was removed from the HTML above

Hope this helps!
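The same idea can be made a little more defensive. The sketch below is an assumption-laden variant, not the answer's own code: the helper name `values_after_bold` is made up for illustration, it returns "" rather than " " for a missing value, and it also covers the case where the last <b> has no sibling at all (next_sibling is None).

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def values_after_bold(html):
    """Return the text following each <b> tag, or "" when there is none."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for tag in soup.find_all("b"):
        sibling = tag.next_sibling
        # A text node directly after </b> is the value. A <br> tag or a
        # missing sibling (None) means the value is empty.
        if isinstance(sibling, NavigableString):
            values.append(str(sibling).strip())
        else:
            values.append("")
    return values


full = "<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"
empty = "<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>"

print(values_after_bold(full))   # ['W', '65', '3']
print(values_after_bold(empty))  # ['', '', '']
```

Checking the type of the sibling instead of calling .string on it avoids an AttributeError when there is nothing after the last </b>.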


I would find the td elements and then apply a regex pattern to each one to extract the data you need, instead of passing re.compile to the find_all method.

Like this:

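    import re
    from bs4 import BeautifulSoup

    example = """<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
    <td><b>First Type :</b><br><b>Second Type :</b>69<br><b>Third Type :</b>6</td>"""

    soup = BeautifulSoup(example, "html.parser")
    for o in soup.find_all('td'):
        # Capture the text between </b> and the next <br, </br or </td.
        # '</td' is included because the last value has no <br after it
        # when the parser serializes <br> as a self-closing <br/>.
        match = re.findall(r'</b>\s*(.*?)\s*(<br|</br|</td)', str(o))
        print("%s,%s,%s" % (match[0][0], match[1][0], match[2][0]))

This pattern finds all text between a </b> tag and the following <br> or </br>. Depending on the parser and version, </br> tags may be added when a soup object is converted to a string; the pattern also accepts </td> so that the last value is still caught when they are not.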

This example displays:

W,65,3

,69,6

This is just an example; you can adapt it, for instance to handle the case where one of the regular expression matches is empty.
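As a rough sketch of that adaptation, the pattern can also be run on the raw markup, so the result does not depend on how a parser serializes <br>. The names `VALUE_AFTER_B` and `values_from_html` are hypothetical, and the sketch assumes every value sits between a </b> and the next <br or </td, as in the question.

```python
import re

# Lazily capture whatever sits between </b> and the next <br or </td.
# The lookahead leaves the delimiter unconsumed, so findall() returns
# plain strings, including "" when nothing is there.
VALUE_AFTER_B = re.compile(r"</b>\s*(.*?)\s*(?=<br|</td)", re.IGNORECASE | re.DOTALL)


def values_from_html(fragment):
    """Return one string per </b> in the fragment, "" for empty values."""
    return VALUE_AFTER_B.findall(fragment)


full = "<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"
empty = "<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>"

print(values_from_html(full))   # ['W', '65', '3']
print(values_from_html(empty))  # ['', '', '']
```

Regular expressions on HTML are brittle in general, so this is only reasonable for small, predictable fragments like the one in the question.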

    In [5]: [child for child in soup.td.children if isinstance(child, str)]
    Out[5]: ['W', '65', '3']

These text nodes and tags are all children of td. You can access them using contents (a list) or children (a generator).

    In [4]: soup.td.contents
    Out[4]: [<b>First Type :</b>, 'W', <br/>, <b>Second Type :</b>, '65', <br/>, <b>Third Type :</b>, '3']

Then you can get the text by checking whether each child is an instance of str.
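Note that filtering on str alone produces no entry for an empty value, because there is no text node to find. A possible way around this, sketched below, is to walk the children in order and start each label with an empty default. The function name `labelled_values` and the dict result are illustrative choices, and the sketch assumes the parser treats <br> as a void element, so that all nodes are direct children of td.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def labelled_values(td):
    """Map each <b> label inside td to the text that follows it ("" if none)."""
    result = {}
    label = None
    for child in td.children:
        if isinstance(child, Tag) and child.name == "b":
            # Start a new entry; "" stays in place if no text follows.
            label = child.get_text().rstrip(" :")
            result[label] = ""
        elif isinstance(child, NavigableString) and label is not None:
            result[label] += str(child).strip()
        # <br> tags are skipped: they only separate the entries.
    return result


html = "<td><b>First Type :</b><br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"
soup = BeautifulSoup(html, "html.parser")
print(labelled_values(soup.td))
# {'First Type': '', 'Second Type': '65', 'Third Type': '3'}
```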


I think this works:

    from bs4 import BeautifulSoup

    html = '''<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>'''
    soup = BeautifulSoup(html, 'lxml')
    td = soup.find('td')
    string = str(td)
    # Everything after each '</b>' starts with the value we want
    list_tags = string.split('</b>')
    list_needed = []
    for i in range(1, len(list_tags)):
        if list_tags[i][0] == '<':
            # A tag follows immediately, so the value is empty
            list_needed.append('')
        else:
            # Take all text up to the next tag, not just the first character
            list_needed.append(list_tags[i].split('<', 1)[0])
    print(list_needed)
    # ['W', '65', '3']

Since the values you want always come right after the closing </b> tags, it's easy to catch them this way; there is no need to use re.


Source: https://habr.com/ru/post/1015400/

