How can I get the first and third td from a table using BeautifulSoup?

I am currently using Python and BeautifulSoup to scrape some website data. I am trying to pull cells from a table that is formatted like this:

<tr><td>1<td><td>20<td>5%</td></td></td></td></tr> 

The problem with the above HTML is that BeautifulSoup reads it as a single tag. I need to extract the values from the first <td> and third <td>, which will be 1 and 20 respectively.

Unfortunately, I have no idea how to do this. How can I get BeautifulSoup to read the 1st and 3rd <td> tags for each row of the table?

Update:

I figured out the problem: I was using html.parser instead of BeautifulSoup's default parser. As soon as I switched to the default, the problems disappeared. I also used the method given in the answer.

I also found out that different parsers are very temperamental with broken markup. For example, the default parser refused to read past row 192, but html5lib completed the task. Try lxml, html.parser, and html5lib if you are having trouble parsing the entire table.
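To see how the parsers disagree on this markup, something like the following sketch may help. It feeds the row from the question to each parser that happens to be installed and records the text of every td it finds. The helper name cell_texts_by_parser and the wrapping <table> element are assumptions made for this illustration, and the output for lxml and html5lib is deliberately not shown because it depends on what is installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The row from the question, wrapped in a <table> so that every parser
# sees it in table context (the wrapper is an assumption of this sketch).
ROW = "<table><tr><td>1<td><td>20<td>5%</td></td></td></td></tr></table>"


def cell_texts_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: [text of each td]} for every installed parser."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages; skip them
            # if they are missing.
            continue
        results[name] = [td.get_text() for td in soup.find_all("td")]
    return results


if __name__ == "__main__":
    for parser_name, texts in cell_texts_by_parser(ROW).items():
        print(parser_name, texts)
```

With html.parser the cells come back nested inside one another, so the first td's text also contains the text of every cell after it. That matches the "single tag" behaviour described in the question.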

1 answer

That is a nasty piece of HTML you have there. If we ignore the semantics of table rows and table cells for a moment and treat it as pure XML, its structure is as follows:

    <tr>
      <td>1
        <td>
          <td>20
            <td>5%</td>
          </td>
        </td>
      </td>
    </tr>

BeautifulSoup, however, knows about the semantics of HTML tables and instead parses it as follows:

    <tr>
      <td>1          <!-- an IMPLICITLY (no closing tag) closed td element -->
      <td>           <!-- as above -->
      <td>20         <!-- as above -->
      <td>5%</td>    <!-- an EXPLICITLY closed td element -->
      </td>          <!-- an error; ignore this -->
      </td>          <!-- as above -->
      </td>          <!-- as above -->
    </tr>

... so, as you say, 1 and 20 are in the first and third td elements (not tags) respectively.

In fact, you can get the contents of these td elements as follows:

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup("<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>")
    >>> tr = soup.find("tr")
    >>> tr
    <tr><td>1</td><td></td><td>20</td><td>5%</td></tr>
    >>> td_list = tr.find_all("td")
    >>> td_list
    [<td>1</td>, <td></td>, <td>20</td>, <td>5%</td>]
    >>> td_list[0]  # Python starts counting list items from 0, not 1
    <td>1</td>
    >>> td_list[0].text
    '1'
    >>> td_list[2].text
    '20'
    >>> td_list[3].text
    '5%'
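Since the question asks about every row of the table, the same idea can be wrapped in a loop. The sketch below is one possible way to do it, not the only one. The two-row sample table and the helper names own_text and first_and_third are made up for illustration. It reads only the strings that sit directly inside each td, so it should give the same answer whether the parser flattens the cells (as in the session above) or leaves them nested (as html.parser does).

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample: two rows in the same broken style as the question.
HTML = (
    "<table>"
    "<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>"
    "<tr><td>2<td><td>40<td>7%</td></td></td></td></tr>"
    "</table>"
)


def own_text(tag):
    """Join only the strings that are direct children of tag."""
    return "".join(
        child for child in tag.children if isinstance(child, NavigableString)
    ).strip()


def first_and_third(html, parser="html.parser"):
    """Return a (first cell, third cell) pair for each table row."""
    soup = BeautifulSoup(html, parser)
    pairs = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")  # document order, nested or not
        if len(cells) >= 3:        # skip rows that are too short
            pairs.append((own_text(cells[0]), own_text(cells[2])))
    return pairs


if __name__ == "__main__":
    print(first_and_third(HTML))  # [('1', '20'), ('2', '40')]
```

Passing parser="lxml" or parser="html5lib" instead is worth trying on a real page, as the update to the question suggests, provided those packages are installed.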

Source: https://habr.com/ru/post/1496986/