Python BeautifulSoup HTML parsing

Question

I'm trying to parse

<td height="16" class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td>

for the value 76561198134729239, but I can't figure out how to do this. What I tried:

import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find("td", 
{
    "class":"listtable_1",
    "target":"_blank"
})
print(element.text)

+4

python beautifulsoup request

nooby Jan 18 '17 at 13:36

4 answers

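"target": "_blank" is an attribute of the anchor tag a inside the td tag. It is not an attribute of the td tag itself, so a single find() on td that filters on both attributes matches nothing.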

You can get it like this:

from bs4 import BeautifulSoup

html="""
<td height="16" class="listtable_1">
    <a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">
        76561198134729239
    </a>
</td>"""

soup = BeautifulSoup(html, 'html.parser')

print(soup.find('td', {'class': "listtable_1"}).find('a', {"target":"_blank"}).text)

Output:

76561198134729239
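
To illustrate why the attempt in the question fails, here is a minimal sketch that runs both lookups against the markup from the question. The variable names (sample_html, missing, steam_id) are only illustrative, and the None checks are an added precaution rather than part of the original answer.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
sample_html = (
    '<td height="16" class="listtable_1">'
    '<a href="http://steamcommunity.com/profiles/76561198134729239" '
    'target="_blank">76561198134729239</a></td>'
)

soup = BeautifulSoup(sample_html, "html.parser")

# The td tag has no target attribute, so this combined filter matches nothing
# and find() returns None. Calling .text on None raises AttributeError.
missing = soup.find("td", {"class": "listtable_1", "target": "_blank"})
print(missing)

# Filter each tag on its own attributes and guard against a missing match.
td = soup.find("td", {"class": "listtable_1"})
link = td.find("a", {"target": "_blank"}) if td is not None else None
steam_id = link.get_text(strip=True) if link is not None else None
print(steam_id)
```

Checking for None before reading the text avoids the AttributeError when a page does not contain the expected cell.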

+3

MYGz Jan 18 '17 at 13:43

Instead of chaining find() calls as in MYGz's answer, you can use a CSS selector:

soup.select_one("td.listtable_1 a[target=_blank]").get_text()

Or, to get every matching element, use select():

for elm in soup.select("td.listtable_1 a[target=_blank]"):
    print(elm.get_text())
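
The two snippets above assume an existing soup object. A self-contained sketch follows, using a small made-up table (sample_html is an assumption, not the real page) and a quoted attribute value in the selector, which is equivalent to the unquoted form here.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table shaped like the markup in the question.
sample_html = """
<table>
  <tr><td class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td></tr>
  <tr><td class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198003749039" target="_blank">76561198003749039</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
selector = 'td.listtable_1 a[target="_blank"]'

# select_one() returns the first match or None.
first = soup.select_one(selector)
first_id = first.get_text(strip=True) if first is not None else None

# select() returns a list of all matches (empty if there are none).
all_ids = [a.get_text(strip=True) for a in soup.select(selector)]

print(first_id)
print(all_ids)
```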

+3

alecxe Jan 18 '17 at 13:51

"class": "listtable_1" belongs to the td tag and target="_blank" belongs to the a tag, so you should not use them together in one filter.

You can use the "Steam Community" cell as an anchor and read the number in the cell that follows it.

Or use the URL: it contains the number and is easy to find, so you can match the link and split its href on "/":

import re

# Match every link whose href mentions steamcommunity; the ID is the last path segment.
for a in soup.find_all('a', href=re.compile(r'steamcommunity')):
    num = a['href'].split('/')[-1]
    print(num)
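
As a runnable variant of the URL approach, the sketch below works on a made-up snippet (sample_html is an assumption) and adds two precautions that are not in the original answer: it strips a trailing slash before splitting, and it keeps only all-digit segments.

```python
import re
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical links: one plain, one with a trailing slash, one unrelated.
sample_html = """
<a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a>
<a href="http://steamcommunity.com/profiles/76561198003749039/" target="_blank">profile</a>
<a href="http://example.com/other">unrelated</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

ids = []
for a in soup.find_all("a", href=re.compile(r"steamcommunity\.com/profiles/")):
    path = urlparse(a["href"]).path           # e.g. "/profiles/76561198134729239"
    last = path.rstrip("/").split("/")[-1]    # last non-empty path segment
    if last.isdigit():                        # skip anything that is not a numeric ID
        ids.append(last)

print(ids)
```

Reading the ID from the href rather than from the link text also works when the visible text is something other than the number.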

The full code for the "Steam Community" approach:

import requests
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
for td in soup.find_all('td', string="Steam Community"):
    num = td.find_next_sibling('td').text
    print(num)

Output:

76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044
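
The sibling lookup relies on the page placing a "Steam Community" label cell directly before the value cell. A minimal offline sketch of that idea, assuming markup of that shape (sample_html is invented for illustration and may differ from the real page):

```python
from bs4 import BeautifulSoup

# Assumed layout: a label cell followed by a value cell in the same row.
sample_html = """
<table>
  <tr>
    <td class="listtable_1">Steam Community</td>
    <td class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td>
  </tr>
  <tr>
    <td class="listtable_1">Steam Community</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

ids = []
# string= matches td tags whose only content is exactly this text.
for label in soup.find_all("td", string="Steam Community"):
    value = label.find_next_sibling("td")   # skips whitespace text nodes
    if value is None:                       # label without a value cell
        continue
    ids.append(value.get_text(strip=True))

print(ids)
```

The second row has no value cell, so find_next_sibling() returns None and the row is skipped instead of raising an error.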

+2

宏杰李 Jan 18 '17 at 13:44

Martin evans · Accepted Answer · 2017-01-18T13:52:15+0000

There are many such entries in this HTML. To get all of them, you can use the following:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")

# Visit every td.listtable_1 cell and print the text of each link that opens in a new tab.
for td in soup.find_all("td", class_="listtable_1"):
    for a in td.find_all("a", href=True, target="_blank"):
        print(a.text)

This prints:

76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044
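
If the IDs are needed as data rather than printed, the same nested lookup can be wrapped in a function that takes the HTML as an argument, which also makes it testable without a network request. This is only a sketch: extract_ids and sample_html are hypothetical names, not part of the accepted answer.

```python
from bs4 import BeautifulSoup


def extract_ids(html_text):
    """Return the text of each a[target=_blank] link found inside td.listtable_1 cells."""
    soup = BeautifulSoup(html_text, "html.parser")
    ids = []
    for td in soup.find_all("td", class_="listtable_1"):
        for a in td.find_all("a", href=True, target="_blank"):
            ids.append(a.get_text(strip=True))
    return ids


# Hypothetical markup: one matching cell, one cell without a link,
# and one link inside a cell with a different class.
sample_html = """
<table>
  <tr><td class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td></tr>
  <tr><td class="listtable_1">no link here</td></tr>
  <tr><td class="other"><a href="http://example.com/" target="_blank">ignored</a></td></tr>
</table>
"""

print(extract_ids(sample_html))
```

With a live page, the response body from requests (r.content in the code above) can be passed to the function in place of sample_html.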
